espnet2.train.preprocessor.CommonPreprocessor



class espnet2.train.preprocessor.CommonPreprocessor(train: bool, use_lang_prompt: bool = False, use_nlp_prompt: bool = False, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, force_single_channel: bool = False, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, aux_task_names: Collection[str] | None = None, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: str = 'text', fs: int = 0, nonsplit_symbol: Iterable[str] | None = None, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, min_sample_size: int = -1, audio_pad_value: float | int = 0.0, whisper_language: str | None = None, whisper_task: str | None = None)

Bases: AbsPreprocessor

Common preprocessor for handling speech and text data.

This class processes speech and text data for a range of speech processing tasks. It applies data augmentation and volume normalization to the speech input and tokenizes the text input.

train

Indicates whether the preprocessor is in training mode.

Type: bool

speech_name

The key for accessing speech data in the input dict.

Type: str

text_name

The key for accessing text data in the input dict.

Type: str

speech_volume_normalize

Normalization factor for speech volume.

Type: float
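
As an illustration of what a volume normalization factor can mean, the sketch below rescales a waveform so that its largest absolute sample equals a target value. This is a minimal sketch under that assumption, not necessarily the exact code path inside CommonPreprocessor; the helper name `normalize_volume` is hypothetical.

```python
import numpy as np


def normalize_volume(speech: np.ndarray, target_peak: float) -> np.ndarray:
    """Scale a waveform so its largest absolute sample equals target_peak.

    Hypothetical helper for illustration only.
    """
    peak = np.max(np.abs(speech))
    if peak == 0:
        # An all-zero signal cannot be rescaled; return it unchanged.
        return speech
    return speech * target_peak / peak


speech = np.array([0.1, -0.2, 0.4])
normalized = normalize_volume(speech, target_peak=1.0)
```

With this interpretation, a factor of 1.0 brings the loudest sample to full scale while preserving the relative shape of the waveform.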

force_single_channel

If True, forces single-channel output.

Type: bool

rir_apply_prob

Probability of applying Room Impulse Response (RIR).

Type: float

noise_apply_prob

Probability of applying noise augmentation.

Type: float

short_noise_thres

Threshold for short noise application.

Type: float

aux_task_names

Names of auxiliary tasks for processing.

Type: Collection[str]

rirs

List of paths to RIR files.

Type: List[str]

noises

List of paths to noise files.

Type: List[str]

data_aug

Object for applying data augmentation effects.

Type: DataAugmentation

min_sample_size

Minimum sample size for padding speech data.

Type: int

audio_pad_value

Value used for padding audio data.

Type: Union[float, int]
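
The interaction between min_sample_size and audio_pad_value can be pictured as follows. This is a rough sketch that assumes the time axis comes first and that padding is appended at the end of the signal; the helper `pad_to_min_length` is hypothetical and is not part of ESPnet.

```python
import numpy as np


def pad_to_min_length(
    speech: np.ndarray, min_sample_size: int, pad_value: float = 0.0
) -> np.ndarray:
    """Append pad_value along axis 0 until speech has min_sample_size samples.

    Hypothetical helper for illustration only.
    """
    if min_sample_size <= 0 or speech.shape[0] >= min_sample_size:
        # Padding disabled (e.g. -1) or the signal is already long enough.
        return speech
    missing = min_sample_size - speech.shape[0]
    # Pad only the time axis; leave any channel axis untouched.
    pad_width = [(0, missing)] + [(0, 0)] * (speech.ndim - 1)
    return np.pad(speech, pad_width, mode="constant", constant_values=pad_value)


short = np.array([0.1, 0.2, 0.3])
padded = pad_to_min_length(short, min_sample_size=5, pad_value=0.0)
```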

tokenizer

Tokenizer object for processing text.

Type: Tokenizer

token_id_converter

Converter for text tokens to IDs.

Type: TokenIDConverter

text_cleaner

Object for cleaning text inputs.

Type: TextCleaner
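
The three text attributes above (text_cleaner, tokenizer, token_id_converter) suggest a clean, tokenize, convert pipeline. The sketch below mimics that flow with plain Python stand-ins. The functions `clean`, `tokenize`, and `tokens_to_ids` are hypothetical and much simpler than the real ESPnet classes; only the fallback to the unknown symbol mirrors the documented unk_symbol parameter.

```python
from typing import List


def clean(text: str) -> str:
    # Hypothetical cleaner: lowercase and drop punctuation.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())


def tokenize(text: str) -> List[str]:
    # Hypothetical word-level tokenizer: split on whitespace.
    return text.split()


def tokens_to_ids(
    tokens: List[str], token_list: List[str], unk_symbol: str = "<unk>"
) -> List[int]:
    # Map each token to its index in token_list; unseen tokens map to unk_symbol.
    token2id = {token: idx for idx, token in enumerate(token_list)}
    unk_id = token2id[unk_symbol]
    return [token2id.get(token, unk_id) for token in tokens]


token_list = ["<blank>", "<unk>", "hello", "world"]
tokens = tokenize(clean("Hello, brave world!"))
ids = tokens_to_ids(tokens, token_list)
```

Here "brave" is not in the token list, so it falls back to the index of `<unk>`.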

Parameters:
- train (bool) – Indicates whether to use in training mode.
- use_lang_prompt (bool) – If True, use language prompts.
- use_nlp_prompt (bool) – If True, use NLP prompts.
- token_type (Optional[str]) – Type of tokenization to use.
- token_list (Union[Path, str, Iterable[str]]) – List of tokens for the tokenizer.
- bpemodel (Union[Path, str, Iterable[str]]) – Path to BPE model for tokenization.
- text_cleaner (Collection[str]) – Collection of text cleaning rules.
- g2p_type (Optional[str]) – Type of grapheme-to-phoneme conversion.
- unk_symbol (str) – Symbol for unknown tokens.
- space_symbol (str) – Symbol for spaces in tokenized text.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Symbols to ignore.
- delimiter (Optional[str]) – Delimiter for tokenization.
- force_single_channel (bool) – If True, force output to single channel.
- rir_scp (Optional[str]) – Path to Room Impulse Response (RIR) SCP file.
- rir_apply_prob (float) – Probability of applying RIR.
- noise_scp (Optional[str]) – Path to noise SCP file.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB.
- short_noise_thres (float) – Threshold for applying short noise.
- aux_task_names (Collection[str]) – Names of auxiliary tasks.
- speech_volume_normalize (float) – Volume normalization factor for speech.
- fs (int) – Sampling rate for audio data.
- nonsplit_symbol (Iterable[str]) – Symbols for non-splitting in tokenization.
- data_aug_effects (List) – Effects for data augmentation.
- data_aug_num (List[int]) – Number of augmentations to apply.
- data_aug_prob (float) – Probability of applying data augmentation.
- min_sample_size (int) – Minimum sample size for chunking.
- audio_pad_value (Union[float, int]) – Value for padding audio.
- whisper_language (Optional[str]) – Language for Whisper models.
- whisper_task (Optional[str]) – Task for Whisper models.
Raises: ValueError – If token_type is specified but token_list is not provided.
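
The noise_db_range parameter is a string such as '3_10'. The sketch below shows one way such a string can be read as a lower and upper bound in dB, and how a value drawn from that range could set the level of added noise. It assumes the drawn value is treated as a signal-to-noise ratio; the helpers `parse_db_range` and `mix_at_snr` are hypothetical and are not ESPnet functions.

```python
import numpy as np


def parse_db_range(noise_db_range: str):
    """Parse 'low_high' (or a single value) into a (low, high) tuple of floats."""
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"Expected 'low_high', got {noise_db_range!r}")
    return low, high


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


rng = np.random.default_rng(0)
low, high = parse_db_range("3_10")
snr_db = rng.uniform(low, high)  # one draw per utterance in this sketch
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db)
```

Under this reading, the default '3_10' means that the added noise sits between 3 dB and 10 dB below the speech level.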

Examples

Example of creating a CommonPreprocessor instance:

    import numpy as np
    from espnet2.train.preprocessor import CommonPreprocessor

    preprocessor = CommonPreprocessor(
        train=True,
        use_lang_prompt=True,
        token_type='word',
        token_list='path/to/token_list.txt',
        bpemodel='path/to/bpemodel',
        text_cleaner=['remove_punctuation'],
        g2p_type='g2p_model',
        speech_name='speech',
        text_name='text',
    )

Example of processing data:

    # The dict keys must match speech_name and text_name.
    processed_data = preprocessor(
        uid='sample_uid',
        data={'speech': np.array([0.1, 0.2, 0.3]), 'text': 'Hello world!'},
    )