espnet2.train.preprocessor.CommonPreprocessor
class espnet2.train.preprocessor.CommonPreprocessor(train: bool, use_lang_prompt: bool = False, use_nlp_prompt: bool = False, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, force_single_channel: bool = False, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, aux_task_names: Collection[str] | None = None, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: str = 'text', fs: int = 0, nonsplit_symbol: Iterable[str] | None = None, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, min_sample_size: int = -1, audio_pad_value: float | int = 0.0, whisper_language: str | None = None, whisper_task: str | None = None)
Bases: AbsPreprocessor
Common preprocessor for handling speech and text data.
This class is responsible for processing speech and text data for various tasks in speech processing. It applies data augmentation, normalization, and handles tokenization for the text inputs.
train
Indicates whether the preprocessor is in training mode.
- Type: bool
speech_name
The key for accessing speech data in the input dict.
- Type: str
text_name
The key for accessing text data in the input dict.
- Type: str
speech_volume_normalize
Normalization factor for speech volume.
- Type: float
force_single_channel
If True, forces single-channel output.
- Type: bool
rir_apply_prob
Probability of applying Room Impulse Response (RIR).
- Type: float
noise_apply_prob
Probability of applying noise augmentation.
- Type: float
short_noise_thres
Threshold for short noise application.
- Type: float
aux_task_names
Names of auxiliary tasks for processing.
- Type: Collection[str]
rirs
List of paths to RIR files.
- Type: List[str]
noises
List of paths to noise files.
- Type: List[str]
data_aug
Object for applying data augmentation effects.
- Type: DataAugmentation
min_sample_size
Minimum sample size for padding speech data.
- Type: int
audio_pad_value
Value used for padding audio data.
- Type: Union[float, int]
tokenizer
Tokenizer object for processing text.
- Type: Tokenizer
token_id_converter
Converter for text tokens to IDs.
- Type: TokenIDConverter
text_cleaner
Object for cleaning text inputs.
- Type: TextCleaner
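How speech_volume_normalize, min_sample_size, and audio_pad_value might interact can be illustrated with a small pure-Python sketch. This is not the ESPnet implementation: the helper names normalize_volume and pad_to_min_size are hypothetical, and the behavior shown (scaling so the peak amplitude equals the target, then right-padding short inputs) is an assumption based on the attribute descriptions above.

```python
from typing import List, Optional


def normalize_volume(speech: List[float], target: Optional[float]) -> List[float]:
    """Scale samples so the peak absolute amplitude equals `target` (assumed behavior)."""
    if target is None:
        # No normalization requested: return an unchanged copy.
        return list(speech)
    peak = max((abs(x) for x in speech), default=0.0)
    if peak == 0.0:
        # Silent or empty input: nothing to scale, and this avoids dividing by zero.
        return list(speech)
    return [x * target / peak for x in speech]


def pad_to_min_size(
    speech: List[float], min_sample_size: int, pad_value: float = 0.0
) -> List[float]:
    """Right-pad `speech` with `pad_value` up to `min_sample_size` samples (assumed behavior)."""
    if min_sample_size <= 0 or len(speech) >= min_sample_size:
        # A non-positive minimum (such as the default -1) disables padding.
        return list(speech)
    return list(speech) + [pad_value] * (min_sample_size - len(speech))


speech = [0.1, -0.2, 0.4]
normalized = normalize_volume(speech, target=1.0)
padded = pad_to_min_size(normalized, min_sample_size=5, pad_value=0.0)
print(len(padded), padded)
```

With the default min_sample_size of -1, the padding helper in this sketch leaves the input untouched.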
- Parameters:
  - train (bool) – Whether the preprocessor runs in training mode.
  - use_lang_prompt (bool) – If True, use language prompts.
  - use_nlp_prompt (bool) – If True, use NLP prompts.
  - token_type (Optional[str]) – Type of tokenization to use.
  - token_list (Union[Path, str, Iterable[str]]) – List of tokens for the tokenizer.
  - bpemodel (Union[Path, str, Iterable[str]]) – Path to the BPE model used for tokenization.
  - text_cleaner (Collection[str]) – Collection of text cleaning rules.
  - g2p_type (Optional[str]) – Type of grapheme-to-phoneme conversion.
  - unk_symbol (str) – Symbol for unknown tokens.
  - space_symbol (str) – Symbol for spaces in tokenized text.
  - non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Symbols to ignore.
  - delimiter (Optional[str]) – Delimiter for tokenization.
  - force_single_channel (bool) – If True, force output to a single channel.
  - rir_scp (Optional[str]) – Path to the Room Impulse Response (RIR) SCP file.
  - rir_apply_prob (float) – Probability of applying RIR.
  - noise_scp (Optional[str]) – Path to the noise SCP file.
  - noise_apply_prob (float) – Probability of applying noise.
  - noise_db_range (str) – Range of noise levels in dB, given as a 'low_high' string such as '3_10'.
  - short_noise_thres (float) – Threshold for applying short noise.
  - aux_task_names (Collection[str]) – Names of auxiliary tasks.
  - speech_volume_normalize (float) – Volume normalization factor for speech.
  - speech_name (str) – Key for accessing speech data in the input dict.
  - text_name (str) – Key for accessing text data in the input dict.
  - fs (int) – Sampling rate of the audio data.
  - nonsplit_symbol (Iterable[str]) – Symbols that must not be split during tokenization.
  - data_aug_effects (List) – Effects for data augmentation.
  - data_aug_num (List[int]) – Number of augmentations to apply.
  - data_aug_prob (float) – Probability of applying data augmentation.
  - min_sample_size (int) – Minimum sample size for chunking.
  - audio_pad_value (Union[float, int]) – Value for padding audio.
  - whisper_language (Optional[str]) – Language for Whisper models.
  - whisper_task (Optional[str]) – Task for Whisper models.
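Because noise_db_range is passed as a string rather than a pair of numbers, it helps to see how a value such as the default '3_10' could be turned into numeric bounds. The following is a minimal sketch, assuming an underscore-separated "low_high" format in which a single number means a fixed level; parse_db_range is a hypothetical helper, not part of the ESPnet API.

```python
from typing import Tuple


def parse_db_range(noise_db_range: str) -> Tuple[float, float]:
    """Parse a 'low_high' string into (low, high) in dB (assumed format)."""
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        # A single value is treated as a fixed noise level.
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(
            f"Format error: '{noise_db_range}', expected e.g. '-3_4' for [-3 dB, 4 dB]"
        )
    return low, high


print(parse_db_range("3_10"))  # (3.0, 10.0)
```

Negative lower bounds such as '-3_4' also parse under this scheme, since the minus sign is not the separator.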
 
- Raises: ValueError – If token_type is specified but token_list is not provided.
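The dependency between token_type and token_list can be sketched as a small standalone check. The function check_token_args is hypothetical and only mirrors the rule stated above; the real constructor may perform further validation.

```python
from pathlib import Path
from typing import Iterable, Optional, Union

TokenList = Union[Path, str, Iterable[str]]


def check_token_args(token_type: Optional[str], token_list: Optional[TokenList]) -> bool:
    """Return True if text tokenization would be enabled, raising on an invalid combination."""
    if token_type is not None and token_list is None:
        # Tokenization was requested, but there is no vocabulary to map tokens to IDs.
        raise ValueError("token_list is required if token_type is not None")
    return token_type is not None


print(check_token_args(None, None))               # False: no text processing requested
print(check_token_args("char", ["a", "b", "c"]))  # True: tokenizer can be built
```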
Examples
Example of creating a CommonPreprocessor instance
import numpy as np
from espnet2.train.preprocessor import CommonPreprocessor

preprocessor = CommonPreprocessor(
    train=True, use_lang_prompt=True, token_type='word',
    token_list='path/to/token_list.txt', bpemodel='path/to/bpemodel',
    text_cleaner=['remove_punctuation'], g2p_type='g2p_model',
    speech_name='speech', text_name='text',
)
Example of processing data
processed_data = preprocessor(uid='sample_uid', data={
    'speech': np.array([0.1, 0.2, 0.3]), 'text': 'Hello world!'
})
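Example of probability-gated augmentation (illustrative sketch)
The rir_apply_prob and noise_apply_prob settings describe how often an augmentation is applied. The sketch below shows one way such gating could work; it assumes that augmentation is skipped entirely outside training mode, and should_apply is a hypothetical helper rather than an ESPnet function.

```python
import random


def should_apply(train: bool, apply_prob: float, rng: random.Random) -> bool:
    """Decide whether to apply an augmentation for one sample (assumed gating rule)."""
    if not train:
        # Assumption: no augmentation at validation or inference time.
        return False
    # rng.random() lies in [0.0, 1.0), so 1.0 always applies and 0.0 never does.
    return rng.random() < apply_prob


rng = random.Random(0)  # fixed seed so the run is reproducible
applied = sum(should_apply(True, 0.3, rng) for _ in range(1000))
print(f"augmentation applied to {applied} of 1000 samples")
```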
