espnet2.train.preprocessor.CommonPreprocessor_multi
class espnet2.train.preprocessor.CommonPreprocessor_multi(train: bool, use_lang_prompt: bool = False, use_nlp_prompt: bool = False, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, aux_task_names: Collection[str] | None = None, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: List[str] = ['text'], fs: int = 0, speaker_change_symbol: Iterable[str] | None = None, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, whisper_language: str | None = None, whisper_task: str | None = None)
Bases: CommonPreprocessor
Common preprocessor for multi-input text and speech data.
This preprocessor handles the processing of both speech and text data for training models that require multiple text inputs, such as those used in multi-speaker scenarios. It includes functionality for text tokenization, noise addition, reverberation effects, and other augmentations as specified in the constructor.
train
Indicates whether the preprocessor is in training mode.
- Type: bool
use_lang_prompt
Flag to use language prompts in processing.
- Type: bool
use_nlp_prompt
Flag to use NLP prompts in processing.
- Type: bool
token_type
Type of tokenization to be used.
- Type: Optional[str]
token_list
Path or list of tokens.
- Type: Union[Path, str, Iterable[str]]
bpemodel
BPE model path or list.
- Type: Union[Path, str, Iterable[str]]
text_cleaner
Collection of text cleaning methods.
- Type: Collection[str]
g2p_type
Type of G2P model to use.
- Type: Optional[str]
unk_symbol
Symbol for unknown tokens.
- Type: str
space_symbol
Symbol representing spaces.
- Type: str
non_linguistic_symbols
Non-linguistic symbols.
- Type: Union[Path, str, Iterable[str]]
delimiter
Delimiter for tokenization.
- Type: Optional[str]
rir_scp
Path to the scp file listing room impulse response (RIR) recordings.
- Type: Optional[str]
rir_apply_prob
Probability of applying RIR effects.
- Type: float
noise_scp
Path to the scp file listing noise recordings.
- Type: Optional[str]
noise_apply_prob
Probability of applying noise.
- Type: float
noise_db_range
Range of noise levels in dB, given as a string such as '3_10'.
- Type: str
short_noise_thres
Threshold for short noise segments.
- Type: float
aux_task_names
Names of auxiliary tasks.
- Type: Collection[str]
speech_volume_normalize
Factor for normalizing speech volume.
- Type: float
speech_name
Key for accessing speech data in input dictionary.
- Type: str
text_name
List of keys for accessing text data.
- Type: List[str]
fs
Sampling frequency of the audio data.
- Type: int
speaker_change_symbol
Symbols indicating speaker changes.
- Type: Iterable[str]
data_aug_effects
Effects to apply for data augmentation.
- Type: List
data_aug_num
Number of augmentations to apply.
- Type: List[int]
data_aug_prob
Probability of applying data augmentations.
- Type: float
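The probability attributes (rir_apply_prob, noise_apply_prob, data_aug_prob) can be read as a random gate in front of each effect. The sketch below illustrates that idea only; `maybe_apply` is a hypothetical helper, not an ESPnet function, and the rule that effects are skipped outside training mode is an assumption made for the example.

```python
import random
from typing import Callable, List, Optional

Signal = List[float]


def maybe_apply(
    signal: Signal,
    effect: Callable[[Signal], Signal],
    apply_prob: float,
    train: bool,
    rng: Optional[random.Random] = None,
) -> Signal:
    """Apply `effect` with probability `apply_prob` (hypothetical helper).

    Assumption: augmentation is only considered when `train` is True.
    """
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so apply_prob=1.0 always applies
    # and apply_prob=0.0 never does.
    if train and rng.random() < apply_prob:
        return effect(signal)
    return signal


def halve(xs: Signal) -> Signal:
    """Toy effect standing in for reverberation or noise mixing."""
    return [0.5 * x for x in xs]


augmented = maybe_apply([1.0, 2.0], halve, apply_prob=1.0, train=True)
untouched = maybe_apply([1.0, 2.0], halve, apply_prob=1.0, train=False)
```

With a probability of 1.0 the toy effect always runs in training mode, and it never runs when `train` is False.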
Parameters:
- train (bool) – Whether the preprocessor runs in training mode.
- use_lang_prompt (bool) – Flag to use language prompts in processing.
- use_nlp_prompt (bool) – Flag to use NLP prompts in processing.
- token_type (Optional[str]) – Type of tokenization to be used.
- token_list (Union[Path, str, Iterable[str]]) – Path or list of tokens.
- bpemodel (Union[Path, str, Iterable[str]]) – BPE model path or list.
- text_cleaner (Collection[str]) – Collection of text cleaning methods.
- g2p_type (Optional[str]) – Type of G2P model to use.
- unk_symbol (str) – Symbol for unknown tokens.
- space_symbol (str) – Symbol representing spaces.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Non-linguistic symbols.
- delimiter (Optional[str]) – Delimiter for tokenization.
- rir_scp (Optional[str]) – Path to the scp file listing room impulse response (RIR) recordings.
- rir_apply_prob (float) – Probability of applying RIR effects.
- noise_scp (Optional[str]) – Path to the scp file listing noise recordings.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB, given as a string such as '3_10'.
- short_noise_thres (float) – Threshold for short noise segments.
- aux_task_names (Collection[str]) – Names of auxiliary tasks.
- speech_volume_normalize (float) – Factor for normalizing speech volume.
- speech_name (str) – Key for accessing speech data in input dictionary.
- text_name (List[str]) – List of keys for accessing text data.
- fs (int) – Sampling frequency of the audio data.
- speaker_change_symbol (Iterable[str]) – Symbols indicating speaker changes.
- data_aug_effects (List) – Effects to apply for data augmentation.
- data_aug_num (List[int]) – Number of augmentations to apply.
- data_aug_prob (float) – Probability of applying data augmentations.
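Because noise_db_range is passed as a string (default '3_10') rather than as two numbers, a short sketch may help show how such a value can be turned into bounds. The helpers `parse_db_range` and `sample_db` are hypothetical and written for illustration; they assume the string holds one or two underscore-separated numbers read as lower and upper bounds in dB, and that a level is drawn uniformly between them. The actual preprocessor may differ in detail.

```python
import random
from typing import Optional, Tuple


def parse_db_range(db_range: str) -> Tuple[float, float]:
    """Split a string such as '3_10' into (low, high) bounds in dB.

    Hypothetical helper; assumes a 'low_high' or single-value format.
    """
    parts = db_range.split("_")
    if len(parts) == 1:
        # A single value means a fixed level.
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"expected 'low_high', got {db_range!r}")
    if low > high:
        raise ValueError(f"lower bound exceeds upper bound in {db_range!r}")
    return low, high


def sample_db(db_range: str, rng: Optional[random.Random] = None) -> float:
    """Draw one level in dB uniformly from the parsed range."""
    rng = rng or random.Random()
    low, high = parse_db_range(db_range)
    return rng.uniform(low, high)


bounds = parse_db_range("3_10")
level = sample_db("3_10", random.Random(0))
```

Passing a seeded `random.Random` keeps the draw reproducible, which is convenient when debugging an augmentation pipeline.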
Examples
>>> import numpy as np
>>> from espnet2.train.preprocessor import CommonPreprocessor_multi
>>> preprocessor = CommonPreprocessor_multi(
...     train=True,
...     token_type="word",
...     token_list=["<unk>", "<space>", "hello", "world"],
...     noise_apply_prob=0.5,
...     speech_name="audio",
...     text_name=["transcript", "summary"],
... )
>>> processed_data = preprocessor(uid="example_id", data={
...     "audio": np.random.rand(16000),  # 1 second of audio at 16 kHz
...     "transcript": "hello world",
...     "summary": "a brief summary",
... })
NOTE
Ensure that the input data dictionary contains the keys named by speech_name and text_name.
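One way to follow this note is to check the dictionary before calling the preprocessor. The sketch below uses a hypothetical helper, `missing_keys`, which is not part of ESPnet. How the preprocessor itself reacts to an absent key is not stated here, so checking up front avoids relying on that behavior.

```python
from typing import Any, Dict, Iterable, List


def missing_keys(
    data: Dict[str, Any],
    speech_name: str,
    text_names: Iterable[str],
) -> List[str]:
    """Return the expected keys that are absent from `data`.

    Hypothetical helper; `speech_name` and `text_names` mirror the
    constructor arguments of the same purpose.
    """
    required = [speech_name, *text_names]
    return [key for key in required if key not in data]


# The "summary" entry is deliberately left out of this sample.
sample = {"audio": [0.0] * 16000, "transcript": "hello world"}
absent = missing_keys(sample, "audio", ["transcript", "summary"])
print(absent)  # ['summary']
```

An empty list means every configured key is present and the dictionary can be passed on.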