espnet2.train.preprocessor.CommonPreprocessor_multi

class espnet2.train.preprocessor.CommonPreprocessor_multi(train: bool, use_lang_prompt: bool = False, use_nlp_prompt: bool = False, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, aux_task_names: Collection[str] | None = None, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: List[str] = ['text'], fs: int = 0, speaker_change_symbol: Iterable[str] | None = None, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, whisper_language: str | None = None, whisper_task: str | None = None)

Bases: CommonPreprocessor

Common preprocessor for multi-input text and speech data.

This preprocessor handles the processing of both speech and text data for training models that require multiple text inputs, such as those used in multi-speaker scenarios. It includes functionality for text tokenization, noise addition, reverberation effects, and other augmentations as specified in the constructor.

train

Indicates whether the preprocessor is in training mode.

Type: bool

use_lang_prompt

Flag to use language prompts in processing.

Type: bool

use_nlp_prompt

Flag to use NLP prompts in processing.

Type: bool

token_type

Type of tokenization to be used.

Type: Optional[str]

token_list

Path or list of tokens.

Type: Union[Path, str, Iterable[str]]

bpemodel

BPE model path or list.

Type: Union[Path, str, Iterable[str]]

text_cleaner

Collection of text cleaning methods.

Type: Collection[str]

g2p_type

Type of G2P model to use.

Type: Optional[str]

unk_symbol

Symbol for unknown tokens.

Type: str

space_symbol

Symbol representing spaces.

Type: str

non_linguistic_symbols

Non-linguistic symbols.

Type: Union[Path, str, Iterable[str]]

delimiter

Delimiter for tokenization.

Type: Optional[str]

rir_scp

Path to RIR (Room Impulse Response) script.

Type: Optional[str]

rir_apply_prob

Probability of applying RIR effects.

Type: float

noise_scp

Path to noise script.

Type: Optional[str]

noise_apply_prob

Probability of applying noise.

Type: float

noise_db_range

Range of noise levels in dB.

Type: str

short_noise_thres

Threshold for short noise segments.

Type: float

aux_task_names

Names of auxiliary tasks.

Type: Collection[str]

speech_volume_normalize

Factor for normalizing speech volume.

Type: float

speech_name

Key for accessing speech data in input dictionary.

Type: str

text_name

List of keys for accessing text data.

Type: List[str]

fs

Sampling frequency of the audio data.

Type: int

speaker_change_symbol

Symbols indicating speaker changes.

Type: Iterable[str]

data_aug_effects

Effects to apply for data augmentation.

Type: List

data_aug_num

Number of augmentations to apply.

Type: List[int]

data_aug_prob

Probability of applying data augmentations.

Type: float

Parameters:
- train (bool) – Whether the preprocessor runs in training mode.
- use_lang_prompt (bool) – Flag to use language prompts in processing.
- use_nlp_prompt (bool) – Flag to use NLP prompts in processing.
- token_type (Optional[str]) – Type of tokenization to be used.
- token_list (Union[Path, str, Iterable[str]]) – Path or list of tokens.
- bpemodel (Union[Path, str, Iterable[str]]) – BPE model path or list.
- text_cleaner (Collection[str]) – Collection of text cleaning methods.
- g2p_type (Optional[str]) – Type of G2P model to use.
- unk_symbol (str) – Symbol for unknown tokens.
- space_symbol (str) – Symbol representing spaces.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Non-linguistic symbols.
- delimiter (Optional[str]) – Delimiter for tokenization.
- rir_scp (Optional[str]) – Path to RIR (Room Impulse Response) script.
- rir_apply_prob (float) – Probability of applying RIR effects.
- noise_scp (Optional[str]) – Path to noise script.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB.
- short_noise_thres (float) – Threshold for short noise segments.
- aux_task_names (Collection[str]) – Names of auxiliary tasks.
- speech_volume_normalize (float) – Factor for normalizing speech volume.
- speech_name (str) – Key for accessing speech data in input dictionary.
- text_name (List[str]) – List of keys for accessing text data.
- fs (int) – Sampling frequency of the audio data.
- speaker_change_symbol (Iterable[str]) – Symbols indicating speaker changes.
- data_aug_effects (List) – Effects to apply for data augmentation.
- data_aug_num (List[int]) – Number of augmentations to apply.
- data_aug_prob (float) – Probability of applying data augmentations.
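
The noise_db_range parameter is a string (default '3_10') rather than a pair of numbers. The following is a minimal sketch of how such a string could be turned into numeric bounds, assuming the underscore separates a lower and an upper value in dB and that a single value means both bounds are equal. The helper name parse_db_range is hypothetical and is not part of this class's documented API.

```python
from typing import Tuple


def parse_db_range(noise_db_range: str) -> Tuple[float, float]:
    """Hypothetical helper: split a 'low_high' string into two floats.

    Assumption: the underscore separates the lower and upper bound in dB.
    """
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        # A single value is treated as a fixed level (low == high).
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"Expected 'low_high', got {noise_db_range!r}")
    return low, high


print(parse_db_range("3_10"))  # (3.0, 10.0)
```

Under the same assumption, a negative lower bound such as "-3_4" would parse to (-3.0, 4.0), because only the underscore is used as the separator.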

Examples

>>> import numpy as np
>>> from espnet2.train.preprocessor import CommonPreprocessor_multi
>>> preprocessor = CommonPreprocessor_multi(
...     train=True,
...     token_type="word",
...     token_list=["<unk>", "<space>", "hello", "world"],
...     noise_apply_prob=0.5,
...     speech_name="audio",
...     text_name=["transcript", "summary"]
... )
>>> processed_data = preprocessor(uid="example_id", data={
...     "audio": np.random.rand(16000),  # 16000 samples (1 second at 16 kHz)
...     "transcript": "hello world",
...     "summary": "a brief summary"
... })

NOTE

Ensure that the input data dictionary contains the key given by speech_name and the keys listed in text_name.
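
One way to follow this note is to check the dictionary before passing it to the preprocessor. The sketch below is illustrative only: missing_keys is a hypothetical helper, not part of ESPnet, and it assumes that every name in speech_name and text_name is expected to appear as a key.

```python
from typing import Any, Dict, List


def missing_keys(
    data: Dict[str, Any], speech_name: str, text_name: List[str]
) -> List[str]:
    """Hypothetical helper: return the expected keys that are absent from data."""
    required = [speech_name, *text_name]
    return [key for key in required if key not in data]


# Same key names as in the example above; the values are placeholders.
data = {"audio": [0.0] * 16000, "transcript": "hello world"}
absent = missing_keys(data, speech_name="audio", text_name=["transcript", "summary"])
print(absent)  # ['summary']
```

An empty list means all expected keys are present. Returning the missing names, instead of raising immediately, leaves the caller free to decide whether an absent key is an error.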