espnet2.train.preprocessor.MutliTokenizerCommonPreprocessor
class espnet2.train.preprocessor.MutliTokenizerCommonPreprocessor(train: bool, token_type: List[str] = [None], token_list: List[Path | str | Iterable[str]] = [None], bpemodel: List[Path | str | Iterable[str]] = [None], text_cleaner: Collection[str] | None = None, g2p_type: List[str] | str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: List[str] = ['text'], tokenizer_encode_conf: List[Dict] = [{}, {}], fs: int = 0, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, whisper_language: List[str] | None = None, whisper_task: str | None = None)
Bases: CommonPreprocessor
Preprocessor that supports multiple tokenizers for text processing.
This class extends CommonPreprocessor to handle multiple tokenizers and their corresponding configurations, so that each configured text field can be tokenized with its own strategy based on the provided token types and token lists.
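Conceptually, the i-th entry of text_name is tokenized with the i-th tokenizer and converted to IDs with the i-th token ID converter. The sketch below illustrates that dispatch; the helper name _tokenize_fields is hypothetical and only mirrors the behavior, it is not the actual implementation:

import numpy as np

def _tokenize_fields(self, data: dict) -> dict:
    # Hypothetical sketch: the i-th text field is handled by the
    # i-th tokenizer/converter pair.
    for i, name in enumerate(self.text_name):
        if name in data:
            text = self.text_cleaner(data[name])           # apply cleaning rules
            tokens = self.tokenizer[i].text2tokens(text)   # tokenize with tokenizer i
            ids = self.token_id_converter[i].tokens2ids(tokens)
            data[name] = np.array(ids, dtype=np.int64)     # replace text with ID array
    return data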
train
Indicates whether the preprocessor is in training mode.
- Type: bool
num_tokenizer
Number of tokenizers to be used.
- Type: int
tokenizer
List of tokenizer instances.
- Type: List
token_id_converter
List of token ID converters.
- Type: List
text_cleaner
Instance of the text cleaner.
- Type: TextCleaner
text_name
List of text names corresponding to each tokenizer.
- Type: List[str]
Parameters:
- train (bool) – Whether to use in training mode.
- token_type (List[str]) – List of token types for each tokenizer.
- token_list (List[Union[Path, str, Iterable[str]]]) – List of token lists for each tokenizer.
- bpemodel (List[Union[Path, str, Iterable[str]]]) – List of BPE models for each tokenizer.
- text_cleaner (Collection[str], optional) – Cleaning rules for text.
- g2p_type (Union[List[str], str], optional) – Grapheme-to-phoneme conversion type.
- unk_symbol (str, optional) – Unknown symbol for tokenization.
- space_symbol (str, optional) – Space symbol for tokenization.
- non_linguistic_symbols (Union[Path, str, Iterable[str]], optional) – Symbols to treat as non-linguistic.
- delimiter (Optional[str], optional) – Delimiter for tokenization.
- rir_scp (Optional[str], optional) – Path to the RIR scp file.
- rir_apply_prob (float, optional) – Probability of applying RIR.
- noise_scp (Optional[str], optional) – Path to the noise scp file.
- noise_apply_prob (float, optional) – Probability of applying noise.
- noise_db_range (str, optional) – Range of noise levels in dB.
- short_noise_thres (float, optional) – Threshold for short noise.
- speech_volume_normalize (float, optional) – Normalization factor for speech volume.
- speech_name (str, optional) – Name of the speech input.
- text_name (List[str], optional) – List of text names for processing.
- tokenizer_encode_conf (List[Dict], optional) – Configuration for tokenizer encoding.
- fs (int, optional) – Sampling frequency.
- data_aug_effects (List, optional) – Data augmentation effects.
- data_aug_num (List[int], optional) – Number of data augmentation operations.
- data_aug_prob (float, optional) – Probability of applying data augmentation.
- whisper_language (List[str], optional) – List of languages for the Whisper tokenizer.
- whisper_task (Optional[str], optional) – Task type for the Whisper tokenizer.
Raises: ValueError – If the lengths of token_type, token_list, bpemodel, and text_name do not match.
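For example, per the Raises note above, a mismatched set of lists is rejected at construction time. This is a minimal sketch with placeholder paths:

>>> try:
...     MutliTokenizerCommonPreprocessor(
...         train=True,
...         token_type=["word", "bpe"],
...         token_list=["path/to/word_list"],  # only one entry for two token types
...         bpemodel=[None, "path/to/bpe_model"],
...         text_name=["text", "transcript"],
...     )
... except ValueError:
...     print("list lengths must match")
list lengths must match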
Examples
>>> preprocessor = MutliTokenizerCommonPreprocessor(
...     train=True,
...     token_type=["word", "bpe"],
...     token_list=["path/to/word_list", "path/to/bpe_list"],
...     bpemodel=[None, "path/to/bpe_model"],
...     text_name=["text", "transcript"]
... )
>>> data = {
... "text": "Hello world",
... "transcript": "Hello world transcript"
... }
>>> processed_data = preprocessor("uid_1", data)
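Assuming the referenced token list and BPE model files exist, each configured text field in the returned dictionary is replaced by a NumPy array of token IDs. A sketch of inspecting the result:

>>> import numpy as np
>>> isinstance(processed_data["text"], np.ndarray)
True
>>> processed_data["transcript"].dtype
dtype('int64')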