espnet2.train.preprocessor.MutliTokenizerCommonPreprocessor
class espnet2.train.preprocessor.MutliTokenizerCommonPreprocessor(train: bool, token_type: List[str] = [None], token_list: List[Path | str | Iterable[str]] = [None], bpemodel: List[Path | str | Iterable[str]] = [None], text_cleaner: Collection[str] | None = None, g2p_type: List[str] | str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: List[str] = ['text'], tokenizer_encode_conf: List[Dict] = [{}, {}], fs: int = 0, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, whisper_language: List[str] | None = None, whisper_task: str | None = None)
Bases: CommonPreprocessor
Preprocessor that supports multiple tokenizers for text processing.
This class extends CommonPreprocessor to handle multiple tokenizers and their corresponding configurations, so that each configured text field can be tokenized with its own strategy based on the provided token types and token lists.
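Conceptually, the i-th entry of text_name is tokenized with the i-th tokenizer and converted to IDs with the i-th token ID converter. The sketch below illustrates that dispatch; the helper name _tokenize_fields is hypothetical and only mirrors the behavior, it is not the actual implementation:

import numpy as np

def _tokenize_fields(self, data: dict) -> dict:
    # Hypothetical sketch: the i-th text field is handled by the
    # i-th tokenizer/converter pair.
    for i, name in enumerate(self.text_name):
        if name in data:
            text = self.text_cleaner(data[name])           # apply cleaning rules
            tokens = self.tokenizer[i].text2tokens(text)   # tokenize with tokenizer i
            ids = self.token_id_converter[i].tokens2ids(tokens)
            data[name] = np.array(ids, dtype=np.int64)     # replace text with ID array
    return data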
train
Indicates whether the preprocessor is in training mode.
- Type: bool
num_tokenizer
Number of tokenizers to be used.
- Type: int
tokenizer
List of tokenizer instances.
- Type: List
token_id_converter
List of token ID converters.
- Type: List
text_cleaner
Instance of the text cleaner.
- Type: TextCleaner
text_name
List of text names corresponding to each tokenizer.
- Type: List[str]
Parameters:
- train (bool) – Whether to use in training mode.
- token_type (List[str]) – List of token types for each tokenizer.
- token_list (List[Union[Path, str, Iterable[str]]]) – List of token lists for each tokenizer.
- bpemodel (List[Union[Path, str, Iterable[str]]]) – List of BPE models for each tokenizer.
- text_cleaner (Collection[str], optional) – Cleaning rules for text.
- g2p_type (Union[List[str], str], optional) – Grapheme-to-phoneme conversion type.
- unk_symbol (str, optional) – Unknown symbol for tokenization.
- space_symbol (str, optional) – Space symbol for tokenization.
- non_linguistic_symbols (Union[Path, str, Iterable[str]], optional) – Symbols to treat as non-linguistic.
- delimiter (Optional[str], optional) – Delimiter for tokenization.
- rir_scp (Optional[str], optional) – Path to the RIR scp file.
- rir_apply_prob (float, optional) – Probability of applying RIR.
- noise_scp (Optional[str], optional) – Path to the noise scp file.
- noise_apply_prob (float, optional) – Probability of applying noise.
- noise_db_range (str, optional) – Range of noise levels in dB.
- short_noise_thres (float, optional) – Threshold for short noise.
- speech_volume_normalize (float, optional) – Normalization factor for speech volume.
- speech_name (str, optional) – Name of the speech input.
- text_name (List[str], optional) – List of text names for processing.
- tokenizer_encode_conf (List[Dict], optional) – Configuration for tokenizer encoding.
- fs (int, optional) – Sampling frequency.
- data_aug_effects (List, optional) – Data augmentation effects.
- data_aug_num (List[int], optional) – Number of data augmentation operations.
- data_aug_prob (float, optional) – Probability of applying data augmentation.
- whisper_language (List[str], optional) – List of languages for the Whisper tokenizer.
- whisper_task (Optional[str], optional) – Task type for the Whisper tokenizer.
Raises: ValueError – If the lengths of token_type, token_list, bpemodel, and text_name do not match.
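For example, per the Raises note above, a mismatched set of lists is rejected at construction time. This is a minimal sketch with placeholder paths:

>>> try:
...     MutliTokenizerCommonPreprocessor(
...         train=True,
...         token_type=["word", "bpe"],
...         token_list=["path/to/word_list"],  # only one entry for two token types
...         bpemodel=[None, "path/to/bpe_model"],
...         text_name=["text", "transcript"],
...     )
... except ValueError:
...     print("list lengths must match")
list lengths must match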
Examples
>>> preprocessor = MutliTokenizerCommonPreprocessor(
...     train=True,
...     token_type=["word", "bpe"],
...     token_list=["path/to/word_list", "path/to/bpe_list"],
...     bpemodel=[None, "path/to/bpe_model"],
...     text_name=["text", "transcript"]
... )
>>> data = {
... "text": "Hello world",
... "transcript": "Hello world transcript"
... }
>>> processed_data = preprocessor("uid_1", data)
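Assuming the referenced token list and BPE model files exist, each configured text field in the returned dictionary is replaced by a NumPy array of token IDs. A sketch of inspecting the result:

>>> import numpy as np
>>> isinstance(processed_data["text"], np.ndarray)
True
>>> processed_data["transcript"].dtype
dtype('int64')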