espnet2.train.preprocessor.S2TCTCPreprocessor
class espnet2.train.preprocessor.S2TCTCPreprocessor(train: bool, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: str = 'text', text_prev_name: str = 'text_prev', text_ctc_name: str = 'text_ctc', fs: int = 16000, na_symbol: str = '<na>', speech_length: float = 30, speech_init_silence: float = 1.0, text_prev_apply_prob: float = 0.5, lang_apply_prob: float = 0.5, nolang_symbol: str = '<nolang>')
Bases: CommonPreprocessor
Preprocessor for OWSM-CTC.
This class preprocesses input data for OWSM-CTC, the CTC-based variant of the Open Whisper-style Speech Model (OWSM). It prepares speech and text data for training and inference, including augmentation and normalization. It can pad or trim audio signals to a fixed length and transform text inputs, for example by conditioning on previous text or substituting special tokens when text is unavailable.
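The padding and trimming step can be illustrated with a minimal sketch. It assumes the target length is derived as `speech_length * fs` and that short inputs are zero-padded at the end. `pad_or_trim` is a hypothetical helper, not part of ESPnet, and it uses plain Python lists where the real preprocessor works on NumPy arrays and may place padding differently.

```python
from typing import List, Sequence


def pad_or_trim(samples: Sequence[float], target_len: int) -> List[float]:
    """Return exactly target_len samples (hypothetical helper).

    Longer inputs lose their tail; shorter inputs are zero-padded at the end.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    clipped = list(samples[:target_len])
    # Zero-pad on the right when the input is shorter than the target.
    return clipped + [0.0] * (target_len - len(clipped))


fs = 16000            # default sampling rate from the signature
speech_length = 30    # default length in seconds from the signature
# Assumption: the target length in samples is seconds times sampling rate.
target_len = int(speech_length * fs)

padded = pad_or_trim([0.1, -0.2, 0.3], 5)   # shorter than target: padded
trimmed = pad_or_trim([1.0] * 10, 4)        # longer than target: trimmed
```

With the defaults shown in the signature, every utterance would be brought to the same fixed number of samples before batching.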
train
Whether to run in training mode.
- Type: bool
token_type
Type of tokenization to use.
- Type: Optional[str]
token_list
List of tokens.
- Type: Union[Path, str, Iterable[str]]
bpemodel
BPE model for tokenization.
- Type: Union[Path, str, Iterable[str]]
text_cleaner
Collection of text cleaning methods.
- Type: Collection[str]
g2p_type
Type of grapheme-to-phoneme conversion.
- Type: Optional[str]
unk_symbol
Symbol for unknown tokens.
- Type: str
space_symbol
Symbol representing space in text.
- Type: str
non_linguistic_symbols
Non-linguistic symbols.
- Type: Union[Path, str, Iterable[str]]
delimiter
Delimiter for tokenization.
- Type: str
rir_scp
Path to the RIR (Room Impulse Response) scp file.
- Type: str
rir_apply_prob
Probability of applying RIR.
- Type: float
noise_scp
Path to the noise scp file.
- Type: str
noise_apply_prob
Probability of applying noise.
- Type: float
noise_db_range
Range of noise levels in dB.
- Type: str
short_noise_thres
Threshold for short noise.
- Type: float
speech_volume_normalize
Volume normalization factor for speech.
- Type: float
speech_name
Key for speech data in the input dictionary.
- Type: str
text_name
Key for text data in the input dictionary.
- Type: str
text_prev_name
Key for previous text data in the input dictionary.
- Type: str
text_ctc_name
Key for CTC text data in the input dictionary.
- Type: str
fs
Sampling rate for the audio data.
- Type: int
na_symbol
Symbol indicating text is not available (e.g., for prev or ctc).
- Type: str
speech_length
Target length for speech data in samples.
- Type: int
speech_init_silence
Maximum silence to add before speech during data augmentation.
- Type: int
text_prev_apply_prob
Probability of using previous text for conditioning.
- Type: float
lang_apply_prob
Probability of using ground truth language instead of unknown.
- Type: float
nolang
Token ID for the no language symbol.
- Type: int
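The `noise_db_range` attribute is a string such as the default `'3_10'`. The sketch below shows one way such a value could be read, assuming it encodes a lower and an upper bound separated by an underscore, or a single value when both bounds coincide. `parse_db_range` is a hypothetical helper written for illustration, and the class's own parsing may differ in detail.

```python
from typing import Tuple


def parse_db_range(noise_db_range: str) -> Tuple[float, float]:
    """Split a 'low_high' string into two floats (hypothetical helper)."""
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        # A single value is treated as a degenerate range.
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"expected 'low_high', got {noise_db_range!r}")
    if low > high:
        raise ValueError("lower bound must not exceed upper bound")
    return low, high


low, high = parse_db_range("3_10")
```

Under this reading, a noise level for each augmented utterance would be drawn from between `low` and `high` dB.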
Parameters:
- train (bool) – Whether to run in training mode.
- token_type (Optional[str]) – Type of tokenization to use.
- token_list (Union[Path, str, Iterable[str]]) – List of tokens.
- bpemodel (Union[Path, str, Iterable[str]]) – BPE model for tokenization.
- text_cleaner (Collection[str]) – Collection of text cleaning methods.
- g2p_type (Optional[str]) – Type of grapheme-to-phoneme conversion.
- unk_symbol (str) – Symbol for unknown tokens.
- space_symbol (str) – Symbol representing space in text.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Non-linguistic symbols.
- delimiter (str) – Delimiter for tokenization.
- rir_scp (str) – Path to the RIR (Room Impulse Response) scp file.
- rir_apply_prob (float) – Probability of applying RIR.
- noise_scp (str) – Path to the noise scp file.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB.
- short_noise_thres (float) – Threshold for short noise.
- speech_volume_normalize (float) – Volume normalization factor for speech.
- speech_name (str) – Key for speech data in the input dictionary.
- text_name (str) – Key for text data in the input dictionary.
- text_prev_name (str) – Key for previous text data in the input dictionary.
- text_ctc_name (str) – Key for CTC text data in the input dictionary.
- fs (int) – Sampling rate for the audio data.
- na_symbol (str) – Symbol indicating text is not available (e.g., for prev or ctc).
- speech_length (float) – Target length for speech data in seconds.
- speech_init_silence (float) – Maximum silence to add before speech during data augmentation.
- text_prev_apply_prob (float) – Probability of using previous text for conditioning.
- lang_apply_prob (float) – Probability of using ground truth language instead of unknown.
- nolang_symbol (str) – Symbol for no language.
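The two probability parameters, `text_prev_apply_prob` and `lang_apply_prob`, describe random choices between a real value and a placeholder symbol. The following is a minimal sketch of that idea, not the class's actual logic: `choose_with_prob` is a hypothetical helper, and `<eng>` is an assumed language token used only for illustration.

```python
import random


def choose_with_prob(value: str, fallback: str, apply_prob: float,
                     rng: random.Random) -> str:
    """Return value with probability apply_prob, else fallback (hypothetical)."""
    if not 0.0 <= apply_prob <= 1.0:
        raise ValueError("apply_prob must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so 1.0 always keeps the value
    # and 0.0 always falls back.
    return value if rng.random() < apply_prob else fallback


rng = random.Random(0)  # seeded so the sketch is reproducible

# Keep the previous text half of the time, otherwise mark it unavailable.
text_prev = choose_with_prob("previous sentence", "<na>", 0.5, rng)

# Keep the ground-truth language half of the time, otherwise use <nolang>.
# "<eng>" is an assumed token name, not taken from the documentation.
lang = choose_with_prob("<eng>", "<nolang>", 0.5, rng)
```

Randomly dropping the previous text and the language label in this way would expose the model to both conditioned and unconditioned inputs during training.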
Examples
>>> preprocessor = S2TCTCPreprocessor(train=True)
>>> processed_data = preprocessor(uid="example_uid", data={"speech": speech_data, "text": "Hello World"})
>>> assert "speech" in processed_data
>>> assert "text" in processed_data