espnet2.train.preprocessor.S2TCTCPreprocessor

class espnet2.train.preprocessor.S2TCTCPreprocessor(train: bool, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: str = 'text', text_prev_name: str = 'text_prev', text_ctc_name: str = 'text_ctc', fs: int = 16000, na_symbol: str = '<na>', speech_length: float = 30, speech_init_silence: float = 1.0, text_prev_apply_prob: float = 0.5, lang_apply_prob: float = 0.5, nolang_symbol: str = '<nolang>')

Bases: CommonPreprocessor

Preprocessor for OWSM-CTC.

This class preprocesses input data for OWSM-CTC, the Open Whisper-style Speech Model (OWSM) variant based on Connectionist Temporal Classification (CTC). It prepares speech and text data for training and inference, including augmentation and normalization. It pads or trims audio signals to a fixed length and transforms text inputs, for example by conditioning on previous text and handling special tokens.
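
The padding and trimming step mentioned above can be illustrated with a minimal sketch. This is not the class's own implementation: `pad_or_trim` is a hypothetical helper, plain Python lists stand in for waveform arrays, and zero-padding at the end is an assumption made for illustration.

```python
from typing import List


def pad_or_trim(samples: List[float], target_len: int) -> List[float]:
    """Return a copy of `samples` that is exactly `target_len` items long."""
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    if len(samples) >= target_len:
        # Long enough: keep only the first target_len samples.
        return list(samples[:target_len])
    # Too short: append zeros (silence) up to the target length.
    return list(samples) + [0.0] * (target_len - len(samples))


short = pad_or_trim([0.1, 0.2], 4)                  # padded with zeros
long_ = pad_or_trim([0.1, 0.2, 0.3, 0.4, 0.5], 4)   # trimmed to four samples
```

Fixing the length this way gives every utterance the same shape, which is what allows a batch to be processed without per-example length handling.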

train

Whether to use in training mode.

Type: bool

token_type

Type of tokenization to use.

Type: Optional[str]

token_list

List of tokens.

Type: Union[Path, str, Iterable[str]]

bpemodel

BPE model for tokenization.

Type: Union[Path, str, Iterable[str]]

text_cleaner

Collection of text cleaning methods.

Type: Collection[str]

g2p_type

Type of grapheme-to-phoneme conversion.

Type: Optional[str]

unk_symbol

Symbol for unknown tokens.

Type: str

space_symbol

Symbol representing space in text.

Type: str

non_linguistic_symbols

Non-linguistic symbols.

Type: Union[Path, str, Iterable[str]]

delimiter

Delimiter for tokenization.

Type: str

rir_scp

Path to the RIR (Room Impulse Response) scp file.

Type: str

rir_apply_prob

Probability of applying RIR.

Type: float

noise_scp

Path to the noise scp file.

Type: str

noise_apply_prob

Probability of applying noise.

Type: float

noise_db_range

Range of noise levels in dB.

Type: str

short_noise_thres

Threshold for short noise.

Type: float

speech_volume_normalize

Volume normalization factor for speech.

Type: float

speech_name

Key for speech data in the input dictionary.

Type: str

text_name

Key for text data in the input dictionary.

Type: str

text_prev_name

Key for previous text data in the input dictionary.

Type: str

text_ctc_name

Key for CTC text data in the input dictionary.

Type: str

fs

Sampling rate for the audio data, in Hz.

Type: int

na_symbol

Symbol indicating text is not available (e.g., for prev or ctc).

Type: str

speech_length

Target length for speech data in samples.

Type: int

speech_init_silence

Maximum silence, in samples, to add before speech during data augmentation.

Type: int

text_prev_apply_prob

Probability of using previous text for conditioning.

Type: float

lang_apply_prob

Probability of using ground truth language instead of unknown.

Type: float

nolang

Token ID for the no language symbol.

Type: int
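
The two probability attributes above, text_prev_apply_prob and lang_apply_prob, describe a random keep-or-mask decision. The sketch below shows one way such gating could work. It is an assumption-laden illustration, not the class's code: `choose_conditioning` is a hypothetical function, `<eng>` is a made-up language token, and the real preprocessor may operate on token IDs rather than strings.

```python
import random
from typing import Tuple


def choose_conditioning(
    lang_token: str,
    text_prev: str,
    rng: random.Random,
    lang_apply_prob: float = 0.5,
    text_prev_apply_prob: float = 0.5,
    nolang_symbol: str = "<nolang>",
    na_symbol: str = "<na>",
) -> Tuple[str, str]:
    """Randomly keep or mask the language token and the previous text."""
    # Keep the ground-truth language with probability lang_apply_prob.
    lang = lang_token if rng.random() < lang_apply_prob else nolang_symbol
    # Keep the previous text with probability text_prev_apply_prob.
    prev = text_prev if rng.random() < text_prev_apply_prob else na_symbol
    return lang, prev


rng = random.Random(0)
# Probability 1.0 always keeps the inputs; probability 0.0 always masks them.
always = choose_conditioning("<eng>", "previous sentence", rng, 1.0, 1.0)
never = choose_conditioning("<eng>", "previous sentence", rng, 0.0, 0.0)
```

Masking at random during training exposes a model to inputs both with and without the language label and the previous-text context.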

Parameters:
- train (bool) – Whether to use in training mode.
- token_type (Optional[str]) – Type of tokenization to use.
- token_list (Union[Path, str, Iterable[str]]) – List of tokens.
- bpemodel (Union[Path, str, Iterable[str]]) – BPE model for tokenization.
- text_cleaner (Collection[str]) – Collection of text cleaning methods.
- g2p_type (Optional[str]) – Type of grapheme-to-phoneme conversion.
- unk_symbol (str) – Symbol for unknown tokens.
- space_symbol (str) – Symbol representing space in text.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Non-linguistic symbols.
- delimiter (Optional[str]) – Delimiter for tokenization.
- rir_scp (Optional[str]) – Path to the RIR (Room Impulse Response) scp file.
- rir_apply_prob (float) – Probability of applying RIR.
- noise_scp (Optional[str]) – Path to the noise scp file.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB.
- short_noise_thres (float) – Threshold for short noise.
- speech_volume_normalize (Optional[float]) – Volume normalization factor for speech.
- speech_name (str) – Key for speech data in the input dictionary.
- text_name (str) – Key for text data in the input dictionary.
- text_prev_name (str) – Key for previous text data in the input dictionary.
- text_ctc_name (str) – Key for CTC text data in the input dictionary.
- fs (int) – Sampling rate for the audio data.
- na_symbol (str) – Symbol indicating text is not available (e.g., for prev or ctc).
- speech_length (float) – Target length for speech data in seconds.
- speech_init_silence (float) – Maximum silence, in seconds, to add before speech during data augmentation.
- text_prev_apply_prob (float) – Probability of using previous text for conditioning.
- lang_apply_prob (float) – Probability of using ground truth language instead of unknown.
- nolang_symbol (str) – Symbol for no language.
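
Several of these parameters are given in human-friendly units and would need converting before use. The sketch below shows, under stated assumptions, how a noise_db_range string such as '3_10' could be parsed and how speech_length in seconds relates to a sample count at rate fs. Both helpers (`parse_db_range`, `seconds_to_samples`) are hypothetical names introduced here; the underscore-separated "low_high" reading of the string is inferred from the default value and is not confirmed by this page.

```python
from typing import Tuple


def parse_db_range(noise_db_range: str) -> Tuple[float, float]:
    """Parse a 'low_high' string (assumed format) into a (low, high) pair."""
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        # A single value is treated as a degenerate range.
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"Expected 'low_high', got {noise_db_range!r}")
    return low, high


def seconds_to_samples(seconds: float, fs: int) -> int:
    """Convert a duration in seconds to a whole number of samples."""
    return int(seconds * fs)


db_low, db_high = parse_db_range("3_10")
target_samples = seconds_to_samples(30, 16000)   # default speech_length and fs
silence_samples = seconds_to_samples(1.0, 16000)  # default speech_init_silence
```

With the default values, a 30-second target at 16000 Hz corresponds to 480000 samples, which matches the attribute documentation describing the stored speech length in samples.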

Examples

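>>> import numpy as np
>>> from espnet2.train.preprocessor import S2TCTCPreprocessor
>>> # One second of placeholder audio at 16 kHz; substitute real waveform data.
>>> speech_data = np.zeros(16000, dtype=np.float32)
>>> preprocessor = S2TCTCPreprocessor(train=True)
>>> processed_data = preprocessor(uid="example_uid", data={"speech": speech_data, "text": "Hello World"})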
>>> assert "speech" in processed_data
>>> assert "text" in processed_data