espnet2.train.preprocessor.S2TPreprocessor

class espnet2.train.preprocessor.S2TPreprocessor(train: bool, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech', text_name: str = 'text', text_prev_name: str = 'text_prev', text_ctc_name: str = 'text_ctc', fs: int = 16000, na_symbol: str = '<na>', speech_length: float = 30, speech_resolution: float = 0.02, speech_init_silence: float = 1.0, text_prev_apply_prob: float = 0.5, time_apply_prob: float = 0.5, notime_symbol: str = '<notimestamps>', first_time_symbol: str = '<0.00>', last_time_symbol: str = '<30.00>')

Bases: CommonPreprocessor

Preprocessor for speech-to-text tasks.

This class prepares speech and text data for training and evaluation in speech-to-text tasks. It handles tokenization, noise augmentation, padding or trimming speech to a fixed length, and special tokens such as timestamps.

text_prev_name

Name of the previous text input.

Type: str

text_ctc_name

Name of the CTC text input.

Type: str

speech_length

Desired speech length in samples.

Type: int

speech_resolution

Time resolution for speech processing in samples.

Type: int

speech_init_silence

Maximum initial silence duration before speech, in samples.

Type: int

text_prev_apply_prob

Probability of applying the previous text.

Type: float

time_apply_prob

Probability of including timestamps.

Type: float

na_symbol

Token for unavailable text data.

Type: str

notime

Token ID for no timestamp.

Type: int

first_time

Token ID for the first timestamp.

Type: int

last_time

Token ID for the last timestamp.

Type: int
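The length-related attributes above are stored in samples, while the constructor defaults in the signature (30, 0.02, 1.0) are given in seconds. A minimal sketch of that conversion follows, assuming it is simply the duration multiplied by `fs`. The helper name `seconds_to_samples` is hypothetical, and whether the library rounds or truncates is not stated here.

```python
def seconds_to_samples(seconds: float, fs: int = 16000) -> int:
    """Convert a duration in seconds to a whole number of samples.

    Hypothetical helper for illustration; rounding to the nearest
    sample is an assumption, not documented behavior.
    """
    return round(seconds * fs)


# Constructor defaults expressed in samples at fs = 16000
speech_length = seconds_to_samples(30)          # 480000
speech_resolution = seconds_to_samples(0.02)    # 320
speech_init_silence = seconds_to_samples(1.0)   # 16000
```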

Parameters:
- train (bool) – Whether to use in training mode.
- token_type (Optional[str]) – Type of tokenizer to use.
- token_list (Union[Path, str, Iterable[str]]) – Path to a token list file, or the tokens themselves.
- bpemodel (Union[Path, str, Iterable[str]]) – Path to BPE model.
- text_cleaner (Collection[str]) – Text cleaning configurations.
- g2p_type (Optional[str]) – Type of grapheme-to-phoneme conversion.
- unk_symbol (str) – Symbol for unknown tokens.
- space_symbol (str) – Symbol for space tokens.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Path to non-linguistic symbols.
- delimiter (Optional[str]) – Delimiter for tokenization.
- rir_scp (Optional[str]) – Path to RIR (Room Impulse Response) script file.
- rir_apply_prob (float) – Probability of applying RIR.
- noise_scp (Optional *[*str ]) – Path to noise script file.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB.
- short_noise_thres (float) – Threshold for short noise.
- speech_volume_normalize (Optional[float]) – Volume normalization factor for speech.
- speech_name (str) – Key for speech data in input.
- text_name (str) – Key for text data in input.
- text_prev_name (str) – Key for previous text data in input.
- text_ctc_name (str) – Key for CTC text data in input.
- fs (int) – Sampling frequency of the audio data.
- na_symbol (str) – Token for unavailable text data.
- speech_length (float) – Desired speech length in seconds.
- speech_resolution (float) – Time resolution for speech processing in seconds.
- speech_init_silence (float) – Max silence duration before speech in seconds.
- text_prev_apply_prob (float) – Probability of applying previous text.
- time_apply_prob (float) – Probability of including timestamps.
- notime_symbol (str) – Token for no timestamps.
- first_time_symbol (str) – Token for the first timestamp.
- last_time_symbol (str) – Token for the last timestamp.
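
The timestamp parameters fit together as follows: with the defaults, timestamp symbols run from `<0.00>` to `<30.00>` in steps of `speech_resolution` (0.02 s). The sketch below enumerates such symbols and maps a time to a token ID. It assumes the timestamp tokens occupy one contiguous run of IDs in the token list. The function names and the ID `100` are hypothetical and chosen only for illustration.

```python
from typing import List


def timestamp_symbols(
    first: float = 0.0, last: float = 30.0, resolution: float = 0.02
) -> List[str]:
    """List timestamp symbols from first to last at the given resolution."""
    n_steps = round((last - first) / resolution)
    return [f"<{first + i * resolution:.2f}>" for i in range(n_steps + 1)]


def timestamp_id(seconds: float, first_time_id: int, resolution: float = 0.02) -> int:
    """Map a time in seconds to a token ID.

    Assumes timestamp tokens are contiguous, starting at first_time_id.
    """
    return first_time_id + round(seconds / resolution)


symbols = timestamp_symbols()
print(symbols[0], symbols[1], symbols[-1], len(symbols))

# Hypothetical: suppose '<0.00>' has token ID 100 in the token list
print(timestamp_id(2.5, first_time_id=100))
```

Under this assumption, shifting every timestamp by a fixed amount of leading silence reduces to adding a constant offset to the timestamp IDs.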

Examples

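>>> import numpy as np
>>> from espnet2.train.preprocessor import S2TPreprocessor
>>> # Real configurations usually also pass token_type and token_list
>>> # (and bpemodel for BPE) so that text can be tokenized.
>>> preprocessor = S2TPreprocessor(train=True)
>>> processed_data = preprocessor(uid="example_uid", data={
...     "speech": np.random.randn(16000),
...     "text": "This is an example.",
...     "text_prev": "This is previous text.",
...     "text_ctc": "CTC text example."
... })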
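
The `text_prev` entry is governed by `text_prev_apply_prob`. One plausible reading, sketched below, is that in training mode the previous text is kept with that probability and otherwise replaced by `na_symbol`. This is an assumption about the behavior, not a copy of the library code, and the function name `choose_prev_text` is hypothetical.

```python
import random
from typing import Optional


def choose_prev_text(
    text_prev: str,
    apply_prob: float = 0.5,
    na_symbol: str = "<na>",
    train: bool = True,
    rng: Optional[random.Random] = None,
) -> str:
    """Keep the previous text with probability apply_prob during training.

    Assumed behavior: when the previous text is not applied, it is
    replaced by na_symbol. Outside training it is always kept.
    """
    rng = rng or random.Random()
    if train and rng.random() >= apply_prob:
        return na_symbol
    return text_prev


rng = random.Random(0)
kept = sum(
    choose_prev_text("This is previous text.", rng=rng) != "<na>"
    for _ in range(1000)
)
print(f"previous text kept in {kept} of 1000 draws")
```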

NOTE

The speech length will be padded or trimmed to the specified value, and silence may be added at the beginning of the speech.
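
The padding and trimming described in this note can be sketched as below. Plain Python lists keep the sketch dependency-free, whereas real code would operate on NumPy arrays. The function name `pad_or_trim` is hypothetical. Drawing the leading silence uniformly up to the maximum, and only in training mode, is an assumption.

```python
import random
from typing import List, Optional, Sequence, Tuple


def pad_or_trim(
    samples: Sequence[float],
    target_len: int,
    max_init_silence: int,
    train: bool = True,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], int]:
    """Prepend random leading silence, then pad or trim to target_len.

    Returns the fixed-length signal and the number of silence samples
    prepended (assumed to be 0 outside training).
    """
    rng = rng or random.Random()
    init_pad = int(rng.random() * max_init_silence) if train else 0
    out = [0.0] * init_pad + list(samples)
    if len(out) < target_len:
        out += [0.0] * (target_len - len(out))  # pad the tail with zeros
    else:
        out = out[:target_len]  # trim anything beyond the target length
    return out, init_pad


fs = 16000
speech = [0.1] * fs  # one second of dummy audio
fixed, init_pad = pad_or_trim(
    speech, target_len=30 * fs, max_init_silence=1 * fs, rng=random.Random(0)
)
print(len(fixed), init_pad)
```

Returning `init_pad` alongside the signal lets a caller shift any timestamps by the same amount of leading silence.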

Raises: ValueError – If the token list is not provided when the token type is specified.