espnet2.train.preprocessor.SpeechLMPreprocessor
class espnet2.train.preprocessor.SpeechLMPreprocessor(token_list: List, token_bias: Dict, encoder_decoder_format: bool = False, codec_token_per_frame: int = 1, codec_token_in_use: int | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, g2p_type: str | None = None, bpemodel: Path | str | Iterable[str] | None = None, bpe_encode_kwargs: Dict | None = None, text_cleaner: str | None = None, speaker_prompt_length: int = 1800)
Bases: AbsPreprocessor
Preprocessor specifically for SpeechLM models.
This class handles the preprocessing steps required for SpeechLM models, including tokenization and modality-specific processing of speech and text inputs. It utilizes various tokenizers and encoders to convert raw data into the appropriate format for training or inference.
token_list
A list of tokens used for encoding.
- Type: List
token_bias
A dictionary containing bias values for different token types.
- Type: Dict
encoder_decoder_format
Flag indicating if the output should follow encoder-decoder format.
- Type: bool
codec_token_per_frame
Number of codec tokens per frame.
- Type: int
codec_token_in_use
Number of codec tokens to use.
- Type: int
unk_symbol
Symbol representing unknown tokens.
- Type: str
space_symbol
Symbol representing spaces in the text.
- Type: str
non_linguistic_symbols
Symbols that are not linguistic.
- Type: Union[Path, str, Iterable[str]]
g2p_type
Type of grapheme-to-phoneme model used.
- Type: str
bpemodel
BPE model for tokenization.
- Type: Union[Path, str, Iterable[str]]
bpe_encode_kwargs
Additional arguments for BPE encoding.
- Type: Dict
text_cleaner
Method for cleaning text input.
- Type: str
speaker_prompt_length
Length of the speaker prompt in tokens.
- Type: int
Parameters:
- token_list (List) – A list of tokens for encoding.
- token_bias (Dict) – A dictionary mapping modalities to their biases.
- encoder_decoder_format (bool) – If True, outputs in encoder-decoder format.
- codec_token_per_frame (int) – Number of codec tokens per frame.
- codec_token_in_use (int, optional) – Number of codec tokens to use. Defaults to None.
- unk_symbol (str, optional) – Symbol for unknown tokens. Defaults to "<unk>".
- space_symbol (str, optional) – Symbol for spaces. Defaults to "<space>".
- non_linguistic_symbols (Union[Path, str, Iterable[str]], optional) – Non-linguistic symbols. Defaults to None.
- g2p_type (str, optional) – Grapheme-to-phoneme type. Defaults to None.
- bpemodel (Union[Path, str, Iterable[str]], optional) – BPE model path. Defaults to None.
- bpe_encode_kwargs (Dict, optional) – Additional BPE encoding arguments. Defaults to None.
- text_cleaner (str, optional) – Method for cleaning text. Defaults to None.
- speaker_prompt_length (int, optional) – Length of the speaker prompt in tokens. Defaults to 1800.
Returns: Preprocessed data ready for the SpeechLM model.
Return type: Dict[str, np.ndarray]
Raises:
- ValueError – If continuous features are passed to the preprocessor.
- NotImplementedError – If a modality is not supported.
Examples
>>> import numpy as np
>>> preprocessor = SpeechLMPreprocessor(
...     token_list=['<pad>', '<unk>', '<sos>', '<eos>'],
...     token_bias={"codec": 4, "ssl": 1028},  # starting index of each modality's tokens
...     encoder_decoder_format=True
... )
>>> data = {
...     "speech": np.random.rand(16000),
...     "text": "Hello, how are you?"
... }
>>> processed_data = preprocessor("task_name", data)
>>> print(processed_data.keys())
dict_keys(['enc_seq', 'dec_seq', 'prefix_len'])
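The relationship between token_list and token_bias is easiest to see with a concrete layout. The following sketch is illustrative (the modality names and vocabulary sizes are assumptions, not taken from a real ESPnet configuration): all modalities share one flat vocabulary, and each bias records the index at which that modality's tokens begin.

from espnet2.train.preprocessor import SpeechLMPreprocessor

# Hypothetical vocabulary: special tokens first, then one contiguous span per modality.
special = ['<pad>', '<unk>', '<sos>', '<eos>']
text_bpe_tokens = [f'<bpe_{i}>' for i in range(5000)]   # hypothetical BPE vocabulary
codec_tokens = [f'<codec_{i}>' for i in range(1024)]    # hypothetical codec vocabulary

token_list = special + text_bpe_tokens + codec_tokens
token_bias = {
    "text_bpe": len(special),                      # BPE ids start right after the specials
    "codec": len(special) + len(text_bpe_tokens),  # codec ids start after the BPE span
}

preprocessor = SpeechLMPreprocessor(
    token_list=token_list,
    token_bias=token_bias,
    encoder_decoder_format=True,
)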
diagnose(data)
Diagnose the composed sequences for debugging.
This method is intended for development use: it inspects the input sequences built by the preprocessor and reports their content and format, which helps verify tokenization and modality handling.
- Parameters: data – The example whose composed sequences should be inspected.
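A minimal usage sketch, assuming diagnose prints its report to standard output (the exact output format is not documented here); preprocessor and data are the objects from the class-level example above:

>>> preprocessor.diagnose(data)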
modality_specific_processing(value, modality)
Processes the input value based on the specified modality.
This method reshapes, pads, and tokenizes the input value according to the modality type. It supports the codec, spk (speaker), text_bpe, g2p, and ssl modalities.
- Parameters:
- value (np.ndarray) – The input data to be processed.
- modality (str) – The type of modality. Supported modalities include:
- “codec”
- “spk”
- “text_bpe”
- “g2p”
- “ssl”
- Returns:
- The processed value as a numpy array.
- An optional continuous feature (currently not used).
- Return type: Tuple[np.ndarray, Optional[np.ndarray]]
- Raises: NotImplementedError – If an unsupported modality is specified.
Examples
>>> import numpy as np
>>> processor = SpeechLMPreprocessor(token_list=['<pad>', '<unk>', 'hello'],
...                                  token_bias={'codec': 3, 'ssl': 1027})
>>> value = np.array([1, 2, 3])
>>> processed_value, conti_feat = processor.modality_specific_processing(value, 'codec')
>>> print(processed_value)
array([...]) # example output after codec processing
>>> processed_value, conti_feat = processor.modality_specific_processing(value, 'text_bpe')
>>> print(processed_value)
array([...]) # example output after text_bpe processing
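To make the codec branch concrete, here is a standalone sketch of the kind of transformation described above. It is a plausible reconstruction under stated assumptions, not the actual ESPnet implementation: flat codec tokens are reshaped into frames of codec_token_per_frame streams, only the first codec_token_in_use streams are kept, and the codec offset from token_bias is added to move local codec ids into the shared vocabulary.

import numpy as np

def codec_processing_sketch(
    value: np.ndarray,            # flat codec token ids, length = frames * codec_token_per_frame
    codec_token_per_frame: int,
    codec_token_in_use: int,
    codec_bias: int,              # token_bias["codec"]: start of codec ids in the vocabulary
) -> np.ndarray:
    # Reshape the flat stream into (frames, streams-per-frame).
    frames = len(value) // codec_token_per_frame
    value = value[: frames * codec_token_per_frame]
    value = value.reshape(frames, codec_token_per_frame)
    # Keep only the first `codec_token_in_use` streams of each frame.
    value = value[:, :codec_token_in_use]
    # Shift local codec ids into the global vocabulary by adding the codec bias.
    return (value + codec_bias).flatten()

tokens = np.arange(8)  # 4 frames x 2 codec streams
print(codec_processing_sketch(tokens, codec_token_per_frame=2,
                              codec_token_in_use=1, codec_bias=100))
# -> [100 102 104 106]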
special_token(token)
Encode a special token.
Converts a special token string from token_list (e.g., "<sos/eos>") into the token-id form used when composing the model's input sequences.
- Parameters: token (str) – The special token to encode.
NOTE
This class assumes that the input data has already been validated and that the required fields are present.
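For intuition, a conceptual sketch of special-token encoding under the multi-stream layout follows. The helper below is hypothetical and may differ from the real method: it looks up the token id in token_list and tiles it across the codec streams so that the special token occupies one full multi-stream frame.

import numpy as np

def special_token_sketch(token_list, token, codec_token_in_use):
    # Look up the id of the special token and repeat it across all streams,
    # so the special token fills one complete multi-stream frame.
    idx = token_list.index(token)
    return np.full(codec_token_in_use, idx, dtype=np.int64)

print(special_token_sketch(['<pad>', '<unk>', '<sos/eos>'], '<sos/eos>', 4))
# -> [2 2 2 2]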