espnet2.train.preprocessor.SpeechLMPreprocessor
class espnet2.train.preprocessor.SpeechLMPreprocessor(token_list: List, token_bias: Dict, encoder_decoder_format: bool = False, codec_token_per_frame: int = 1, codec_token_in_use: int | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, g2p_type: str | None = None, bpemodel: Path | str | Iterable[str] | None = None, bpe_encode_kwargs: Dict | None = None, text_cleaner: str | None = None, speaker_prompt_length: int = 1800)
Bases: AbsPreprocessor
Preprocessor specifically for SpeechLM models.
This class handles the preprocessing steps required for SpeechLM models, including tokenization and modality-specific processing of speech and text inputs. It utilizes various tokenizers and encoders to convert raw data into the appropriate format for training or inference.
token_list
A list of tokens used for encoding.
- Type: List
token_bias
A dictionary containing bias values for different token types.
- Type: Dict
encoder_decoder_format
Flag indicating if the output should follow encoder-decoder format.
- Type: bool
codec_token_per_frame
Number of codec tokens per frame.
- Type: int
codec_token_in_use
Number of codec tokens to use.
- Type: int
unk_symbol
Symbol representing unknown tokens.
- Type: str
space_symbol
Symbol representing spaces in the text.
- Type: str
non_linguistic_symbols
Symbols that are not linguistic.
- Type: Union[Path, str, Iterable[str]]
g2p_type
Type of grapheme-to-phoneme model used.
- Type: str
bpemodel
BPE model for tokenization.
- Type: Union[Path, str, Iterable[str]]
bpe_encode_kwargs
Additional arguments for BPE encoding.
- Type: Dict
text_cleaner
Method for cleaning text input.
- Type: str
speaker_prompt_length
Length of the speaker prompt in tokens.
- Type: int
Parameters:
- token_list (List) – A list of tokens for encoding.
- token_bias (Dict) – A dictionary mapping modalities to their biases.
- encoder_decoder_format (bool) – If True, outputs in encoder-decoder format.
- codec_token_per_frame (int) – Number of codec tokens per frame.
- codec_token_in_use (int, optional) – Number of codec tokens to use. Defaults to None.
- unk_symbol (str, optional) – Symbol for unknown tokens. Defaults to "<unk>".
- space_symbol (str, optional) – Symbol for spaces. Defaults to "<space>".
- non_linguistic_symbols (Union[Path, str, Iterable[str]], optional) – Non-linguistic symbols. Defaults to None.
- g2p_type (str, optional) – Grapheme-to-phoneme type. Defaults to None.
- bpemodel (Union[Path, str, Iterable[str]], optional) – BPE model path. Defaults to None.
- bpe_encode_kwargs (Dict, optional) – Additional BPE encoding arguments. Defaults to None.
- text_cleaner (str, optional) – Method for cleaning text. Defaults to None.
- speaker_prompt_length (int, optional) – Length of the speaker prompt in tokens. Defaults to 1800.
Returns: Preprocessed data ready for the SpeechLM model.
Return type: Dict[str, np.ndarray]
Raises:
- ValueError – If continuous features are passed to the preprocessor.
- NotImplementedError – If a modality is not supported.
Examples
>>> import numpy as np
>>> preprocessor = SpeechLMPreprocessor(
...     token_list=['<pad>', '<unk>', '<sos>', '<eos>'],
...     token_bias={"codec": 4, "ssl": 1028},  # starting index of each modality's tokens
...     encoder_decoder_format=True
... )
>>> data = {
...     "speech": np.random.rand(16000),
...     "text": "Hello, how are you?"
... }
>>> processed_data = preprocessor("task_name", data)
>>> print(processed_data.keys())
dict_keys(['enc_seq', 'dec_seq', 'prefix_len'])
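The relationship between token_list and token_bias is easiest to see with a concrete layout. The following sketch is illustrative (the modality names and vocabulary sizes are assumptions, not taken from a real ESPnet configuration): all modalities share one flat vocabulary, and each bias records the index at which that modality's tokens begin.

from espnet2.train.preprocessor import SpeechLMPreprocessor

# Hypothetical vocabulary: special tokens first, then one contiguous span per modality.
special = ['<pad>', '<unk>', '<sos>', '<eos>']
text_bpe_tokens = [f'<bpe_{i}>' for i in range(5000)]   # hypothetical BPE vocabulary
codec_tokens = [f'<codec_{i}>' for i in range(1024)]    # hypothetical codec vocabulary

token_list = special + text_bpe_tokens + codec_tokens
token_bias = {
    "text_bpe": len(special),                      # BPE ids start right after the specials
    "codec": len(special) + len(text_bpe_tokens),  # codec ids start after the BPE span
}

preprocessor = SpeechLMPreprocessor(
    token_list=token_list,
    token_bias=token_bias,
    encoder_decoder_format=True,
)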
diagnose(data)
Diagnose the composed sequences for debugging.
This method is intended for development use: it inspects the input sequences built by the preprocessor and reports their content and format, which helps verify tokenization and modality handling.
- Parameters: data – The example whose composed sequences should be inspected.
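A minimal usage sketch, assuming diagnose prints its report to standard output (the exact output format is not documented here); preprocessor and data are the objects from the class-level example above:

>>> preprocessor.diagnose(data)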
modality_specific_processing(value, modality)
Processes the input value based on the specified modality.
This method reshapes, pads, and tokenizes the input value according to the modality type. It supports the codec, spk (speaker), text_bpe, g2p, and ssl modalities.
- Parameters:
- value (np.ndarray) – The input data to be processed.
- modality (str) – The type of modality. Supported modalities include:
- “codec”
- “spk”
- “text_bpe”
- “g2p”
- “ssl”
- Returns:
- The processed value as a numpy array.
- An optional continuous feature (currently not used).
- Return type: Tuple[np.ndarray, Optional[np.ndarray]]
- Raises: NotImplementedError – If an unsupported modality is specified.
Examples
>>> import numpy as np
>>> processor = SpeechLMPreprocessor(token_list=['<pad>', '<unk>', 'hello'],
...                                  token_bias={'codec': 3, 'ssl': 1027})
>>> value = np.array([1, 2, 3])
>>> processed_value, conti_feat = processor.modality_specific_processing(value, 'codec')
>>> print(processed_value)
array([...]) # example output after codec processing
>>> processed_value, conti_feat = processor.modality_specific_processing(value, 'text_bpe')
>>> print(processed_value)
array([...]) # example output after text_bpe processing
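To make the codec branch concrete, here is a standalone sketch of the kind of transformation described above. It is a plausible reconstruction under stated assumptions, not the actual ESPnet implementation: flat codec tokens are reshaped into frames of codec_token_per_frame streams, only the first codec_token_in_use streams are kept, and the codec offset from token_bias is added to move local codec ids into the shared vocabulary.

import numpy as np

def codec_processing_sketch(
    value: np.ndarray,            # flat codec token ids, length = frames * codec_token_per_frame
    codec_token_per_frame: int,
    codec_token_in_use: int,
    codec_bias: int,              # token_bias["codec"]: start of codec ids in the vocabulary
) -> np.ndarray:
    # Reshape the flat stream into (frames, streams-per-frame).
    frames = len(value) // codec_token_per_frame
    value = value[: frames * codec_token_per_frame]
    value = value.reshape(frames, codec_token_per_frame)
    # Keep only the first `codec_token_in_use` streams of each frame.
    value = value[:, :codec_token_in_use]
    # Shift local codec ids into the global vocabulary by adding the codec bias.
    return (value + codec_bias).flatten()

tokens = np.arange(8)  # 4 frames x 2 codec streams
print(codec_processing_sketch(tokens, codec_token_per_frame=2,
                              codec_token_in_use=1, codec_bias=100))
# -> [100 102 104 106]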
special_token(token)
Encode a special token.
Converts a special token string from token_list (e.g., "<sos/eos>") into the token-id form used when composing the model's input sequences.
- Parameters: token (str) – The special token to encode.
NOTE
This class assumes that the input data has already been validated and that the required fields are present.
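For intuition, a conceptual sketch of special-token encoding under the multi-stream layout follows. The helper below is hypothetical and may differ from the real method: it looks up the token id in token_list and tiles it across the codec streams so that the special token occupies one full multi-stream frame.

import numpy as np

def special_token_sketch(token_list, token, codec_token_in_use):
    # Look up the id of the special token and repeat it across all streams,
    # so the special token fills one complete multi-stream frame.
    idx = token_list.index(token)
    return np.full(codec_token_in_use, idx, dtype=np.int64)

print(special_token_sketch(['<pad>', '<unk>', '<sos/eos>'], '<sos/eos>', 4))
# -> [2 2 2 2]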