espnet2.text.build_tokenizer.build_tokenizer
espnet2.text.build_tokenizer.build_tokenizer(token_type: str, bpemodel: Path | str | Iterable[str] | None = None, non_linguistic_symbols: Path | str | Iterable[str] | None = None, remove_non_linguistic_symbols: bool = False, space_symbol: str = '<space>', delimiter: str | None = None, g2p_type: str | None = None, nonsplit_symbol: Iterable[str] | None = None, encode_kwargs: Dict | None = None, whisper_language: str | None = None, whisper_task: str | None = None, sot_asr: bool = False) → AbsTokenizer
A helper function to instantiate a tokenizer based on the specified type.
Creates a tokenizer instance according to the given token_type. Supported tokenization methods include BPE, Hugging Face, word, character, phoneme, and Whisper.
Parameters:
- token_type (str) – Type of tokenizer to build. Must be one of "bpe", "hugging_face", "word", "char", "phn", or a Whisper variant.
- bpemodel (Optional[Union[Path, str, Iterable[str]]]) – Path to the BPE model file; required for BPE and Hugging Face tokenizers.
- non_linguistic_symbols (Optional[Union[Path, str, Iterable[str]]]) – Symbols to treat as non-linguistic; applicable to word and char tokenizers.
- remove_non_linguistic_symbols (bool) – If True, remove non-linguistic symbols during tokenization. Not implemented for BPE and Hugging Face tokenizers.
- space_symbol (str) – Symbol used to represent spaces.
- delimiter (Optional[str]) – Delimiter for word tokenization.
- g2p_type (Optional[str]) – Grapheme-to-phoneme conversion type for phoneme tokenization.
- nonsplit_symbol (Optional[Iterable[str]]) – Symbols that should not be split.
- encode_kwargs (Optional[Dict]) – Additional arguments for encoding (text to token).
- whisper_language (Optional[str]) – Language for the Whisper tokenizer.
- whisper_task (Optional[str]) – Task type for the Whisper tokenizer (e.g., "transcribe").
- sot_asr (bool) – Whether to use start-of-transcript for ASR.
Returns: An instance of a tokenizer based on the specified type.
Return type: AbsTokenizer
Raises:
- ValueError – If token_type is not recognized or if bpemodel is required but not provided for BPE or Hugging Face tokenizers.
- RuntimeError – If remove_non_linguistic_symbols is used with BPE or Hugging Face tokenizers.
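For instance, both failure modes can be reproduced directly. This is a minimal sketch; the exact exception messages are illustrative assumptions, not verbatim ESPnet output:
>>> build_tokenizer("bpe")  # bpemodel is required for BPE
Traceback (most recent call last):
    ...
ValueError: ...
>>> build_tokenizer("bpe", bpemodel="path/to/bpe.model",
...                 remove_non_linguistic_symbols=True)
Traceback (most recent call last):
    ...
RuntimeError: ...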
Examples
>>> tokenizer = build_tokenizer("bpe", bpemodel="path/to/bpe.model")
>>> tokens = tokenizer.text2tokens("Hello world!")
>>> print(tokens)
>>> tokenizer = build_tokenizer("word", non_linguistic_symbols=["@"])
>>> tokens = tokenizer.text2tokens("Hello @ world!")
>>> print(tokens)
>>> tokenizer = build_tokenizer("whisper", bpemodel="path/to/model",
...                             whisper_language="en", whisper_task="transcribe")
>>> tokens = tokenizer.text2tokens("Hello world!")
>>> print(tokens)
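The same pattern extends to the char and phoneme tokenizers. A hedged sketch: the symbol list is illustrative, and "g2p_en" is one plausible g2p_type value rather than the only option:
>>> # Char tokenizer that strips a non-linguistic marker (illustrative symbol)
>>> tokenizer = build_tokenizer("char",
...                             non_linguistic_symbols=["[noise]"],
...                             remove_non_linguistic_symbols=True)
>>> tokens = tokenizer.text2tokens("a[noise]b")
>>> print(tokens)
>>> # Phoneme tokenizer via grapheme-to-phoneme conversion (assumed g2p_type)
>>> tokenizer = build_tokenizer("phn", g2p_type="g2p_en")
>>> tokens = tokenizer.text2tokens("Hello world!")
>>> print(tokens)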