espnet2.text.build_tokenizer.build_tokenizer
espnet2.text.build_tokenizer.build_tokenizer(token_type: str, bpemodel: Path | str | Iterable[str] | None = None, non_linguistic_symbols: Path | str | Iterable[str] | None = None, remove_non_linguistic_symbols: bool = False, space_symbol: str = '<space>', delimiter: str | None = None, g2p_type: str | None = None, nonsplit_symbol: Iterable[str] | None = None, encode_kwargs: Dict | None = None, whisper_language: str | None = None, whisper_task: str | None = None, sot_asr: bool = False) → AbsTokenizer
A helper function to instantiate a tokenizer based on the specified type.
Creates a tokenizer instance according to the given token_type. Supported tokenization methods include BPE, Hugging Face, word, character, phoneme, and Whisper.
Parameters:
- token_type (str) – Type of tokenizer to build. Must be one of "bpe", "hugging_face", "word", "char", "phn", or a Whisper variant.
- bpemodel (Optional[Union[Path, str, Iterable[str]]]) – Path to the BPE model file; required for BPE and Hugging Face tokenizers.
- non_linguistic_symbols (Optional[Union[Path, str, Iterable[str]]]) – Symbols to treat as non-linguistic; applicable to word and char tokenizers.
- remove_non_linguistic_symbols (bool) – If True, remove non-linguistic symbols during tokenization. Not implemented for BPE and Hugging Face tokenizers.
- space_symbol (str) – Symbol used to represent spaces.
- delimiter (Optional[str]) – Delimiter for word tokenization.
- g2p_type (Optional[str]) – Grapheme-to-phoneme conversion type for phoneme tokenization.
- nonsplit_symbol (Optional[Iterable[str]]) – Symbols that should not be split.
- encode_kwargs (Optional[Dict]) – Additional arguments for encoding (text to token).
- whisper_language (Optional[str]) – Language for the Whisper tokenizer.
- whisper_task (Optional[str]) – Task type for the Whisper tokenizer (e.g., "transcribe").
- sot_asr (bool) – Whether to use start-of-transcript for ASR.
Returns: An instance of a tokenizer based on the specified type.
Return type: AbsTokenizer
Raises:
- ValueError – If token_type is not recognized or if bpemodel is required but not provided for BPE or Hugging Face tokenizers.
- RuntimeError – If remove_non_linguistic_symbols is used with BPE or Hugging Face tokenizers.
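For instance, both failure modes can be reproduced directly. This is a minimal sketch; the exact exception messages are illustrative assumptions, not verbatim ESPnet output:
>>> build_tokenizer("bpe")  # bpemodel is required for BPE
Traceback (most recent call last):
    ...
ValueError: ...
>>> build_tokenizer("bpe", bpemodel="path/to/bpe.model",
...                 remove_non_linguistic_symbols=True)
Traceback (most recent call last):
    ...
RuntimeError: ...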
Examples
>>> tokenizer = build_tokenizer("bpe", bpemodel="path/to/bpe.model")
>>> tokens = tokenizer.text2tokens("Hello world!")
>>> print(tokens)
>>> tokenizer = build_tokenizer("word", non_linguistic_symbols=["@"])
>>> tokens = tokenizer.text2tokens("Hello @ world!")
>>> print(tokens)
>>> tokenizer = build_tokenizer("whisper", bpemodel="path/to/model",
...                             whisper_language="en", whisper_task="transcribe")
>>> tokens = tokenizer.text2tokens("Hello world!")
>>> print(tokens)
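The same pattern extends to the char and phoneme tokenizers. A hedged sketch: the symbol list is illustrative, and "g2p_en" is one plausible g2p_type value rather than the only option:
>>> # Char tokenizer that strips a non-linguistic marker (illustrative symbol)
>>> tokenizer = build_tokenizer("char",
...                             non_linguistic_symbols=["[noise]"],
...                             remove_non_linguistic_symbols=True)
>>> tokens = tokenizer.text2tokens("a[noise]b")
>>> print(tokens)
>>> # Phoneme tokenizer via grapheme-to-phoneme conversion (assumed g2p_type)
>>> tokenizer = build_tokenizer("phn", g2p_type="g2p_en")
>>> tokens = tokenizer.text2tokens("Hello world!")
>>> print(tokens)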