espnet2.text.phoneme_tokenizer.PhonemeTokenizer

class espnet2.text.phoneme_tokenizer.PhonemeTokenizer(g2p_type: None | str, non_linguistic_symbols: None | Path | str | Iterable[str] = None, space_symbol: str = '<space>', remove_non_linguistic_symbols: bool = False)

Bases: AbsTokenizer

A tokenizer that converts text into phonemes using various G2P methods.

This class is designed to handle text-to-phoneme (G2P) conversion for different languages and dialects. It supports multiple G2P backends and allows customization for handling non-linguistic symbols.

g2p_type

The type of G2P method to use for phoneme conversion.

Type: str

space_symbol

The symbol used to represent spaces in tokenized output.

Type: str

non_linguistic_symbols

A set of symbols to handle separately during tokenization.

Type: set

remove_non_linguistic_symbols

Flag to determine whether to remove non-linguistic symbols from the output.

Type: bool
Parameters:
- g2p_type (Union[None, str]) – The G2P method to use. If None, a simple space-based tokenizer is used.
- non_linguistic_symbols (Union[None, Path, str, Iterable[str]]) – The non-linguistic symbols to handle, given either as an iterable of strings or as a path to a file that lists them.
- space_symbol (str) – Symbol to use for spaces in tokenized output. Default is '<space>'.
- remove_non_linguistic_symbols (bool) – Whether to remove non-linguistic symbols from the output. Default is False.
Raises: NotImplementedError – If an unsupported G2P type is provided.
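
The fallback for `g2p_type=None` and the error for unsupported types can be sketched without any G2P library installed. This is a minimal illustration under stated assumptions, not ESPnet's implementation: `build_g2p`, `split_on_spaces`, and the `"chars"` backend are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, List, Optional


def split_on_spaces(text: str) -> List[str]:
    """Stand-in for the space-based tokenizer used when g2p_type is None."""
    return text.split(" ")


# Hypothetical registry. A real tokenizer would map names such as "g2p_en"
# to wrappers around the corresponding library.
_BACKENDS: Dict[str, Callable[[str], List[str]]] = {
    "chars": list,  # hypothetical backend: one token per character
}


def build_g2p(g2p_type: Optional[str]) -> Callable[[str], List[str]]:
    """Pick a text-to-token callable, or fail loudly for unknown types."""
    if g2p_type is None:
        return split_on_spaces
    if g2p_type in _BACKENDS:
        return _BACKENDS[g2p_type]
    raise NotImplementedError(f"Not supported: g2p_type={g2p_type}")


print(build_g2p(None)("HH AH0 L OW1"))  # ['HH', 'AH0', 'L', 'OW1']
```

With `g2p_type=None` the input is expected to be already phonemized and space-separated, so the tokenizer only has to split it.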

########### Examples

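>>> from espnet2.text.phoneme_tokenizer import PhonemeTokenizer
>>> tokenizer = PhonemeTokenizer(g2p_type="g2p_en")
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> print(tokens)  # ARPAbet with stress digits; exact output may vary by g2p_en version
['HH', 'AH0', 'L', 'OW1', ' ', ',', ' ', 'W', 'ER1', 'L', 'D', ' ', '!']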

>>> tokenizer = PhonemeTokenizer(g2p_type="g2p_en_no_space",
...                              non_linguistic_symbols=["!"])
>>> tokens = tokenizer.text2tokens("Hello! World!")
>>> print(tokens)
['HH', 'AH0', 'L', 'OW1', '!', 'W', 'ER1', 'L', 'D', '!']

######## NOTE The G2P methods used are dependent on the installation of corresponding libraries (e.g., g2p_en, g2pk, etc.). Make sure to install the necessary packages to utilize specific G2P methods.
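
Before constructing a tokenizer, it can help to check which backend packages are importable. The sketch below uses only the standard library; `is_backend_available` is a hypothetical helper, and the module names are taken from the examples on this page (the name to check for other backends is an assumption you would need to verify).

```python
import importlib.util


def is_backend_available(module_name: str) -> bool:
    """Return True if a top-level package of this name can be imported."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of a few optional G2P packages.
for name in ("g2p_en", "g2pk", "pyopenjtalk"):
    status = "installed" if is_backend_available(name) else "missing"
    print(f"{name}: {status}")
```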

text2tokens(line: str) → List[str]

Converts input text to a list of tokens (phonemes).

This method processes the input string line by extracting any non-linguistic symbols specified during the initialization of the PhonemeTokenizer and then applying the configured G2P (grapheme-to-phoneme) model to convert the remaining text into phonemes.

non_linguistic_symbols

A set of non-linguistic symbols to be recognized and handled during tokenization.

Type: set

remove_non_linguistic_symbols

Flag indicating whether to remove non-linguistic symbols from the output tokens.

Type: bool
Parameters:
- line (str) – The input text string to be tokenized.
Returns: A list of tokens (phonemes) generated from the input text.
Return type: List[str]

########### Examples

>>> from espnet2.text.phoneme_tokenizer import PhonemeTokenizer
>>> tokenizer = PhonemeTokenizer(g2p_type="g2p_en")
>>> tokenizer.text2tokens("Hello, world!")  # exact output may vary by g2p_en version
['HH', 'AH0', 'L', 'OW1', ' ', ',', ' ', 'W', 'ER1', 'L', 'D', ' ', '!']

######## NOTE The method processes the input line in a loop, checking for non-linguistic symbols at the beginning of the line. If found, it appends the symbol to the token list (if not set to remove) and continues processing the rest of the line. After handling all symbols, it applies the G2P model to the remaining text.
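
The scanning loop described in this note can be sketched as a standalone function. This is one reading of the note, not the library's code: `tokenize_with_symbols` is a hypothetical name, the G2P step is replaced by a caller-supplied callable, and the sketch assumes the kept text is rejoined before that callable runs. It also tries longer symbols first, which is a choice made here for determinism.

```python
from typing import Callable, Iterable, List


def tokenize_with_symbols(
    line: str,
    symbols: Iterable[str],
    g2p: Callable[[str], List[str]],
    remove: bool = False,
) -> List[str]:
    """Scan `line` for non-linguistic symbols, then run `g2p` on what is kept."""
    # Skip empty symbols (they would never consume input) and prefer long matches.
    ordered = sorted((s for s in symbols if s), key=len, reverse=True)
    kept: List[str] = []
    while line:
        for sym in ordered:
            if line.startswith(sym):
                if not remove:
                    kept.append(sym)  # keep the symbol as a single unit
                line = line[len(sym):]
                break
        else:
            # No symbol at the current position: keep one character and advance.
            kept.append(line[0])
            line = line[1:]
    return g2p("".join(kept))


# str.split stands in for a real G2P backend here.
print(tokenize_with_symbols("hi <laugh> there", ["<laugh>"], str.split, remove=True))
# ['hi', 'there']
print(tokenize_with_symbols("hi <laugh> there", ["<laugh>"], str.split))
# ['hi', '<laugh>', 'there']
```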

text2tokens_svs(syllable: str) → List[str]

Converts a given syllable into its corresponding phonetic tokens.

This method handles specific syllables by returning predefined token mappings from a custom dictionary. If the provided syllable is not in the dictionary, it defaults to using the general g2p (grapheme-to-phoneme) conversion method.

######## NOTE If needed, the customed_dic can be modified to include additional mappings as required.

Parameters:
- syllable (str) – The input syllable to be converted into phonetic tokens.
Returns: A list of phonetic tokens corresponding to the input syllable.
Return type: List[str]

########### Examples

>>> tokenizer = PhonemeTokenizer(g2p_type="pyopenjtalk")
>>> tokenizer.text2tokens_svs("は")
['h', 'a']
>>> tokenizer.text2tokens_svs("シ")
['sh', 'I']
>>> tokenizer.text2tokens_svs("くぁ")
['k', 'w', 'a']
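
The lookup-then-fallback pattern behind this method can be sketched as follows. The three dictionary entries are copied from the examples above; `CUSTOM_SYLLABLES` and `syllable_to_tokens` are hypothetical names, and the fallback is passed in as a callable rather than being a real G2P backend.

```python
from typing import Callable, Dict, List

# Entries taken from the documented examples; the real dictionary may hold more.
CUSTOM_SYLLABLES: Dict[str, List[str]] = {
    "は": ["h", "a"],
    "シ": ["sh", "I"],
    "くぁ": ["k", "w", "a"],
}


def syllable_to_tokens(
    syllable: str,
    fallback_g2p: Callable[[str], List[str]],
    overrides: Dict[str, List[str]] = CUSTOM_SYLLABLES,
) -> List[str]:
    """Return the predefined mapping if present, otherwise defer to G2P."""
    if syllable in overrides:
        return list(overrides[syllable])  # copy so callers cannot mutate the table
    return fallback_g2p(syllable)


print(syllable_to_tokens("は", list))  # ['h', 'a']
print(syllable_to_tokens("ab", list))  # ['a', 'b'] via the stand-in fallback
```

Passing the overrides as an argument is one way to extend the mapping without editing the function body.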

tokens2text(tokens: Iterable[str]) → str

Converts a sequence of phoneme tokens into a single string.

The tokens are concatenated into one string. Because grapheme-to-phoneme conversion is not invertible, the result is a phoneme string, not the original text.

Parameters:
- tokens (Iterable[str]) – The phoneme tokens to join.
Returns: The string formed by concatenating the tokens.
Return type: str

########### Examples

>>> from espnet2.text.phoneme_tokenizer import PhonemeTokenizer
>>> tokenizer = PhonemeTokenizer(g2p_type="g2p_en")
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> print(tokens)  # exact output may vary by g2p_en version
['HH', 'AH0', 'L', 'OW1', ' ', ',', ' ', 'W', 'ER1', 'L', 'D', ' ', '!']
>>> tokenizer.tokens2text(["HH", "AH0", "L", "OW1"])
'HHAH0LOW1'

######## NOTE The tokenizer’s behavior can be modified by adjusting the non_linguistic_symbols and remove_non_linguistic_symbols attributes.