espnet2.text.char_tokenizer.CharTokenizer
class espnet2.text.char_tokenizer.CharTokenizer(non_linguistic_symbols: Path | str | Iterable[str] | None = None, space_symbol: str = '<space>', remove_non_linguistic_symbols: bool = False, nonsplit_symbols: Iterable[str] | None = None)
Bases: AbsTokenizer
CharTokenizer is a character-level tokenizer that converts text to tokens and vice versa.
This tokenizer handles non-linguistic symbols and allows customization of the space representation. It can be used to preprocess text for natural language processing tasks.
space_symbol
The symbol used to represent space in the tokenized output.
- Type: str
non_linguistic_symbols
A set of non-linguistic symbols that will be treated as individual tokens.
- Type: set
remove_non_linguistic_symbols
If True, non-linguistic symbols will be removed from the tokenized output.
- Type: bool
nonsplit_symbols
A set of symbols that will not be split when tokenizing.
- Type: set
Parameters:
- non_linguistic_symbols (Optional[Union[Path, str, Iterable[str]]]) – A path to a file or an iterable of non-linguistic symbols to be treated as individual tokens. Defaults to None.
- space_symbol (str) – The symbol used to represent space. Defaults to "<space>".
- remove_non_linguistic_symbols (bool) – If True, removes non-linguistic symbols from the output. Defaults to False.
- nonsplit_symbols (Optional[Iterable[str]]) – A list of symbols that should not be split when tokenizing. Defaults to None.
######### Examples
tokenizer = CharTokenizer(non_linguistic_symbols=["#", "@"])
tokens = tokenizer.text2tokens("Hello #World!")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '<space>', '#', 'W', 'o', 'r', 'l', 'd', '!']
text = tokenizer.tokens2text(tokens)
print(text)
# Output: "Hello #World!"
- Raises: FileNotFoundError – If a specified file of non-linguistic symbols does not exist.
NOTE
This tokenizer is part of the ESPnet2 text processing module.
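The non_linguistic_symbols argument also accepts a file path. A minimal sketch, assuming the file lists one symbol per line (the nlsyms.txt name below is only illustrative):

from pathlib import Path
from espnet2.text.char_tokenizer import CharTokenizer

# Assumed file format: one non-linguistic symbol per line.
symbols_file = Path("nlsyms.txt")
symbols_file.write_text("[noise]\n[laughter]\n")

# Symbols read from the file are matched as single tokens.
tokenizer = CharTokenizer(non_linguistic_symbols=symbols_file)
print(tokenizer.text2tokens("a[noise]b"))
# Expected: ['a', '[noise]', 'b']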
text2tokens(line: str) → List[str]
Converts a string of text into a list of tokens based on specified rules.
The text2tokens method processes the input string line and returns a list of tokens. It recognizes both linguistic and non-linguistic symbols and converts space characters to the defined space_symbol.
- Parameters: line (str) – The input string to be tokenized.
- Returns: A list of tokens extracted from the input string.
- Return type: List[str]
######### Examples
tokenizer = CharTokenizer(non_linguistic_symbols=["@", "#"], space_symbol="<space>")
tokens = tokenizer.text2tokens("Hello @world! How are you?")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '<space>', '@', 'w', 'o', 'r', 'l', 'd', '!',
#          '<space>', 'H', 'o', 'w', '<space>', 'a', 'r', 'e', '<space>', 'y', 'o', 'u', '?']
NOTE
The behavior of this method is influenced by the remove_non_linguistic_symbols and nonsplit_symbols attributes set during the initialization of the CharTokenizer instance.
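A hedged sketch of how those two settings interact, assuming nonsplit symbols are matched like non-linguistic symbols but are never removed (per the attribute descriptions above):

from espnet2.text.char_tokenizer import CharTokenizer

# remove_non_linguistic_symbols=True drops matched symbols entirely.
tokenizer = CharTokenizer(
    non_linguistic_symbols=["[noise]"],
    remove_non_linguistic_symbols=True,
)
print(tokenizer.text2tokens("a[noise]b"))
# Expected: ['a', 'b']

# nonsplit_symbols are kept as single tokens rather than split per character.
tokenizer = CharTokenizer(nonsplit_symbols=["<unk>"])
print(tokenizer.text2tokens("a<unk>b"))
# Expected: ['a', '<unk>', 'b']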
tokens2text(tokens: Iterable[str]) → str
Converts a sequence of tokens back into a single text string.
This method takes an iterable of tokens and transforms them into a single string representation. It replaces the specified space symbol with a space character to reconstruct the original text format.
- Parameters: tokens (Iterable[str]) – An iterable of tokens to be converted into text. The tokens may include the special space symbol, which will be replaced with a regular space in the output.
- Returns: The reconstructed text string derived from the input tokens.
- Return type: str
######### Examples
>>> tokenizer = CharTokenizer()
>>> tokens = ['H', 'e', 'l', 'l', 'o', '<space>', 'W', 'o', 'r', 'l', 'd']
>>> tokenizer.tokens2text(tokens)
'Hello World'
>>> tokens = ['T', 'h', 'i', 's', '<space>', 'i', 's', '<space>', 'a',
...           'n', '<space>', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '.']
>>> tokenizer.tokens2text(tokens)
'This is an example.'
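A quick round-trip sanity check, sketched under the default settings:

from espnet2.text.char_tokenizer import CharTokenizer

tokenizer = CharTokenizer()
text = "Hello World"
tokens = tokenizer.text2tokens(text)
# Expected: ['H', 'e', 'l', 'l', 'o', '<space>', 'W', 'o', 'r', 'l', 'd']
assert tokenizer.tokens2text(tokens) == text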