espnet2.text.char_tokenizer.CharTokenizer
class espnet2.text.char_tokenizer.CharTokenizer(non_linguistic_symbols: Path | str | Iterable[str] | None = None, space_symbol: str = '<space>', remove_non_linguistic_symbols: bool = False, nonsplit_symbols: Iterable[str] | None = None)
Bases: AbsTokenizer
CharTokenizer is a character-level tokenizer that converts text to tokens and vice versa.
This tokenizer handles non-linguistic symbols and allows customization of the space representation. It can be used to preprocess text for natural language processing tasks.
space_symbol
The symbol used to represent space in the tokenized output.
- Type: str
non_linguistic_symbols
A set of non-linguistic symbols that will be treated as individual tokens.
- Type: set
remove_non_linguistic_symbols
If True, non-linguistic symbols will be removed from the tokenized output.
- Type: bool
nonsplit_symbols
A set of symbols that will not be split when tokenizing.
- Type: set
Parameters:
- non_linguistic_symbols (Optional[Union[Path, str, Iterable[str]]]) – A path to a file or an iterable of non-linguistic symbols to be treated as individual tokens. Defaults to None.
- space_symbol (str) – The symbol used to represent space. Defaults to "<space>".
- remove_non_linguistic_symbols (bool) – If True, removes non-linguistic symbols from the output. Defaults to False.
- nonsplit_symbols (Optional[Iterable[str]]) – A list of symbols that should not be split when tokenizing. Defaults to None.
######### Examples
tokenizer = CharTokenizer(non_linguistic_symbols=["#", "@"])
tokens = tokenizer.text2tokens("Hello #World!")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '<space>', '#', 'W', 'o', 'r', 'l', 'd', '!']
text = tokenizer.tokens2text(tokens)
print(text)
# Output: "Hello #World!"
- Raises: FileNotFoundError – If a specified file of non-linguistic symbols does not exist.
NOTE
This tokenizer is part of the ESPnet2 text processing module.
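The non_linguistic_symbols argument also accepts a file path. A minimal sketch, assuming the file lists one symbol per line (the nlsyms.txt name below is only illustrative):

from pathlib import Path
from espnet2.text.char_tokenizer import CharTokenizer

# Assumed file format: one non-linguistic symbol per line.
symbols_file = Path("nlsyms.txt")
symbols_file.write_text("[noise]\n[laughter]\n")

# Symbols read from the file are matched as single tokens.
tokenizer = CharTokenizer(non_linguistic_symbols=symbols_file)
print(tokenizer.text2tokens("a[noise]b"))
# Expected: ['a', '[noise]', 'b']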
text2tokens(line: str) → List[str]
Converts a string of text into a list of tokens based on specified rules.
The text2tokens method processes the input string line and returns a list of tokens. It recognizes both linguistic and non-linguistic symbols and converts space characters to the defined space_symbol.
- Parameters: line (str) – The input string to be tokenized.
- Returns: A list of tokens extracted from the input string.
- Return type: List[str]
######### Examples
tokenizer = CharTokenizer(non_linguistic_symbols=["@", "#"], space_symbol="<space>")
tokens = tokenizer.text2tokens("Hello @world! How are you?")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '<space>', '@', 'w', 'o', 'r', 'l', 'd', '!',
#          '<space>', 'H', 'o', 'w', '<space>', 'a', 'r', 'e', '<space>', 'y', 'o', 'u', '?']
NOTE
The behavior of this method is influenced by the remove_non_linguistic_symbols and nonsplit_symbols attributes set during the initialization of the CharTokenizer instance.
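A hedged sketch of how those two settings interact, assuming nonsplit symbols are matched like non-linguistic symbols but are never removed (per the attribute descriptions above):

from espnet2.text.char_tokenizer import CharTokenizer

# remove_non_linguistic_symbols=True drops matched symbols entirely.
tokenizer = CharTokenizer(
    non_linguistic_symbols=["[noise]"],
    remove_non_linguistic_symbols=True,
)
print(tokenizer.text2tokens("a[noise]b"))
# Expected: ['a', 'b']

# nonsplit_symbols are kept as single tokens rather than split per character.
tokenizer = CharTokenizer(nonsplit_symbols=["<unk>"])
print(tokenizer.text2tokens("a<unk>b"))
# Expected: ['a', '<unk>', 'b']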
tokens2text(tokens: Iterable[str]) → str
Converts a sequence of tokens back into a single text string.
This method takes an iterable of tokens and transforms them into a single string representation. It replaces the specified space symbol with a space character to reconstruct the original text format.
- Parameters: tokens (Iterable[str]) – An iterable of tokens to be converted into text. The tokens may include the special space symbol, which will be replaced with a regular space in the output.
- Returns: The reconstructed text string derived from the input tokens.
- Return type: str
######### Examples
>>> tokenizer = CharTokenizer()
>>> tokens = ['H', 'e', 'l', 'l', 'o', '<space>', 'W', 'o', 'r', 'l', 'd']
>>> tokenizer.tokens2text(tokens)
'Hello World'
>>> tokens = ['T', 'h', 'i', 's', '<space>', 'i', 's', '<space>', 'a',
...           'n', '<space>', 'e', 'x', 'a', 'm', 'p', 'l', 'e', '.']
>>> tokenizer.tokens2text(tokens)
'This is an example.'
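A quick round-trip sanity check, sketched under the default settings:

from espnet2.text.char_tokenizer import CharTokenizer

tokenizer = CharTokenizer()
text = "Hello World"
tokens = tokenizer.text2tokens(text)
# Expected: ['H', 'e', 'l', 'l', 'o', '<space>', 'W', 'o', 'r', 'l', 'd']
assert tokenizer.tokens2text(tokens) == text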