espnet2.text.word_tokenizer.WordTokenizer
class espnet2.text.word_tokenizer.WordTokenizer(delimiter: str | None = None, non_linguistic_symbols: Path | str | Iterable[str] | None = None, remove_non_linguistic_symbols: bool = False)
Bases: AbsTokenizer
A tokenizer that splits text into words based on a specified delimiter and can
optionally remove non-linguistic symbols.
delimiter
The character used to split the text into tokens.
- Type: Optional[str]
non_linguistic_symbols
A set of symbols that are considered non-linguistic and can be removed from the tokenized output.
- Type: set
remove_non_linguistic_symbols
A flag indicating whether to remove non-linguistic symbols from the tokenized output.
- Type: bool
Parameters:
- delimiter (Optional[str]) – The delimiter used for tokenization. If None, whitespace is used as the default delimiter.
- non_linguistic_symbols (Union[Path, str, Iterable[str], None]) – A path to a file or an iterable containing non-linguistic symbols to be removed. If None, no symbols will be removed.
- remove_non_linguistic_symbols (bool) – If True, non-linguistic symbols are removed from the tokenized output.
Raises: Warning – If non_linguistic_symbols is provided while remove_non_linguistic_symbols is False.
######### Examples
>>> tokenizer = WordTokenizer(delimiter=",",
...                           non_linguistic_symbols=["#", "$"],
...                           remove_non_linguistic_symbols=True)
>>> tokens = tokenizer.text2tokens("Hello,world,#,$")
>>> tokens
['Hello', 'world']
>>> tokenizer.tokens2text(tokens)
'Hello,world'
text2tokens(line: str) → List[str]
Converts a given text line into a list of tokens based on a specified delimiter.
This method splits the input text line into tokens using the delimiter set during the initialization of the WordTokenizer instance. If the remove_non_linguistic_symbols attribute is set to True, any tokens that match the non-linguistic symbols will be excluded from the result.
- Parameters: line (str) – The input text line to be tokenized.
- Returns: A list of tokens extracted from the input text line.
- Return type: List[str]
######### Examples
>>> tokenizer = WordTokenizer(delimiter=' ')
>>> tokenizer.text2tokens("Hello world! This is a test.")
['Hello', 'world!', 'This', 'is', 'a', 'test.']
>>> tokenizer = WordTokenizer(delimiter=',',
... non_linguistic_symbols=['n/a'],
... remove_non_linguistic_symbols=True)
>>> tokenizer.text2tokens("value1,n/a,value2,value3")
['value1', 'value2', 'value3']
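The splitting and filtering behavior shown above can be sketched as a standalone function. This is a hypothetical illustration of the documented semantics (split on the delimiter, then drop tokens that exactly equal a listed symbol), not the actual espnet2 implementation:

```python
from typing import Iterable, List, Optional


def split_like_word_tokenizer(
    line: str,
    delimiter: Optional[str] = None,
    non_linguistic_symbols: Iterable[str] = (),
    remove_non_linguistic_symbols: bool = False,
) -> List[str]:
    """Sketch of the documented text2tokens semantics."""
    # A None delimiter falls back to whitespace splitting, as documented.
    tokens = line.split(delimiter)
    if remove_non_linguistic_symbols:
        symbols = set(non_linguistic_symbols)
        # Only tokens that exactly equal a listed symbol are dropped;
        # symbols embedded inside a larger token are kept.
        tokens = [t for t in tokens if t not in symbols]
    return tokens


print(split_like_word_tokenizer(
    "value1,n/a,value2",
    delimiter=",",
    non_linguistic_symbols=["n/a"],
    remove_non_linguistic_symbols=True,
))  # ['value1', 'value2']
```

Note that filtering is an exact-match test per token, which is why `'n/a'` is removed in the second doctest above while punctuation attached to a word (e.g. `'test.'`) survives.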
tokens2text(tokens: Iterable[str]) → str
Converts a list of tokens back into a text string using a specified delimiter.
This method takes an iterable of tokens and joins them into a single string, inserting the specified delimiter between each token. If no delimiter has been set during the initialization of the WordTokenizer, a space character will be used as the default delimiter.
- Parameters: tokens (Iterable[str]) – An iterable containing the tokens to be joined.
- Returns: A string representing the joined tokens, separated by the specified delimiter.
- Return type: str
######### Examples
>>> tokenizer = WordTokenizer(delimiter=", ")
>>> tokens = ["Hello", "world", "!"]
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
Hello, world, !
>>> tokenizer_no_delimiter = WordTokenizer()
>>> tokens_no_delimiter = ["Hello", "world", "!"]
>>> text_no_delimiter = tokenizer_no_delimiter.tokens2text(tokens_no_delimiter)
>>> print(text_no_delimiter)
Hello world !
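The joining rule can likewise be sketched as a standalone helper. Again, this is a hypothetical illustration of the documented fallback behavior (join with the configured delimiter, or a single space when none was set), not espnet2's own code:

```python
from typing import Iterable, Optional


def join_like_word_tokenizer(
    tokens: Iterable[str],
    delimiter: Optional[str] = None,
) -> str:
    """Sketch of the documented tokens2text semantics."""
    # A None delimiter falls back to a single space, as documented.
    return (delimiter if delimiter is not None else " ").join(tokens)


print(join_like_word_tokenizer(["Hello", "world", "!"], delimiter=", "))
# Hello, world, !
```

Because joining simply interleaves the delimiter, `tokens2text` is an exact inverse of `text2tokens` only when no tokens were filtered out and the text contained no leading, trailing, or repeated delimiters.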