espnet2.text.word_tokenizer.WordTokenizer
class espnet2.text.word_tokenizer.WordTokenizer(delimiter: str | None = None, non_linguistic_symbols: Path | str | Iterable[str] | None = None, remove_non_linguistic_symbols: bool = False)
Bases: AbsTokenizer
A tokenizer that splits text into words based on a specified delimiter and can
optionally remove non-linguistic symbols.
delimiter
The character used to split the text into tokens.
- Type: Optional[str]
non_linguistic_symbols
A set of symbols that are considered non-linguistic and can be removed from the tokenized output.
- Type: set
remove_non_linguistic_symbols
A flag indicating whether to remove non-linguistic symbols from the tokenized output.
- Type: bool
Parameters:
- delimiter (Optional[str]) – The delimiter used for tokenization. If None, whitespace is used as the default delimiter.
- non_linguistic_symbols (Union[Path, str, Iterable[str], None]) – A path to a file or an iterable containing non-linguistic symbols to be removed. If None, no symbols will be removed.
- remove_non_linguistic_symbols (bool) – If True, non-linguistic symbols are removed from the tokenized output.
Raises: Warning – If non_linguistic_symbols is provided while remove_non_linguistic_symbols is False.
######### Examples
>>> tokenizer = WordTokenizer(delimiter=",",
...                           non_linguistic_symbols=["#", "$"],
...                           remove_non_linguistic_symbols=True)
>>> tokens = tokenizer.text2tokens("Hello,world,#,$")
>>> tokens
['Hello', 'world']
>>> tokenizer.tokens2text(tokens)
'Hello,world'
text2tokens(line: str) → List[str]
Converts a given text line into a list of tokens based on a specified delimiter.
This method splits the input text line into tokens using the delimiter set during the initialization of the WordTokenizer instance. If the remove_non_linguistic_symbols attribute is set to True, any tokens that match the non-linguistic symbols will be excluded from the result.
- Parameters: line (str) – The input text line to be tokenized.
- Returns: A list of tokens extracted from the input text line.
- Return type: List[str]
######### Examples
>>> tokenizer = WordTokenizer(delimiter=' ')
>>> tokenizer.text2tokens("Hello world! This is a test.")
['Hello', 'world!', 'This', 'is', 'a', 'test.']
>>> tokenizer = WordTokenizer(delimiter=',',
... non_linguistic_symbols=['n/a'],
... remove_non_linguistic_symbols=True)
>>> tokenizer.text2tokens("value1,n/a,value2,value3")
['value1', 'value2', 'value3']
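The splitting and filtering behavior shown above can be sketched as a standalone function. This is a hypothetical illustration of the documented semantics (split on the delimiter, then drop tokens that exactly equal a listed symbol), not the actual espnet2 implementation:

```python
from typing import Iterable, List, Optional


def split_like_word_tokenizer(
    line: str,
    delimiter: Optional[str] = None,
    non_linguistic_symbols: Iterable[str] = (),
    remove_non_linguistic_symbols: bool = False,
) -> List[str]:
    """Sketch of the documented text2tokens semantics."""
    # A None delimiter falls back to whitespace splitting, as documented.
    tokens = line.split(delimiter)
    if remove_non_linguistic_symbols:
        symbols = set(non_linguistic_symbols)
        # Only tokens that exactly equal a listed symbol are dropped;
        # symbols embedded inside a larger token are kept.
        tokens = [t for t in tokens if t not in symbols]
    return tokens


print(split_like_word_tokenizer(
    "value1,n/a,value2",
    delimiter=",",
    non_linguistic_symbols=["n/a"],
    remove_non_linguistic_symbols=True,
))  # ['value1', 'value2']
```

Note that filtering is an exact-match test per token, which is why `'n/a'` is removed in the second doctest above while punctuation attached to a word (e.g. `'test.'`) survives.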
tokens2text(tokens: Iterable[str]) → str
Converts a list of tokens back into a text string using a specified delimiter.
This method takes an iterable of tokens and joins them into a single string, inserting the specified delimiter between each token. If no delimiter has been set during the initialization of the WordTokenizer, a space character will be used as the default delimiter.
- Parameters: tokens (Iterable[str]) – An iterable containing the tokens to be joined.
- Returns: A string representing the joined tokens, separated by the specified delimiter.
- Return type: str
######### Examples
>>> tokenizer = WordTokenizer(delimiter=", ")
>>> tokens = ["Hello", "world", "!"]
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
Hello, world, !
>>> tokenizer_no_delimiter = WordTokenizer()
>>> tokens_no_delimiter = ["Hello", "world", "!"]
>>> text_no_delimiter = tokenizer_no_delimiter.tokens2text(tokens_no_delimiter)
>>> print(text_no_delimiter)
Hello world !
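The joining rule can likewise be sketched as a standalone helper. Again, this is a hypothetical illustration of the documented fallback behavior (join with the configured delimiter, or a single space when none was set), not espnet2's own code:

```python
from typing import Iterable, Optional


def join_like_word_tokenizer(
    tokens: Iterable[str],
    delimiter: Optional[str] = None,
) -> str:
    """Sketch of the documented tokens2text semantics."""
    # A None delimiter falls back to a single space, as documented.
    return (delimiter if delimiter is not None else " ").join(tokens)


print(join_like_word_tokenizer(["Hello", "world", "!"], delimiter=", "))
# Hello, world, !
```

Because joining simply interleaves the delimiter, `tokens2text` is an exact inverse of `text2tokens` only when no tokens were filtered out and the text contained no leading, trailing, or repeated delimiters.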