espnet2.text.abs_tokenizer.AbsTokenizer

class espnet2.text.abs_tokenizer.AbsTokenizer

Bases: ABC

Abstract base class for tokenizers that convert text to tokens and vice versa.

This class defines the interface for tokenization, requiring subclasses to implement methods for converting text to tokens and tokens back to text.

Parameters: None
Returns: None
Yields: None
Raises: NotImplementedError – If the abstract methods are not implemented by a subclass.

######### Examples

>>> from typing import Iterable, List
>>> from espnet2.text.abs_tokenizer import AbsTokenizer
>>> class SimpleTokenizer(AbsTokenizer):
...     def text2tokens(self, line: str) -> List[str]:
...         return line.split()
...     def tokens2text(self, tokens: Iterable[str]) -> str:
...         return " ".join(tokens)
>>> tokenizer = SimpleTokenizer()
>>> tokens = tokenizer.text2tokens("Hello world")
>>> text = tokenizer.tokens2text(tokens)
>>> print(tokens)
['Hello', 'world']
>>> print(text)
Hello world
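Because the class derives from ABC, a subclass is only usable once it overrides both methods. The sketch below illustrates this under one assumption: that both methods are declared with `@abstractmethod`, as the `abstract` markers in the signatures on this page suggest. The `AbsTokenizer` defined here is a local stand-in so the snippet runs without ESPnet installed, and `IncompleteTokenizer` is a hypothetical name.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


# Local stand-in mirroring the interface documented on this page (assumption).
# With ESPnet installed, import the real class instead:
#   from espnet2.text.abs_tokenizer import AbsTokenizer
class AbsTokenizer(ABC):
    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError


# Hypothetical subclass that overrides only one of the two abstract methods.
class IncompleteTokenizer(AbsTokenizer):
    def text2tokens(self, line: str) -> List[str]:
        return line.split()


try:
    IncompleteTokenizer()
    instantiated = True
    message = ""
except TypeError as err:
    # Python's abc machinery refuses to instantiate a class that still
    # has unimplemented abstract methods.
    instantiated = False
    message = str(err)

print(instantiated)  # False
print(message)       # names the missing method, tokens2text
```

In this sketch the failure surfaces as a `TypeError` at instantiation time, before either method is ever called.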

abstract text2tokens(line: str) → List[str]

Converts a given line of text into a list of tokens.

Subclasses of AbsTokenizer implement this method to define how input text is tokenized, for example by splitting on whitespace, punctuation, or other criteria, depending on the specific tokenizer.

Parameters: line (str) – The input line of text to be tokenized.
Returns: A list of tokens extracted from the input line.
Return type: List[str]
Raises: NotImplementedError – If the method is called directly on an instance of AbsTokenizer.

######### Examples

Example usage of a concrete implementation:

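>>> from typing import Iterable, List
>>> from espnet2.text.abs_tokenizer import AbsTokenizer
>>> class SimpleTokenizer(AbsTokenizer):
...     def text2tokens(self, line: str) -> List[str]:
...         return line.split()
...     def tokens2text(self, tokens: Iterable[str]) -> str:
...         return " ".join(tokens)
>>> tokenizer = SimpleTokenizer()
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> print(tokens)
['Hello,', 'world!']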
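The description of `text2tokens` mentions punctuation as a possible splitting criterion. As a rough sketch of that idea, the snippet below separates punctuation marks into their own tokens with a regular expression. `PunctTokenizer` and its pattern are illustrative assumptions, not part of ESPnet, and the `AbsTokenizer` here is a local stand-in so the code runs on its own.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable, List


# Local stand-in for espnet2.text.abs_tokenizer.AbsTokenizer (assumption).
class AbsTokenizer(ABC):
    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError


# Hypothetical tokenizer: runs of word characters become one token,
# and each non-space, non-word character becomes its own token.
class PunctTokenizer(AbsTokenizer):
    _pattern = re.compile(r"\w+|[^\w\s]")

    def text2tokens(self, line: str) -> List[str]:
        return self._pattern.findall(line)

    def tokens2text(self, tokens: Iterable[str]) -> str:
        # Naive inverse: join with single spaces.
        return " ".join(tokens)


tokenizer = PunctTokenizer()
tokens = tokenizer.text2tokens("Hello, world!")
text = tokenizer.tokens2text(tokens)
print(tokens)  # ['Hello', ',', 'world', '!']
print(text)    # Hello , world !
```

Note that the round trip is lossy in this sketch: joining with spaces does not restore the original spacing around punctuation, so a real implementation would need a matching `tokens2text`.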

abstract tokens2text(tokens: Iterable[str]) → str

Converts an iterable of tokens back into a text string.

This method takes an iterable of tokens and concatenates them into a single string. The tokens are typically generated by the text2tokens method and may need to be joined with spaces or other delimiters, depending on the specific tokenizer implementation.

Parameters: tokens (Iterable[str]) – An iterable containing tokens to be converted back into a text string.
Returns: The reconstructed text string formed from the provided tokens.
Return type: str
Raises: NotImplementedError – If the method is not implemented in a subclass.

######### Examples

>>> tokenizer = MyTokenizer()  # MyTokenizer: a hypothetical AbsTokenizer subclass
>>> tokens = ['Hello', 'world', '!']
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
Hello world!

NOTE

This is an abstract method and must be implemented in a subclass.
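How tokens are joined depends on how they were produced. As one possible illustration, the sketch below pairs a character-level `text2tokens` with a `tokens2text` that concatenates without a delimiter and maps a placeholder token back to a space. `CharLevelTokenizer` and the `"<space>"` placeholder are hypothetical choices made for this example, and `AbsTokenizer` is again a local stand-in rather than the ESPnet class.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


# Local stand-in for espnet2.text.abs_tokenizer.AbsTokenizer (assumption).
class AbsTokenizer(ABC):
    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError


# Hypothetical character-level tokenizer with an explicit space placeholder.
class CharLevelTokenizer(AbsTokenizer):
    def __init__(self, space_symbol: str = "<space>"):
        self.space_symbol = space_symbol

    def text2tokens(self, line: str) -> List[str]:
        # One token per character; spaces are replaced by the placeholder.
        return [self.space_symbol if ch == " " else ch for ch in line]

    def tokens2text(self, tokens: Iterable[str]) -> str:
        # Concatenate with no delimiter, restoring spaces from the placeholder.
        return "".join(" " if t == self.space_symbol else t for t in tokens)


tokenizer = CharLevelTokenizer()
tokens = tokenizer.text2tokens("hi you")
text = tokenizer.tokens2text(tokens)
print(tokens)  # ['h', 'i', '<space>', 'y', 'o', 'u']
print(text)    # hi you
```

Here the two methods are exact inverses for this input, which is a convenient property to check when writing a tokenizer pair.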