espnet2.text.abs_tokenizer.AbsTokenizer
class espnet2.text.abs_tokenizer.AbsTokenizer
Bases: ABC
Abstract base class for tokenizers that convert text to tokens and vice versa.
This class defines the interface for tokenization, requiring subclasses to implement methods for converting text to tokens and tokens back to text.
- Parameters: None
- Returns: None
- Yields: None
- Raises: NotImplementedError – If the abstract methods are not implemented by a subclass.
######### Examples
>>> from typing import Iterable, List
>>> from espnet2.text.abs_tokenizer import AbsTokenizer
>>> class SimpleTokenizer(AbsTokenizer):
...     def text2tokens(self, line: str) -> List[str]:
...         return line.split()
...
...     def tokens2text(self, tokens: Iterable[str]) -> str:
...         return ' '.join(tokens)
...
>>> tokenizer = SimpleTokenizer()
>>> tokens = tokenizer.text2tokens("Hello world")
>>> text = tokenizer.tokens2text(tokens)
>>> print(tokens)
['Hello', 'world']
>>> print(text)
Hello world
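Because the class derives from ABC and both methods are marked abstract, Python itself enforces the interface at instantiation time. The sketch below illustrates this with a local stand-in for AbsTokenizer, so it runs without ESPnet installed. The stand-in's method bodies, `HalfTokenizer`, and `try_instantiate` are hypothetical names introduced only for this illustration.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class AbsTokenizer(ABC):
    """Local stand-in mirroring the documented interface (assumption, not the ESPnet source)."""

    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError


class HalfTokenizer(AbsTokenizer):
    """Hypothetical subclass that implements only one of the two abstract methods."""

    def text2tokens(self, line: str) -> List[str]:
        return line.split()


def try_instantiate(cls) -> str:
    """Return 'ok' if cls() succeeds, otherwise the name of the exception raised."""
    try:
        cls()
    except Exception as exc:
        return type(exc).__name__
    return "ok"


result_base = try_instantiate(AbsTokenizer)   # the abstract base itself
result_half = try_instantiate(HalfTokenizer)  # subclass missing tokens2text
print(result_base, result_half)  # TypeError TypeError
```

In this sketch the abc machinery rejects both classes with a TypeError before any method runs. A NotImplementedError would surface only if a base-class method body that raises it is invoked explicitly, for example through `super()`.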
abstract text2tokens(line: str) → List[str]
Converts a given line of text into a list of tokens.
This method is intended to be implemented by subclasses of the AbsTokenizer class. It should define the logic for tokenizing the input text, which may involve splitting the text based on whitespace, punctuation, or other criteria, depending on the specific tokenizer implementation.
- Parameters: line (str) – The input line of text to be tokenized.
- Returns: A list of tokens extracted from the input line.
- Return type: List[str]
- Raises: NotImplementedError – If the method is called directly on an instance of AbsTokenizer.
######### Examples
Example usage of a concrete implementation:
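>>> from typing import Iterable, List
>>> from espnet2.text.abs_tokenizer import AbsTokenizer
>>> class SimpleTokenizer(AbsTokenizer):
...     def text2tokens(self, line: str) -> List[str]:
...         return line.split()
...
...     # tokens2text is also abstract, so it must be defined before instantiation.
...     def tokens2text(self, tokens: Iterable[str]) -> str:
...         return ' '.join(tokens)
...
>>> tokenizer = SimpleTokenizer()
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> print(tokens)
['Hello,', 'world!']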
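Whitespace splitting leaves punctuation attached to words. As one possible alternative, the sketch below separates punctuation into its own tokens with a regular expression. `PunctTokenizer` is a hypothetical name, the regex rules are an illustrative choice rather than anything ESPnet prescribes, and the local AbsTokenizer stand-in exists only so the snippet runs without ESPnet installed.

```python
import re
from abc import ABC, abstractmethod
from typing import Iterable, List


class AbsTokenizer(ABC):
    """Local stand-in mirroring the documented interface (assumption, not the ESPnet source)."""

    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError


class PunctTokenizer(AbsTokenizer):
    """Hypothetical tokenizer: words and punctuation marks become separate tokens."""

    def text2tokens(self, line: str) -> List[str]:
        # A run of word characters, or a single non-word, non-space character.
        return re.findall(r"\w+|[^\w\s]", line)

    def tokens2text(self, tokens: Iterable[str]) -> str:
        # Join with spaces, then drop the space in front of each punctuation mark.
        return re.sub(r"\s+([^\w\s])", r"\1", " ".join(tokens))


tokenizer = PunctTokenizer()
tokens = tokenizer.text2tokens("Hello, world!")
text = tokenizer.tokens2text(tokens)
print(tokens)  # ['Hello', ',', 'world', '!']
print(text)    # Hello, world!
```

Re-attaching every punctuation mark to the preceding word is a simplification. Opening quotes or brackets, for instance, would need different handling.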
abstract tokens2text(tokens: Iterable[str]) → str
Converts an iterable of tokens back into a text string.
This method takes an iterable of tokens and concatenates them into a single string. The tokens are typically generated by the text2tokens method and may need to be joined with spaces or other delimiters, depending on the specific tokenizer implementation.
- Parameters: tokens (Iterable[str]) – An iterable containing tokens to be converted back into a text string.
- Returns: The reconstructed text string formed from the provided tokens.
- Return type: str
- Raises: NotImplementedError – If the method is not implemented in a subclass.
######### Examples
>>> tokenizer = MyTokenizer()  # any concrete subclass; the output below depends on its joining rule
>>> tokens = ['Hello', 'world', '!']
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
Hello world!
NOTE
This is an abstract method and must be implemented in a subclass.
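How tokens are joined depends on how they were produced. The sketch below assumes a character-level scheme in which spaces are encoded as an explicit symbol, so `tokens2text` concatenates without a delimiter and maps the symbol back to a space. `CharSketchTokenizer` and its `space_symbol` parameter are hypothetical names for this illustration, and the local AbsTokenizer stand-in exists only so the snippet runs without ESPnet installed.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class AbsTokenizer(ABC):
    """Local stand-in mirroring the documented interface (assumption, not the ESPnet source)."""

    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError


class CharSketchTokenizer(AbsTokenizer):
    """Hypothetical character-level tokenizer with an explicit space symbol."""

    def __init__(self, space_symbol: str = "<space>"):
        self.space_symbol = space_symbol

    def text2tokens(self, line: str) -> List[str]:
        # One token per character; spaces become the space symbol.
        return [self.space_symbol if ch == " " else ch for ch in line]

    def tokens2text(self, tokens: Iterable[str]) -> str:
        # Concatenate with no delimiter; the space symbol turns back into a space.
        return "".join(" " if t == self.space_symbol else t for t in tokens)


tokenizer = CharSketchTokenizer()
tokens = tokenizer.text2tokens("Hi all!")
text = tokenizer.tokens2text(tokens)
print(tokens)  # ['H', 'i', '<space>', 'a', 'l', 'l', '!']
print(text)    # Hi all!
```

With this scheme the round trip reproduces the input exactly, as long as the text does not itself contain the space symbol as a literal string.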