espnet2.text.hugging_face_tokenizer.HuggingFaceTokenizer
class espnet2.text.hugging_face_tokenizer.HuggingFaceTokenizer(model: Path | str)
Bases: AbsTokenizer
HuggingFaceTokenizer is a tokenizer that utilizes Hugging Face’s Transformers
library to tokenize and detokenize text.
This class is a subclass of AbsTokenizer and is designed to work with various Hugging Face models for natural language processing tasks. It builds the tokenizer lazily to avoid pickling issues when using multiprocessing.
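The lazy build can be illustrated with a minimal sketch (an approximation, not the actual ESPnet source), assuming the Hugging Face tokenizer is constructed on first use so the object stays cheap to pickle before that point:

```python
from pathlib import Path
from typing import List, Union


class LazyTokenizerSketch:
    """Minimal sketch of the lazy-build pattern (illustrative only)."""

    def __init__(self, model: Union[Path, str]):
        self.model = str(model)
        self.tokenizer = None  # nothing heavy is created here, so pickling stays simple

    def _build_tokenizer(self) -> None:
        # Build the Hugging Face tokenizer only on first use.
        if self.tokenizer is None:
            from transformers import AutoTokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(self.model)

    def text2tokens(self, line: str) -> List[str]:
        self._build_tokenizer()
        return self.tokenizer.tokenize(line)
```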
model
The model name or path for the Hugging Face tokenizer.
- Type: str
tokenizer
The Hugging Face tokenizer instance, lazily initialized.
- Type: AutoTokenizer
- Parameters: model (Union[Path, str]) – The model name or path to load the tokenizer.
- Raises: ImportError – If the transformers library is not available.
######### Examples
>>> tokenizer = HuggingFaceTokenizer("bert-base-uncased")
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> print(tokens)
['hello', ',', 'world', '!']
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
"Hello, world!"
NOTE
Ensure that the transformers library is installed. You can install it via `pip install transformers` or by following the installation steps for espnet.
text2tokens(line: str) → List[str]
Convert a given text line into a list of tokens using the Hugging Face tokenizer.
This method initializes the tokenizer if it has not been built yet and then tokenizes the input text.
- Parameters: line (str) – The input text line to be tokenized.
- Returns: A list of tokens extracted from the input text.
- Return type: List[str]
######### Examples
>>> tokenizer = HuggingFaceTokenizer('bert-base-uncased')
>>> tokens = tokenizer.text2tokens("Hello, how are you?")
>>> print(tokens)
['hello', ',', 'how', 'are', 'you', '?']
- Raises: ValueError – If the input line is empty or None.
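For reference, the same tokenization can be reproduced directly with the transformers API; a minimal sketch, assuming transformers is installed and the model name matches the one passed to HuggingFaceTokenizer:

```python
from transformers import AutoTokenizer

# Illustrative only: equivalent Hugging Face call underlying text2tokens.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(hf_tokenizer.tokenize("Hello, how are you?"))
# ['hello', ',', 'how', 'are', 'you', '?']
```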
tokens2text(tokens: Iterable[str]) → str
Convert a list of tokens back into a text string using the Hugging Face tokenizer.
This method first ensures that the tokenizer is built and then converts the provided tokens into their corresponding text. It uses Hugging Face's batch_decode method to handle the conversion.
- Parameters: tokens (Iterable[str]) – An iterable collection of tokens to be converted back into a text string.
- Returns: The reconstructed text string from the provided tokens.
- Return type: str
######### Examples
>>> tokenizer = HuggingFaceTokenizer("bert-base-uncased")
>>> tokens = ["hello", "world"]
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
"hello world"
NOTE
The method skips special tokens during decoding to ensure that the output text is clean and free of any unnecessary characters.
- Raises: ValueError – If the input tokens are invalid or cannot be decoded.
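As a rough sketch of the decode path described above (the exact calls are an assumption, not a copy of the ESPnet source), tokens are mapped back to ids and decoded with batch_decode while skipping special tokens:

```python
from transformers import AutoTokenizer

# Sketch of tokens -> ids -> text; illustrative, not the exact ESPnet code.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = ["hello", ",", "world", "!"]
ids = hf_tokenizer.convert_tokens_to_ids(tokens)
text = hf_tokenizer.batch_decode([ids], skip_special_tokens=True)[0]
print(text)  # hello, world!
```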