espnet2.text.hugging_face_tokenizer.HuggingFaceTokenizer
class espnet2.text.hugging_face_tokenizer.HuggingFaceTokenizer(model: Path | str)
Bases: AbsTokenizer
HuggingFaceTokenizer is a tokenizer that utilizes Hugging Face’s Transformers
library to tokenize and detokenize text.
This class is a subclass of AbsTokenizer and is designed to work with various Hugging Face models for natural language processing tasks. It builds the tokenizer lazily to avoid pickling issues when using multiprocessing.
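The lazy build can be illustrated with a minimal sketch (an approximation, not the actual ESPnet source), assuming the Hugging Face tokenizer is constructed on first use so the object stays cheap to pickle before that point:

```python
from pathlib import Path
from typing import List, Union


class LazyTokenizerSketch:
    """Minimal sketch of the lazy-build pattern (illustrative only)."""

    def __init__(self, model: Union[Path, str]):
        self.model = str(model)
        self.tokenizer = None  # nothing heavy is created here, so pickling stays simple

    def _build_tokenizer(self) -> None:
        # Build the Hugging Face tokenizer only on first use.
        if self.tokenizer is None:
            from transformers import AutoTokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(self.model)

    def text2tokens(self, line: str) -> List[str]:
        self._build_tokenizer()
        return self.tokenizer.tokenize(line)
```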
model
The model name or path for the Hugging Face tokenizer.
- Type: str
tokenizer
The Hugging Face tokenizer instance, lazily initialized.
- Type: AutoTokenizer
- Parameters: model (Union[Path, str]) – The model name or path to load the tokenizer.
- Raises: ImportError – If the transformers library is not available.
######### Examples
>>> tokenizer = HuggingFaceTokenizer("bert-base-uncased")
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> print(tokens)
['hello', ',', 'world', '!']
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
"Hello, world!"
NOTE
Ensure that the transformers library is installed. You can install it via `pip install transformers` or by following the installation steps for espnet.
text2tokens(line: str) → List[str]
Convert a given text line into a list of tokens using the Hugging Face tokenizer.
This method initializes the tokenizer if it has not been built yet and then tokenizes the input text.
- Parameters: line (str) – The input text line to be tokenized.
- Returns: A list of tokens extracted from the input text.
- Return type: List[str]
######### Examples
>>> tokenizer = HuggingFaceTokenizer('bert-base-uncased')
>>> tokens = tokenizer.text2tokens("Hello, how are you?")
>>> print(tokens)
['hello', ',', 'how', 'are', 'you', '?']
- Raises: ValueError – If the input line is empty or None.
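For reference, the same tokenization can be reproduced directly with the transformers API; a minimal sketch, assuming transformers is installed and the model name matches the one passed to HuggingFaceTokenizer:

```python
from transformers import AutoTokenizer

# Illustrative only: equivalent Hugging Face call underlying text2tokens.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(hf_tokenizer.tokenize("Hello, how are you?"))
# ['hello', ',', 'how', 'are', 'you', '?']
```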
tokens2text(tokens: Iterable[str]) → str
Convert a list of tokens back into a text string using the Hugging Face tokenizer.
This method first ensures that the tokenizer is built and then converts the provided tokens into their corresponding text. It uses Hugging Face's batch_decode method to handle the conversion.
- Parameters: tokens (Iterable[str]) – An iterable collection of tokens to be converted back into a text string.
- Returns: The reconstructed text string from the provided tokens.
- Return type: str
######### Examples
>>> tokenizer = HuggingFaceTokenizer("bert-base-uncased")
>>> tokens = ["hello", "world"]
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
"hello world"
NOTE
The method skips special tokens during decoding to ensure that the output text is clean and free of any unnecessary characters.
- Raises: ValueError – If the input tokens are invalid or cannot be decoded.
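As a rough sketch of the decode path described above (the exact calls are an assumption, not a copy of the ESPnet source), tokens are mapped back to ids and decoded with batch_decode while skipping special tokens:

```python
from transformers import AutoTokenizer

# Sketch of tokens -> ids -> text; illustrative, not the exact ESPnet code.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = ["hello", ",", "world", "!"]
ids = hf_tokenizer.convert_tokens_to_ids(tokens)
text = hf_tokenizer.batch_decode([ids], skip_special_tokens=True)[0]
print(text)  # hello, world!
```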