espnet2.text.hugging_face_token_id_converter.HuggingFaceTokenIDConverter


class espnet2.text.hugging_face_token_id_converter.HuggingFaceTokenIDConverter(model_name_or_path: str)

Bases: object

A converter class for transforming between token IDs and tokens using the Hugging Face Transformers library.

This class provides methods to convert between token IDs and their corresponding tokens as well as to retrieve the vocabulary size of a specified model. It requires the transformers library to be installed.

tokenizer

An instance of AutoTokenizer initialized with the specified model.

Parameters: model_name_or_path (str) – The name or path of the pre-trained model to load the tokenizer from.
Raises: ImportError – If the transformers library is not available.
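
The ImportError described above follows a common optional-dependency pattern: check for the package first and fail with an actionable message. The sketch below illustrates that pattern only. `require_module` is a hypothetical helper written for this example, not part of ESPnet, and the class's actual check may be implemented differently.

```python
import importlib.util


def require_module(name: str) -> None:
    """Raise ImportError with an install hint if top-level module `name` is missing.

    Hypothetical helper for illustration; not an ESPnet function.
    """
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"`{name}` is not available. Please install it via `pip install {name}`."
        )


# A standard-library module is always found, so this passes silently.
require_module("json")

# A missing package produces the kind of error the converter documents.
try:
    require_module("surely_not_an_installed_package")
except ImportError as err:
    message = str(err)
    print(message)
```

Calling such a guard before constructing the converter surfaces a missing `transformers` installation early, rather than in the middle of a data-preparation run.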

########### Examples

>>> converter = HuggingFaceTokenIDConverter('bert-base-uncased')
>>> vocab_size = converter.get_num_vocabulary_size()
>>> token_ids = converter.tokens2ids(['hello', 'world'])
>>> tokens = converter.ids2tokens(token_ids)

####### NOTE The specified model name or path must provide tokenizer files that AutoTokenizer can load.

get_num_vocabulary_size() → int

Returns the size of the vocabulary used by the tokenizer.

This method returns the vocab_size attribute of the tokenizer loaded for the specified model. In Hugging Face tokenizers, vocab_size is the size of the base vocabulary and does not count tokens added after loading.

Returns: The size of the vocabulary.
Return type: int

########### Examples

>>> converter = HuggingFaceTokenIDConverter('bert-base-uncased')
>>> vocab_size = converter.get_num_vocabulary_size()
>>> print(vocab_size)
30522  # Example size for BERT tokenizer

ids2tokens(integers: ndarray | Iterable[int]) → List[str]

Converts a list of token IDs into their corresponding token strings using the Hugging Face tokenizer.

This method is useful for translating numeric representations of tokens back into their readable string format, allowing for better interpretation of model outputs.

Parameters: integers (Union[np.ndarray, Iterable[int]]) – A collection of token IDs (integers) to be converted into tokens. This can be a NumPy array or any iterable containing integers.
Returns: A list of tokens corresponding to the input token IDs.
Return type: List[str]

########### Examples

>>> converter = HuggingFaceTokenIDConverter('bert-base-uncased')
>>> token_ids = [101, 7592, 102]
>>> tokens = converter.ids2tokens(token_ids)
>>> print(tokens)
['[CLS]', 'hello', '[SEP]']

####### NOTE Ensure that the input integers are valid token IDs for the specified tokenizer. Invalid IDs may result in unexpected tokens or errors.
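
One way to act on the note above is to range-check IDs against the vocabulary size before converting them. The following is a minimal sketch, assuming no tokens were added to the tokenizer beyond its base vocabulary (added tokens can have IDs at or above `vocab_size`). `check_token_ids` is a hypothetical helper, not an ESPnet function, and the vocabulary size is hard-coded here so the sketch runs without downloading a model.

```python
from typing import Iterable, List


def check_token_ids(integers: Iterable[int], vocab_size: int) -> List[int]:
    """Return the IDs as a list of Python ints, or raise ValueError if any is out of range.

    Hypothetical helper for illustration; not part of ESPnet.
    """
    ids = [int(i) for i in integers]  # also normalizes NumPy integer scalars
    bad = [i for i in ids if not 0 <= i < vocab_size]
    if bad:
        raise ValueError(f"token IDs outside [0, {vocab_size}): {bad}")
    return ids


# Example value for bert-base-uncased; in practice, take it from
# converter.get_num_vocabulary_size().
vocab_size = 30522

checked = check_token_ids([101, 7592, 102], vocab_size)
print(checked)  # [101, 7592, 102]
```

The checked list can then be passed to `ids2tokens`. Failing early with the offending IDs is usually easier to debug than an unexpected token appearing downstream.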

tokens2ids(tokens: Iterable[str]) → List[int]

Converts tokens to their corresponding IDs using a Hugging Face tokenizer.

This method takes an iterable of tokens (strings) and returns a list of integers representing the corresponding token IDs as defined by the Hugging Face tokenizer.

Parameters: tokens (Iterable[str]) – An iterable containing the tokens to be converted to IDs.
Returns: A list of integers representing the token IDs corresponding to the provided tokens.
Return type: List[int]

########### Examples

>>> converter = HuggingFaceTokenIDConverter("bert-base-uncased")
>>> tokens = ["hello", "world"]
>>> ids = converter.tokens2ids(tokens)
>>> print(ids)
[7592, 2088]  # IDs depend on the model's vocabulary

####### NOTE Ensure that the tokens are valid for the tokenizer used. Invalid tokens may result in unexpected behavior or errors.
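
To make the conversion contract concrete without downloading a model, here is a toy stand-in that exposes the same three method names over a small in-memory vocabulary. `ToyTokenIDConverter` is hypothetical and is not how ESPnet or Hugging Face implement the lookup. Its fallback to an unknown-token ID is an assumption modeled on how many tokenizers treat out-of-vocabulary tokens, so check the behavior of the tokenizer you actually load.

```python
from typing import Iterable, List


class ToyTokenIDConverter:
    """Hypothetical list-and-dict stand-in mirroring the converter's three methods."""

    def __init__(self, vocab: List[str], unk_token: str = "[UNK]"):
        if unk_token not in vocab:
            raise ValueError(f"{unk_token!r} must be in the vocabulary")
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.unk_id = self.token_to_id[unk_token]

    def get_num_vocabulary_size(self) -> int:
        return len(self.id_to_token)

    def ids2tokens(self, integers: Iterable[int]) -> List[str]:
        # Raises IndexError for IDs at or beyond the vocabulary size.
        return [self.id_to_token[int(i)] for i in integers]

    def tokens2ids(self, tokens: Iterable[str]) -> List[int]:
        # Assumption: unknown tokens fall back to the unknown-token ID.
        return [self.token_to_id.get(tok, self.unk_id) for tok in tokens]


converter = ToyTokenIDConverter(["[UNK]", "hello", "world"])
ids = converter.tokens2ids(["hello", "world", "espnet"])
tokens = converter.ids2tokens(ids)
print(ids)     # [1, 2, 0]
print(tokens)  # ['hello', 'world', '[UNK]']
```

The round trip is lossless only for tokens that are in the vocabulary. In this sketch, anything else comes back as the unknown token.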