espnet2.text.token_id_converter.TokenIDConverter

class espnet2.text.token_id_converter.TokenIDConverter(token_list: Path | str | Iterable[str], unk_symbol: str = '<unk>')

Bases: object

A class to convert between tokens and their corresponding IDs.

This class converts tokens to their integer IDs and back. The token list can be given either as a path to a token list file (a Path or a str) or as an iterable of tokens. The unk_symbol argument names the token whose ID is used for any token that is not in the list.

token_list_repr

A string representation of the token list, showing the first few tokens and the total vocabulary size.

Type: str

token_list

A list of tokens.

Type: List[str]

token2id

A dictionary mapping tokens to their corresponding integer IDs.

Type: Dict[str, int]

unk_symbol

The symbol used for unknown tokens.

Type: str

unk_id

The integer ID corresponding to the unknown symbol.

Type: int

Parameters:
- token_list (Union[Path, str, Iterable[str]]) – A path to a token list file (given as a Path or str), or an iterable of tokens.
- unk_symbol (str) – The symbol to represent unknown tokens. Defaults to "<unk>".
Raises: RuntimeError – If a duplicate token is found in the token list or if the unknown symbol does not exist in the token list.
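
The two failure conditions above can be illustrated with a small stand-in. This is a sketch under assumptions, not ESPnet's code: `build_token2id` and `fails` are hypothetical helpers that only mirror the documented checks (duplicates are rejected, and the unknown symbol must be present).

```python
from typing import Dict, Iterable


def build_token2id(tokens: Iterable[str], unk_symbol: str = "<unk>") -> Dict[str, int]:
    """Hypothetical helper mirroring the documented checks; not part of ESPnet."""
    token2id: Dict[str, int] = {}
    for idx, token in enumerate(tokens):
        # A repeated token would make the token -> ID mapping ambiguous.
        if token in token2id:
            raise RuntimeError(f"Duplicate token: {token!r}")
        token2id[token] = idx
    # The unknown symbol needs an ID so that unseen tokens have a fallback.
    if unk_symbol not in token2id:
        raise RuntimeError(f"Unknown symbol {unk_symbol!r} is not in the token list")
    return token2id


def fails(tokens: Iterable[str]) -> bool:
    """Return True if building the mapping raises RuntimeError."""
    try:
        build_token2id(tokens)
    except RuntimeError:
        return True
    return False
```

With these helpers, `fails(["hello", "hello", "<unk>"])` and `fails(["hello", "world"])` are both true, while a list of unique tokens that includes `"<unk>"` builds a mapping from each token to its position.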

########### Examples

Using a file containing tokens

converter = TokenIDConverter("path/to/token_list.txt")

Using a list of tokens

converter = TokenIDConverter(["hello", "world", "<unk>"])

Getting the vocabulary size

vocab_size = converter.get_num_vocabulary_size()

Converting IDs to tokens

tokens = converter.ids2tokens(np.array([0, 1, 2]))

Converting tokens to IDs

ids = converter.tokens2ids(["hello", "unknown_token"])

get_num_vocabulary_size() → int

Retrieves the size of the vocabulary, which is the number of unique tokens.

This method returns the total number of tokens stored in the token_list attribute of the TokenIDConverter class. It is useful for understanding the vocabulary size that the converter can work with.

Returns: The number of unique tokens in the vocabulary.
Return type: int

########### Examples

>>> converter = TokenIDConverter(["hello", "world", "<unk>"])
>>> converter.get_num_vocabulary_size()
3

NOTE

This method returns the length of the token_list attribute. Because duplicate tokens are rejected during initialization, this length always equals the number of unique tokens.

ids2tokens(integers: ndarray | Iterable[int]) → List[str]

Converts a sequence of token IDs (integers) back into their corresponding tokens (strings) based on the token list.

This method accepts both NumPy arrays and other iterables of integers. Each ID is used as an index into the token list, so every ID must be a valid index; an out-of-range ID raises an IndexError.

Parameters: integers (Union[np.ndarray, Iterable[int]]) – A 1-dimensional array or iterable containing the integer token IDs to convert to tokens.
Returns: A list of tokens corresponding to the provided integer IDs.
Return type: List[str]
Raises: ValueError – If the input integers is a NumPy array that is not 1-dimensional.

########### Examples

>>> converter = TokenIDConverter(['hello', 'world', '<unk>'])
>>> converter.ids2tokens([0, 1, 2])
['hello', 'world', '<unk>']

>>> converter.ids2tokens(np.array([0, 2]))
['hello', '<unk>']

>>> converter.ids2tokens(np.array([[0, 1]]))  # This will raise ValueError
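
The dimensionality check and lookup described above can be sketched as follows. This is a minimal stand-in under assumptions, not ESPnet's implementation: `ids_to_tokens` is a hypothetical function, and it reads an `ndim` attribute when one is present so that the sketch runs without importing NumPy.

```python
from typing import Iterable, List


def ids_to_tokens(token_list: List[str], integers: Iterable[int]) -> List[str]:
    """Hypothetical stand-in for the documented lookup; not ESPnet's code."""
    # NumPy arrays expose `ndim`; plain lists and tuples do not, so they
    # are treated as 1-dimensional here (an assumption of this sketch).
    ndim = getattr(integers, "ndim", 1)
    if ndim != 1:
        raise ValueError(f"Expected 1-dimensional input, got {ndim} dimensions")
    # Each ID is used as a position in the token list.
    return [token_list[int(i)] for i in integers]


vocab = ["hello", "world", "<unk>"]
decoded = ids_to_tokens(vocab, [0, 2])
```

For a batch stored as a 2-dimensional array, one option consistent with this check is to convert one row at a time.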

tokens2ids(tokens: Iterable[str]) → List[int]

Converts tokens to their corresponding IDs using a predefined token list.

This method retrieves the ID for each token provided in the input iterable. If a token is not found in the token-to-ID mapping, it returns the ID for the unknown symbol.

Parameters: tokens (Iterable[str]) – An iterable of tokens for which to retrieve IDs.
Returns: A list of corresponding IDs for the input tokens. If a token is not found, the ID for the unknown symbol is used.
Return type: List[int]

########### Examples

>>> converter = TokenIDConverter(["hello", "world", "<unk>"])
>>> converter.tokens2ids(["hello", "world", "unknown_token"])
[0, 1, 2]

>>> converter.tokens2ids(["hello", "<unk>"])
[0, 2]

NOTE

The unknown symbol must be part of the initial token list; otherwise, a RuntimeError will be raised during the initialization of the TokenIDConverter class.
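
The fallback to the unknown symbol described for tokens2ids can be sketched with a plain dictionary lookup. This is an illustrative stand-in, not ESPnet's code: `tokens_to_ids` is a hypothetical function, and the `token2id` mapping is built by hand for the example.

```python
from typing import Dict, Iterable, List


def tokens_to_ids(
    token2id: Dict[str, int],
    tokens: Iterable[str],
    unk_symbol: str = "<unk>",
) -> List[int]:
    """Hypothetical stand-in for the documented fallback; not ESPnet's code."""
    # Raises KeyError in this sketch if the unknown symbol is missing.
    unk_id = token2id[unk_symbol]
    # dict.get returns unk_id for any token absent from the mapping.
    return [token2id.get(token, unk_id) for token in tokens]


token2id = {"hello": 0, "world": 1, "<unk>": 2}
ids = tokens_to_ids(token2id, ["hello", "unknown_token", "world"])
```

Here `ids` is `[0, 2, 1]`: the out-of-vocabulary token maps to the ID of `"<unk>"`. As a consequence, converting those IDs back to tokens yields `"<unk>"` in that position, so the original token is not recoverable.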