espnet2.text.token_id_converter.TokenIDConverter
class espnet2.text.token_id_converter.TokenIDConverter(token_list: Path | str | Iterable[str], unk_symbol: str = '<unk>')
Bases: object
A class to convert between tokens and their corresponding IDs.
This class converts tokens to their integer IDs and back. The token list can be given as a path to a token file (a Path or a str) or as an iterable of tokens. An unknown symbol, which must itself appear in the token list, stands in for tokens that are not in the list.
token_list_repr
A string representation of the token list, showing the first few tokens and the total vocabulary size.
- Type: str
token_list
A list of tokens.
- Type: List[str]
token2id
A dictionary mapping tokens to their corresponding integer IDs.
- Type: Dict[str, int]
unk_symbol
The symbol used for unknown tokens.
- Type: str
unk_id
The integer ID corresponding to the unknown symbol.
- Type: int
Parameters:
- token_list (Union[Path, str, Iterable[str]]) – The token list, given as a path to a token file (Path or str) or as an iterable of tokens.
- unk_symbol (str) – The symbol to represent unknown tokens. Defaults to "<unk>".
Raises: RuntimeError – If a duplicate token is found in the token list or if the unknown symbol does not exist in the token list.
########### Examples
Importing the class and NumPy
import numpy as np
from espnet2.text.token_id_converter import TokenIDConverter
Using a file containing tokens
converter = TokenIDConverter("path/to/token_list.txt")
Using a list of tokens
converter = TokenIDConverter(["hello", "world", "<unk>"])
Getting the vocabulary size
vocab_size = converter.get_num_vocabulary_size()
Converting IDs to tokens
tokens = converter.ids2tokens(np.array([0, 1, 2]))
Converting tokens to IDs
ids = converter.tokens2ids(["hello", "unknown_token"])
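The two failure modes listed under Raises can be illustrated without installing ESPnet. The following is a minimal sketch, not the library's implementation: `build_token2id` is a hypothetical helper written for this example, and the wording of its error messages is an assumption. It only mirrors the documented behavior, where each token's ID is its position in the list.

```python
from typing import Dict, Iterable, List


def build_token2id(
    token_list: Iterable[str], unk_symbol: str = "<unk>"
) -> Dict[str, int]:
    """Hypothetical helper: map each token to its position, validating as documented."""
    tokens: List[str] = list(token_list)
    token2id: Dict[str, int] = {}
    for index, token in enumerate(tokens):
        if token in token2id:
            # Documented failure mode 1: a token appears more than once.
            raise RuntimeError(f'Symbol "{token}" is duplicated')
        token2id[token] = index
    if unk_symbol not in token2id:
        # Documented failure mode 2: the unknown symbol is missing from the list.
        raise RuntimeError(f"Unknown symbol '{unk_symbol}' is not in the token list")
    return token2id


mapping = build_token2id(["hello", "world", "<unk>"])
print(mapping)  # {'hello': 0, 'world': 1, '<unk>': 2}

# Both of these lists violate one of the documented constraints.
for bad_list in (["hello", "hello", "<unk>"], ["hello", "world"]):
    try:
        build_token2id(bad_list)
    except RuntimeError as err:
        print("rejected:", err)
```

Validating once at construction time means later lookups can rely on `token2id` and `unk_id` being well defined.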
get_num_vocabulary_size() → int
Retrieves the size of the vocabulary, which is the number of unique tokens.
This method returns the total number of tokens stored in the token_list attribute of the TokenIDConverter class. It is useful for understanding the vocabulary size that the converter can work with.
- Returns: The number of unique tokens in the vocabulary.
- Return type: int
########### Examples
>>> converter = TokenIDConverter(["hello", "world", '<unk>'])
>>> converter.get_num_vocabulary_size()
3
NOTE
This method counts the tokens as they are stored in the token_list attribute. Because duplicates are rejected during initialization, this count always equals the number of unique tokens.
ids2tokens(integers: ndarray | Iterable[int]) → List[str]
Converts a sequence of token IDs (integers) back into their corresponding tokens (strings) by looking each ID up in the token list.
This method accepts both NumPy arrays and other iterables of integers. Each ID is used as an index into token_list, so an ID beyond the vocabulary size is neither skipped nor replaced by the unknown symbol; it raises an IndexError.
- Parameters: integers (Union[np.ndarray, Iterable[int]]) – A 1-dimensional array or iterable containing the integer token IDs to convert to tokens.
- Returns: A list of tokens corresponding to the provided integer IDs.
- Return type: List[str]
- Raises: ValueError – If the input integers is a NumPy array that is not 1-dimensional.
########### Examples
>>> converter = TokenIDConverter(['hello', 'world', '<unk>'])
>>> converter.ids2tokens([0, 1, 2])
['hello', 'world', '<unk>']
>>> converter.ids2tokens(np.array([0, 2]))
['hello', '<unk>']
>>> converter.ids2tokens(np.array([[0, 1]])) # This will raise ValueError
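Since the input has to be 1-dimensional, a batch of ID sequences would need to be converted one row at a time. The sketch below shows that pattern with plain Python lists. `ids_to_tokens` is a hypothetical stand-in for `ids2tokens` written for this example, and it assumes every ID is a valid position in the token list.

```python
from typing import Iterable, List

TOKEN_LIST = ["hello", "world", "<unk>"]  # same vocabulary as the examples above


def ids_to_tokens(integers: Iterable[int]) -> List[str]:
    """Hypothetical stand-in for ids2tokens: look each ID up by position."""
    return [TOKEN_LIST[i] for i in integers]


# A batch is a sequence of 1-D ID sequences, so it is converted row by row.
batch = [[0, 1], [2, 0, 1]]
decoded = [ids_to_tokens(row) for row in batch]
print(decoded)  # [['hello', 'world'], ['<unk>', 'hello', 'world']]
```

With the real converter, the same loop would call `converter.ids2tokens(row)` for each row of a 2-dimensional array.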
tokens2ids(tokens: Iterable[str]) → List[int]
Converts tokens to their corresponding IDs using a predefined token list.
This method retrieves the ID for each token provided in the input iterable. If a token is not found in the token-to-ID mapping, it returns the ID for the unknown symbol.
- Parameters: tokens (Iterable[str]) – An iterable of tokens for which to retrieve IDs.
- Returns: A list of corresponding IDs for the input tokens. If a token is not found, the ID for the unknown symbol is used.
- Return type: List[int]
########### Examples
>>> converter = TokenIDConverter(["hello", "world", '<unk>'])
>>> converter.tokens2ids(["hello", "world", "unknown_token"])
[0, 1, 2]
>>> converter.tokens2ids(["hello", '<unk>'])
[0, 2]
NOTE
The unknown symbol must be part of the initial token list; otherwise, a RuntimeError will be raised during the initialization of the TokenIDConverter class.
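One consequence of the unknown-symbol fallback is that converting tokens to IDs and back is lossy for out-of-vocabulary tokens. The following sketch reproduces the two lookups with a plain list and dictionary rather than the class itself; the variable names and the sample word "there" are assumptions made for illustration.

```python
from typing import Dict, List

token_list: List[str] = ["hello", "world", "<unk>"]
token2id: Dict[str, int] = {token: i for i, token in enumerate(token_list)}
unk_id = token2id["<unk>"]

words = ["hello", "there", "world"]  # "there" is not in the vocabulary

# Tokens to IDs: missing tokens fall back to the ID of the unknown symbol.
ids = [token2id.get(word, unk_id) for word in words]
# IDs back to tokens: the original out-of-vocabulary word cannot be recovered.
restored = [token_list[i] for i in ids]

print(ids)       # [0, 2, 1]
print(restored)  # ['hello', '<unk>', 'world']
```

If the original text must be recoverable, the unmapped tokens need to be tracked separately before conversion.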