espnet2.text.sentencepiece_tokenizer.SentencepiecesTokenizer
class espnet2.text.sentencepiece_tokenizer.SentencepiecesTokenizer(model: Path | str, encode_kwargs: Dict = {})
Bases: AbsTokenizer
Tokenizer that utilizes SentencePiece for tokenization and detokenization.
This class inherits from AbsTokenizer and provides methods to convert text to tokens and vice versa using the SentencePiece model specified during initialization. It lazily loads the SentencePiece processor to avoid issues with pickling, which can occur when using multiprocessing.
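The lazy-loading behaviour described above can be sketched as follows. This is a simplified illustration, not the actual ESPnet implementation: a placeholder build step stands in for loading the real `SentencePieceProcessor`, which is the object the actual class defers building because it cannot be pickled.

```python
import pickle


class LazyTokenizerSketch:
    """Illustrates the lazy-build pattern: the heavyweight processor is
    excluded from pickling and rebuilt on first use in each process."""

    def __init__(self, model: str):
        self.model = model
        self.sp = None  # built lazily; the real processor is not picklable

    def _build(self):
        if self.sp is None:
            # In the real class this loads the SentencePiece model from disk.
            self.sp = f"processor-for-{self.model}"

    def __getstate__(self):
        # Drop the processor so the object can cross process boundaries.
        state = self.__dict__.copy()
        state["sp"] = None
        return state

    def text2tokens(self, line: str):
        self._build()
        return line.split()  # stand-in for EncodeAsPieces


tok = LazyTokenizerSketch("model.model")
tok.text2tokens("hello world")            # forces the processor to be built
clone = pickle.loads(pickle.dumps(tok))   # processor dropped during pickling
print(clone.sp)                           # None until the next text2tokens call
```

Because `__getstate__` strips the processor, the pickled copy rebuilds it on demand in the receiving process, which is what makes the tokenizer safe to pass to multiprocessing workers.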
model
The path to the SentencePiece model file.
- Type: str
encode_kwargs
Additional keyword arguments for encoding.
Type: Dict
Parameters:
- model (Union[Path, str]) – The path to the SentencePiece model file.
- encode_kwargs (Dict, optional) – Additional keyword arguments for the EncodeAsPieces method. Defaults to an empty dictionary.
######### Examples
>>> tokenizer = SentencepiecesTokenizer("path/to/model.model")
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> print(tokens)
['▁Hello', ',', '▁world', '!']
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
Hello, world!
- Raises: ValueError – If the model file does not exist or cannot be loaded.
####### NOTE The SentencePiece model must be trained and available at the specified path. Ensure that the model is compatible with the expected tokenization strategy.
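To show how `encode_kwargs` reaches the encoder, here is a minimal sketch with a stub in place of the real processor. The `enable_sampling` and `alpha` options shown are real SentencePiece subword-sampling parameters, but whether your installed version's `EncodeAsPieces` accepts them should be checked against the `sentencepiece` documentation:

```python
class StubProcessor:
    """Stands in for sentencepiece.SentencePieceProcessor."""

    def EncodeAsPieces(self, line, **kwargs):
        self.last_kwargs = kwargs  # record what was forwarded
        return line.split()        # fake segmentation


def text2tokens(sp, line, encode_kwargs):
    # Mirrors SentencepiecesTokenizer.text2tokens: the stored
    # encode_kwargs are forwarded unchanged to the processor.
    return sp.EncodeAsPieces(line, **encode_kwargs)


sp = StubProcessor()
tokens = text2tokens(sp, "hello world", {"enable_sampling": True, "alpha": 0.1})
print(tokens)          # ['hello', 'world']
print(sp.last_kwargs)  # {'enable_sampling': True, 'alpha': 0.1}
```

The key point is that the dictionary passed at construction time is splatted into every encode call, so per-call behaviour such as sampling can be configured once on the tokenizer.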
text2tokens(line: str) → List[str]
Converts a given text line into a list of tokens using the SentencePiece model.
This method uses the SentencePieceProcessor to encode the input text line into a sequence of tokens (pieces). The processor is loaded lazily on the first call, so no explicit setup is required.
- Parameters:line (str) – The input text line that needs to be tokenized.
- Returns: A list of tokens (pieces) generated from the input text line.
- Return type: List[str]
######### Examples
>>> tokenizer = SentencepiecesTokenizer(model="path/to/model")
>>> tokens = tokenizer.text2tokens("This is an example sentence.")
>>> print(tokens)  # output depends on the trained model
['▁This', '▁is', '▁an', '▁example', '▁sentence', '.']
####### NOTE Ensure that the SentencePiece model file exists and is accessible at the specified path during initialization of the tokenizer.
- Raises: Exception – If the SentencePiece model cannot be loaded.
tokens2text(tokens: Iterable[str]) → str
Converts a sequence of tokens back into text using the SentencePiece model.
This method requires that the SentencePieceProcessor is built, which is done lazily when this method is called. It takes an iterable of tokens and decodes them into a single string.
- Parameters: tokens (Iterable[str]) – An iterable containing the tokens to be decoded.
- Returns: The decoded text corresponding to the input tokens.
- Return type: str
######### Examples
>>> tokenizer = SentencepiecesTokenizer("model_file.model")
>>> tokens = ["▁Hello", "▁world", "!"]
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)
"Hello world!"
####### NOTE The SentencePiece processor is built lazily on the first call; the model file only needs to be available at the path given during initialization.
- Raises: ValueError – If the input tokens are invalid or cannot be decoded.