espnet2.text.whisper_tokenizer.OpenAIWhisperTokenizer


class espnet2.text.whisper_tokenizer.OpenAIWhisperTokenizer(model_type: str, language: str = 'en', task: str = 'transcribe', sot: bool = False, speaker_change_symbol: str = '<sc>', added_tokens_txt: str | None = None)

Bases: AbsTokenizer

A tokenizer for the OpenAI Whisper model.

This class tokenizes text for the transcription and translation tasks of the OpenAI Whisper model. It supports multiple languages and can register additional tokens read from a text file.

model

The model type of the Whisper tokenizer.

Type: str

language

The language code for the tokenizer.

Type: str

task

The task to perform, either ‘transcribe’ or ‘translate’.

Type: str

tokenizer

The initialized tokenizer from the Whisper library.

Parameters:
- model_type (str) – The type of the Whisper model to use. Should be either “whisper_en” or “whisper_multilingual”.
- language (str) – The language code to use. Defaults to “en”.
- task (str) – The task to perform. Can be either “transcribe” or “translate”. Defaults to “transcribe”.
- sot (bool) – A flag indicating whether to include start-of-token symbols. Defaults to False.
- speaker_change_symbol (str) – The symbol to use for speaker changes. Defaults to “<sc>”.
- added_tokens_txt (Optional[str]) – A path to a text file containing additional tokens to be added to the tokenizer. Defaults to None.
Raises: ValueError – If the specified language, task, or model type is unsupported for the Whisper model.
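The argument checks described above can be sketched as a standalone function. This is a minimal illustration, not the ESPnet implementation: `check_tokenizer_args` is a hypothetical helper, and `EXAMPLE_LANGUAGES` is a small assumed subset of language codes chosen only for the example.

```python
# Hypothetical helper for illustration; not part of ESPnet or Whisper.
SUPPORTED_MODEL_TYPES = ("whisper_en", "whisper_multilingual")
SUPPORTED_TASKS = ("transcribe", "translate")
EXAMPLE_LANGUAGES = {"en", "fr", "de", "ja"}  # assumption: illustrative subset only


def check_tokenizer_args(
    model_type: str, language: str = "en", task: str = "transcribe"
) -> None:
    """Raise ValueError for an unsupported model type, language, or task."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"tokenizer unsupported: {model_type}")
    if language not in EXAMPLE_LANGUAGES:
        raise ValueError(f"language: {language} unsupported for Whisper model")
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"task: {task} unsupported for Whisper model")


# A valid combination passes silently.
check_tokenizer_args("whisper_multilingual", language="fr", task="transcribe")
```

Failing early on a bad argument keeps the error close to the call site instead of surfacing later during tokenization.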

Examples:

>>> tokenizer = OpenAIWhisperTokenizer(
...     model_type="whisper_multilingual",
...     language="fr",
...     task="transcribe"
... )
>>> tokens = tokenizer.text2tokens("Bonjour, comment ça va?")
>>> text = tokenizer.tokens2text(tokens)

Note: Ensure that the Whisper library is properly installed. If the library is not found, an error message is printed and an exception is raised.
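The print-then-raise behavior mentioned in the note can be pictured with a guarded import. The sketch below is an assumption about the general pattern, not the ESPnet source: `import_or_explain` and the hint text are hypothetical.

```python
import importlib


def import_or_explain(module_name: str, install_hint: str):
    """Import a module; on failure, print an install hint and re-raise."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        print(f"Error: {module_name} is not properly installed.")
        print(install_hint)
        raise


# Example call (commented out so the sketch runs without Whisper installed):
# whisper = import_or_explain("whisper", "Install the openai-whisper package first.")
```

Re-raising the original exception preserves the traceback while still giving the user a readable hint.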

text2tokens(line: str) → List[str]

Convert a text line into a list of tokens.

This method utilizes the underlying tokenizer to tokenize the provided text line. It does not add any special tokens during the tokenization process.

Parameters: line (str) – The input text line to be tokenized.
Returns: A list of tokens generated from the input text line.
Return type: List[str]

Examples:

>>> tokenizer = OpenAIWhisperTokenizer(model_type="whisper_en")
>>> tokens = tokenizer.text2tokens("Hello, how are you?")
>>> print(tokens)  # illustrative only; real output consists of Whisper subword tokens
['Hello', ',', 'how', 'are', 'you', '?']

Note: This method assumes that the tokenizer has been properly initialized and is ready for use.
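What "does not add any special tokens" means can be shown with a toy stand-in. The function below is not Whisper's subword tokenizer: `toy_tokenize` is hypothetical and splits on words and punctuation, and the two marker strings are modeled on Whisper's special tokens purely for illustration.

```python
import re
from typing import List

# Markers modeled on Whisper's special tokens; used here only for illustration.
SOT, EOT = "<|startoftranscript|>", "<|endoftext|>"


def toy_tokenize(line: str, add_special_tokens: bool = False) -> List[str]:
    """Split a line into words and punctuation marks (a toy stand-in for BPE)."""
    tokens = re.findall(r"\w+|[^\w\s]", line)
    if add_special_tokens:
        # Wrap the content tokens in start/end markers.
        return [SOT] + tokens + [EOT]
    return tokens


plain = toy_tokenize("Hello, how are you?")
wrapped = toy_tokenize("Hello, how are you?", add_special_tokens=True)
```

Here `plain` contains only content tokens, which mirrors the behavior described for text2tokens, while `wrapped` shows what the same line would look like with markers attached.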

tokens2text(tokens: Iterable[str]) → str

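Convert a sequence of tokens back into a text string.

This method is the counterpart of text2tokens: it uses the underlying tokenizer to join the given tokens into text. The attributes, parameters, and exceptions listed below describe the OpenAIWhisperTokenizer class as a whole rather than this method.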

model

The type of Whisper model being used.

Type: str

language

The language for tokenization, mapped from a code.

Type: str

task

The task for which the model is used (transcribe/translate).

Type: str

tokenizer

The actual tokenizer instance used for token conversion.

Parameters:
- model_type (str) – The model type, either “whisper_en” or “whisper_multilingual”.
- language (str, optional) – The language code for tokenization. Defaults to “en”.
- task (str, optional) – The task to perform with the model. Can be “transcribe” or “translate”. Defaults to “transcribe”.
- sot (bool, optional) – Whether to include start of transcription token. Defaults to False.
- speaker_change_symbol (str, optional) – Symbol to denote speaker changes. Defaults to “<sc>”.
- added_tokens_txt (Optional[str], optional) – Path to a text file containing additional tokens to add. Defaults to None.
Raises: ValueError – If an unsupported language or task is specified or if the tokenizer model type is unsupported.
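As a sketch of how a file passed as added_tokens_txt might be read, the helper below assumes one token per line. That file layout, the `read_added_tokens` name, and the `<noise>` token are assumptions made for this example, not details confirmed by the documentation above.

```python
import os
import tempfile
from typing import List, Optional


def read_added_tokens(added_tokens_txt: Optional[str]) -> List[str]:
    """Return the tokens listed in the file, one per line; empty list for None."""
    if added_tokens_txt is None:
        return []
    with open(added_tokens_txt, encoding="utf-8") as f:
        # Drop the trailing newline of each line and skip blank lines.
        return [line.rstrip("\n") for line in f if line.strip()]


# Usage: write a small token file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "added_tokens.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<sc>\n<noise>\n")
    extra_tokens = read_added_tokens(path)
```

The resulting list could then be handed to whatever token-registration call the underlying tokenizer provides.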

Examples:

>>> tokenizer = OpenAIWhisperTokenizer(model_type="whisper_en")
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> text = tokenizer.tokens2text(tokens)
>>> print(text)  # Output: "Hello, world!"

Note: Make sure the Whisper package is installed properly. If it is not, an error is raised during initialization.
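The round trip between text2tokens and tokens2text can be illustrated without Whisper installed. The toy functions below are hypothetical and split on spaces only. They borrow the GPT-2-style convention of marking a leading space with "Ġ", which is an assumption of this sketch, not a statement about the exact tokens Whisper returns.

```python
from typing import Iterable, List

SPACE_MARK = "Ġ"  # GPT-2-style marker for a leading space (assumption for this sketch)


def toy_text2tokens(line: str) -> List[str]:
    """Split on single spaces, recording each space as a marker on the next token."""
    tokens = []
    for i, word in enumerate(line.split(" ")):
        tokens.append(word if i == 0 else SPACE_MARK + word)
    return tokens


def toy_tokens2text(tokens: Iterable[str]) -> str:
    """Concatenate tokens and turn space markers back into spaces."""
    return "".join(tokens).replace(SPACE_MARK, " ")


tokens = toy_text2tokens("Hello, world!")
text = toy_tokens2text(tokens)
```

Because whitespace is carried inside the tokens, concatenating them restores the original string, including repeated spaces. The sketch does not handle input that already contains the marker character.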