espnet2.text.whisper_tokenizer.OpenAIWhisperTokenizer
class espnet2.text.whisper_tokenizer.OpenAIWhisperTokenizer(model_type: str, language: str = 'en', task: str = 'transcribe', sot: bool = False, speaker_change_symbol: str = '<sc>', added_tokens_txt: str | None = None)
Bases: AbsTokenizer
A tokenizer for the OpenAI Whisper model.
This class handles the tokenization process for both transcribing and translating text using the OpenAI Whisper model. It supports various languages and can utilize additional tokens if specified.
model
The model type of the Whisper tokenizer.
- Type: str
language
The language code for the tokenizer.
- Type: str
task
The task to perform, either ‘transcribe’ or ‘translate’.
- Type: str
tokenizer
The initialized tokenizer from the Whisper library.
- Parameters:
- model_type (str) – The type of the Whisper model to use. Should be either “whisper_en” or “whisper_multilingual”.
- language (str) – The language code to use. Defaults to “en”.
- task (str) – The task to perform. Can be either “transcribe” or “translate”. Defaults to “transcribe”.
- sot (bool) – Whether to include the start-of-transcript (sot) token. Defaults to False.
- speaker_change_symbol (str) – The symbol to use for speaker changes. Defaults to “<sc>”.
- added_tokens_txt (Optional[str]) – A path to a text file containing additional tokens to be added to the tokenizer. Defaults to None.
- Raises: ValueError – If the specified language, task, or model type is unsupported for the Whisper model.
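The argument constraints listed above can be checked before the tokenizer is constructed. The following is a minimal sketch, not ESPnet's implementation: `validate_whisper_args` is a hypothetical helper, and it covers only the model types and tasks documented here. The set of supported language codes is defined by the Whisper library and is not reproduced.

```python
# Hypothetical pre-flight check mirroring the documented constraints.
# Illustrative only; this is not the code ESPnet runs internally.
SUPPORTED_MODEL_TYPES = ("whisper_en", "whisper_multilingual")
SUPPORTED_TASKS = ("transcribe", "translate")


def validate_whisper_args(model_type: str, task: str = "transcribe") -> None:
    """Raise ValueError if an argument falls outside the documented values."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task!r}")


# Documented values pass silently; anything else raises ValueError.
validate_whisper_args("whisper_multilingual", task="translate")
```

Failing early this way gives a clearer message than an error raised partway through tokenizer construction.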
######### Examples
>>> tokenizer = OpenAIWhisperTokenizer(
... model_type="whisper_multilingual",
... language="fr",
... task="transcribe"
... )
>>> tokens = tokenizer.text2tokens("Bonjour, comment ça va?")
>>> text = tokenizer.tokens2text(tokens)
####### NOTE Ensure that the Whisper library is properly installed. If the library is not found, an error message will be printed, and an exception will be raised.
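One way to act on this note is to check for the dependency before constructing the tokenizer. This sketch assumes the Whisper package is importable under the module name `whisper`; `whisper_available` is a hypothetical helper name, not part of ESPnet.

```python
import importlib.util


def whisper_available() -> bool:
    """Return True if a module named 'whisper' can be located (assumed name)."""
    return importlib.util.find_spec("whisper") is not None


if not whisper_available():
    # Report the missing dependency up front, before building the tokenizer.
    print("Whisper is not installed; install it before using OpenAIWhisperTokenizer.")
```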
text2tokens(line: str) → List[str]
Convert a text line into a list of tokens.
This method utilizes the underlying tokenizer to tokenize the provided text line. It does not add any special tokens during the tokenization process.
- Parameters: line (str) – The input text line to be tokenized.
- Returns: A list of tokens generated from the input text line.
- Return type: List[str]
######### Examples
>>> tokenizer = OpenAIWhisperTokenizer(model_type="whisper_en")
>>> tokens = tokenizer.text2tokens("Hello, how are you?")
>>> print(tokens)
['Hello', ',', 'how', 'are', 'you', '?']
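####### NOTE This method assumes that the tokenizer has been properly initialized and is ready for use. The output shown above is illustrative: Whisper uses subword (BPE) tokenization, so the actual tokens depend on the model's vocabulary.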
tokens2text(tokens: Iterable[str]) → str
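Convert a sequence of tokens back into a text string.
This method is the inverse of text2tokens: it takes tokens (Iterable[str]) and returns the decoded text (str). The attributes and parameters listed below describe the enclosing OpenAIWhisperTokenizer class rather than this method.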
model
The type of Whisper model being used.
- Type: str
language
The language for tokenization, mapped from a code.
- Type: str
task
The task for which the model is used (transcribe/translate).
- Type: str
tokenizer
The actual tokenizer instance used for token conversion.
- Parameters:
- model_type (str) – The model type, either “whisper_en” or “whisper_multilingual”.
- language (str, optional) – The language code for tokenization. Defaults to “en”.
- task (str, optional) – The task to perform with the model. Can be “transcribe” or “translate”. Defaults to “transcribe”.
- sot (bool, optional) – Whether to include start of transcription token. Defaults to False.
- speaker_change_symbol (str, optional) – Symbol to denote speaker changes. Defaults to “<sc>”.
- added_tokens_txt (Optional[str], optional) – Path to a text file containing additional tokens to add. Defaults to None.
- Raises: ValueError – If an unsupported language or task is specified or if the tokenizer model type is unsupported.
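The `added_tokens_txt` parameter expects a path to a text file of extra tokens, but this page does not specify the file layout. The sketch below assumes one token per line, which is an assumption rather than a documented format, and the token strings are hypothetical placeholders.

```python
import os
import tempfile

# Hypothetical extra tokens, used only for illustration.
extra_tokens = ["<laugh>", "<noise>"]

# Assumption: the file holds one token per line (not confirmed by this page).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("\n".join(extra_tokens) + "\n")
    added_tokens_path = f.name

# Read the file back to confirm what a consumer of the file would see.
with open(added_tokens_path, encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]

# In real use, pass added_tokens_path as added_tokens_txt before deleting it.
os.remove(added_tokens_path)
```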
######### Examples
>>> tokenizer = OpenAIWhisperTokenizer(model_type="whisper_en")
>>> tokens = tokenizer.text2tokens("Hello, world!")
>>> text = tokenizer.tokens2text(tokens)
>>> print(text) # Output: "Hello, world!"
####### NOTE Make sure to have the Whisper package installed properly. If not, an error will be raised during initialization.
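The round trip between `text2tokens` and `tokens2text` can be checked with a small duck-typed helper that accepts any object exposing those two methods. In this sketch, `round_trip` and `WhitespaceTokenizer` are hypothetical names. The whitespace tokenizer is only a stand-in so the block runs without Whisper installed, and it does not reproduce Whisper's subword tokenization.

```python
from typing import Iterable, List


class WhitespaceTokenizer:
    """Stand-in with the same two-method interface (NOT Whisper's tokenization)."""

    def text2tokens(self, line: str) -> List[str]:
        # Split on whitespace; Whisper would produce subword units instead.
        return line.split()

    def tokens2text(self, tokens: Iterable[str]) -> str:
        # Rejoin with single spaces.
        return " ".join(tokens)


def round_trip(tokenizer, text: str) -> str:
    """Tokenize text and convert the tokens straight back to a string."""
    return tokenizer.tokens2text(tokenizer.text2tokens(text))


restored = round_trip(WhitespaceTokenizer(), "Hello, world!")
print(restored == "Hello, world!")
```

With Whisper installed, an OpenAIWhisperTokenizer instance could be passed to `round_trip` in place of the stand-in.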