espnet2.asr.encoder.whisper_encoder.OpenAIWhisperEncoder
class espnet2.asr.encoder.whisper_encoder.OpenAIWhisperEncoder(input_size: int = 1, dropout_rate: float = 0.0, whisper_model: str = 'small', download_dir: str | None = None, use_specaug: bool = False, specaug_conf: dict | None = None, do_pad_trim: bool = False)
Bases: AbsEncoder
Transformer-based Speech Encoder from OpenAI’s Whisper Model.
This encoder leverages the Whisper model for speech recognition tasks. It processes audio inputs to generate log-mel spectrograms and encodes them using a series of transformer blocks.
For more information on the Whisper model, visit: https://github.com/openai/whisper
n_fft
Number of FFT components.
- Type: int
win_length
Window length for STFT.
- Type: int
hop_length
Hop length for STFT.
- Type: int
n_mels
Number of mel frequency bins.
- Type: int
mel_filters
Mel filter bank.
- Type: torch.Tensor
dropout
Dropout layer for regularization.
- Type: torch.nn.Dropout
encoders
Deep copy of the Whisper model encoder.
- Type: torch.nn.Module
specaug
SpecAugment instance for data augmentation.
- Type: SpecAug
do_pad_trim
Flag to indicate if padding/trimming is applied.
- Type: bool
pad_samples
Number of samples to pad/trim to.
- Type: int
Parameters:
- input_size (int) – Size of the input audio feature vector. Default is 1.
- dropout_rate (float) – Dropout rate for the encoder. Default is 0.0.
- whisper_model (str) – Name of the Whisper model to use. Default is “small”.
- download_dir (Optional[str]) – Directory to download the model. Default is None.
- use_specaug (bool) – Flag to use SpecAugment. Default is False.
- specaug_conf (Union[dict, None]) – Configuration for SpecAugment. Default is None.
- do_pad_trim (bool) – Flag to enable padding or trimming of inputs. Default is False.
Raises: ImportError – If the Whisper library is not installed properly.
############### Examples
>>> encoder = OpenAIWhisperEncoder(whisper_model="base")
>>> audio_input = torch.randn(1, 32000) # Example audio input
>>> ilens = torch.tensor([32000]) # Input lengths
>>> encoded_output, olens, _ = encoder(audio_input, ilens)
>>> print(encoded_output.shape) # Shape of the encoded output
######### NOTE The Whisper model does not originally use dropout. However, a dropout layer can be specified for regularization during training.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
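When use_specaug is True, specaug_conf is passed through to ESPnet's SpecAug module. A possible construction is sketched below; the configuration keys mirror SpecAug's usual constructor arguments and are shown here as an illustration, not an exhaustive or authoritative list.

```python
# Hypothetical SpecAugment configuration (keys assumed from espnet2's SpecAug):
specaug_conf = dict(
    apply_time_warp=True,
    time_warp_window=5,
    apply_freq_mask=True,
    freq_mask_width_range=(0, 27),
    num_freq_mask=2,
    apply_time_mask=True,
    time_mask_width_range=(0, 100),
    num_time_mask=2,
)

encoder = OpenAIWhisperEncoder(
    whisper_model="small",
    use_specaug=True,
    specaug_conf=specaug_conf,
    do_pad_trim=True,
)
```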
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Perform a forward pass through the OpenAI Whisper Encoder.
This method processes the input audio tensor, applies log-mel spectrogram transformation, and encodes the features using the Whisper model. It also handles optional padding/trimming and spec augmentation if enabled.
- Parameters:
- xs_pad (torch.Tensor) – Input audio tensor of shape (B, T, C), where B is the batch size, T is the sequence length, and C is the number of channels.
- ilens (torch.Tensor) – Tensor of shape (B,) containing the lengths of the input sequences before padding.
- prev_states (torch.Tensor , optional) – Previous states from the encoder, default is None.
- Returns:
- Processed audio tensor after encoding of shape (B, T’, C).
- Output lengths tensor of shape (B,) indicating the lengths of the output sequences.
- A placeholder value (always None), returned for interface compatibility with other encoders.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
######### NOTE The input audio tensor may be padded or trimmed to a fixed length defined by self.pad_samples if self.do_pad_trim is set to True.
############### Examples
>>> encoder = OpenAIWhisperEncoder()
>>> audio_input = torch.randn(2, 16000) # Batch of 2 audio samples
>>> input_lengths = torch.tensor([16000, 16000])
>>> output, output_lengths, _ = encoder.forward(audio_input, input_lengths)
>>> print(output.shape) # Expected shape: (2, T', C)
- Raises: ValueError – If the input tensor xs_pad is not of shape (B, T, C).
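As a rough illustration (not the verbatim ESPnet source), the forward pass chains the other documented methods of this class in the following order:

```python
def forward_sketch(enc, xs_pad, ilens):
    """Sketch of OpenAIWhisperEncoder.forward using its documented helpers."""
    # Optionally pad/trim raw audio to a fixed number of samples
    if enc.do_pad_trim:
        xs_pad = enc.pad_or_trim(xs_pad, enc.pad_samples)
    # Whisper-style log-mel features: (B, n_mels, n_frames)
    feats, feats_lens = enc.log_mel_spectrogram(xs_pad, ilens)
    # SpecAugment operates on (B, T, n_mels), so transpose around it
    if enc.specaug is not None and enc.training:
        feats, feats_lens = enc.specaug(feats.transpose(1, 2), feats_lens)
        feats = feats.transpose(1, 2)
    # Transformer encoding: (B, T', output_size)
    out, olens = enc.whisper_encode(feats, feats_lens)
    return out, olens, None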
log_mel_spectrogram(audio: Tensor, ilens: Tensor | None = None) → Tensor
Computes the log-mel spectrogram of the input audio tensor using the native Whisper training method.
This method first applies a Short-Time Fourier Transform (STFT) to the audio input, computes the mel spectrogram using mel filters, and then transforms the mel spectrogram into a log scale. The resulting log-mel spectrogram is used for further processing in the Whisper encoder.
Parameters:
- audio (torch.Tensor) – A tensor containing the audio waveform. The shape should be (batch_size, num_samples).
- ilens (torch.Tensor , optional) – A tensor containing the lengths of each audio sample in the batch. If provided, it is used to compute the output lengths. The shape should be (batch_size,).
Returns:
- torch.Tensor: The log-mel spectrogram of the input audio, with shape (batch_size, n_mels, n_frames).
- torch.Tensor or None: The output lengths of the log-mel spectrogram if ilens is provided, otherwise None.
Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
############### Examples
>>> encoder = OpenAIWhisperEncoder()
>>> audio_input = torch.randn(2, 16000) # Batch of 2 audio samples
>>> ilens = torch.tensor([16000, 16000]) # Lengths of audio samples
>>> log_mel_spec, output_lengths = encoder.log_mel_spectrogram(audio_input, ilens)
>>> log_mel_spec.shape
torch.Size([2, 80, 100]) # (batch_size, n_mels, n_frames) for n_mels=80, hop_length=160
######### NOTE The STFT is computed with a Hann window and the last frame is removed as per Whisper’s implementation.
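For reference, Whisper's native log-mel computation that this method follows can be sketched as below. The librosa-based mel filter bank is an assumption for this illustration and may differ slightly from the filters the encoder actually caches.

```python
import librosa
import torch

def log_mel_sketch(audio: torch.Tensor, n_fft=400, hop_length=160, n_mels=80, sr=16000):
    # STFT with a Hann window; Whisper drops the last frame
    window = torch.hann_window(n_fft)
    stft = torch.stft(audio, n_fft, hop_length, window=window, return_complex=True)
    magnitudes = stft[..., :-1].abs() ** 2
    # Project the power spectrogram onto mel bands
    filters = torch.from_numpy(
        librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    ).float()
    mel_spec = filters @ magnitudes
    # Log compression and Whisper's fixed normalization
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    log_spec = (log_spec + 4.0) / 4.0
    return log_spec  # (batch_size, n_mels, n_frames)

print(log_mel_sketch(torch.randn(2, 16000)).shape)  # torch.Size([2, 80, 100])
```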
output_size() → int
Returns the output size of the encoder.
This function retrieves the output size of the encoder, which is determined by the normalized shape of the final layer norm that follows the encoder’s transformer blocks. It is useful for knowing the dimensionality of the output tensor produced by the encoding process.
- Returns: The size of the output from the encoder.
- Return type: int
############### Examples
>>> encoder = OpenAIWhisperEncoder()
>>> output_size = encoder.output_size()
>>> print(output_size)
768 # Example output size depending on the model used.
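In practice this corresponds to the feature width of the final layer norm after the transformer blocks. A plausible one-line implementation, assuming the copied Whisper encoder exposes ln_post as in openai-whisper, would be:

```python
def output_size(self) -> int:
    # Feature dimension of the final LayerNorm (e.g. 768 for "small")
    return self.encoders.ln_post.normalized_shape[-1]
```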
pad_or_trim(array: Tensor, length: int, axis: int = -1) → Tensor
Pad or trim the audio array to a specified length along a given axis.
This method is used to ensure that the input audio tensor is of the required length for processing, which is particularly useful in zero-shot inference cases where input sizes may vary.
- Parameters:
- array (torch.Tensor) – The input audio tensor to be padded or trimmed.
- length (int) – The desired length of the audio tensor along the specified axis.
- axis (int , optional) – The axis along which to pad or trim. Defaults to -1 (the last dimension).
- Returns: The padded or trimmed audio tensor of the specified length.
- Return type: torch.Tensor
############### Examples
>>> import torch
>>> pad_length = 16000 # 1 second of audio at 16kHz
>>> audio_tensor = torch.randn(1, 20000) # A tensor with more samples
>>> trimmed_tensor = pad_or_trim(audio_tensor, pad_length)
>>> trimmed_tensor.shape
torch.Size([1, 16000]) # Output is trimmed to 16000 samples
>>> audio_tensor = torch.randn(1, 15000) # A tensor with fewer samples
>>> padded_tensor = pad_or_trim(audio_tensor, pad_length)
>>> padded_tensor.shape
torch.Size([1, 16000]) # Output is padded to 16000 samples
######### NOTE If the input tensor is larger than the specified length, it will be trimmed. If it is smaller, it will be padded with zeros.
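The behavior matches openai-whisper's pad_or_trim utility; a self-contained sketch of the same logic:

```python
import torch
import torch.nn.functional as F

def pad_or_trim_sketch(array: torch.Tensor, length: int, axis: int = -1) -> torch.Tensor:
    if array.shape[axis] > length:
        # Trim: keep the first `length` elements along `axis`
        array = array.index_select(
            dim=axis, index=torch.arange(length, device=array.device)
        )
    elif array.shape[axis] < length:
        # Pad: append zeros at the end of `axis`
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = F.pad(array, [p for sizes in pad_widths[::-1] for p in sizes])
    return array

print(pad_or_trim_sketch(torch.randn(1, 20000), 16000).shape)  # torch.Size([1, 16000])
print(pad_or_trim_sketch(torch.randn(1, 15000), 16000).shape)  # torch.Size([1, 16000])
```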
whisper_encode(input: Tensor, ilens: Tensor | None = None) → Tensor
Encode input audio using the Whisper model’s encoder.
This method processes the input tensor through several convolutional layers, applies positional encoding, and passes the result through the transformer blocks of the Whisper model. The output is the encoded representation of the audio along with the output lengths.
- Parameters:
- input (torch.Tensor) – A tensor of shape (batch_size, input_size, time) representing the input audio features.
- ilens (torch.Tensor , optional) – A tensor of shape (batch_size,) containing the lengths of each input sequence. If not provided, the output lengths will not be computed.
- Returns: A tuple containing:
  - A tensor of shape (batch_size, n_frames, output_size) with the encoded features.
- A tensor of shape (batch_size,) with the output lengths, or None if ilens was not provided.
- Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
############### Examples
>>> encoder = OpenAIWhisperEncoder()
>>> mel_input = torch.randn(2, 80, 100) # (batch_size, n_mels, n_frames)
>>> output, output_lengths = encoder.whisper_encode(mel_input)
######### NOTE The input audio tensor should be pre-processed to match the input requirements of the Whisper model. Ensure that the input size matches the expected shape for the model.
- Raises: ValueError – If the input tensor does not have the correct number of dimensions or if the lengths tensor has an incorrect shape.
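The encoding path described above corresponds roughly to the sketch below; the attribute names (conv1, conv2, positional_embedding, blocks, ln_post) follow openai-whisper's AudioEncoder and are assumptions about the copied module, not guaranteed ESPnet internals.

```python
import torch
import torch.nn.functional as F

def whisper_encode_sketch(whisper_encoder, mel: torch.Tensor) -> torch.Tensor:
    # mel: (batch_size, n_mels, n_frames) log-mel features
    x = F.gelu(whisper_encoder.conv1(mel))   # (B, d_model, T)
    x = F.gelu(whisper_encoder.conv2(x))     # stride-2 conv roughly halves T
    x = x.permute(0, 2, 1)                   # (B, T', d_model)
    x = x + whisper_encoder.positional_embedding[: x.size(1)]
    for block in whisper_encoder.blocks:     # transformer encoder layers
        x = block(x)
    return whisper_encoder.ln_post(x)        # (B, T', output_size)
```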