espnet2.asr.frontend.melspec_torch.MelSpectrogramTorch

class espnet2.asr.frontend.melspec_torch.MelSpectrogramTorch(preemp: bool = True, n_fft: int = 512, log: bool = False, win_length: int = 400, hop_length: int = 160, f_min: int = 20, f_max: int = 7600, n_mels: int = 80, window_fn: str = 'hamming', mel_scale: str = 'htk', normalize: str | None = None)

Bases: AbsFrontend

MelSpectrogramTorch computes Mel-spectrograms of audio signals using the Torchaudio library. It is part of the ESPnet2 framework and extends the abstract frontend class AbsFrontend. It converts raw waveforms into Mel-spectrogram features, which are commonly used in speech recognition.

log

Indicates whether to apply logarithmic scaling to the output.

Type: bool

n_mels

The number of Mel frequency bins.

Type: int

preemp

Indicates whether to apply pre-emphasis to the input signal.

Type: bool

normalize

Method of normalization. Options include “mn” for mean normalization.

Type: Optional[str]

window_fn

The window function to apply (Hann or Hamming).

Type: Callable

Parameters:
- preemp (bool) – Whether to apply pre-emphasis (default: True).
- n_fft (int) – Number of FFT points (default: 512).
- log (bool) – Whether to apply logarithmic scaling (default: False).
- win_length (int) – Window length in samples (default: 400).
- hop_length (int) – Hop length between successive frames, in samples (default: 160).
- f_min (int) – Minimum frequency in Hz (default: 20).
- f_max (int) – Maximum frequency in Hz (default: 7600).
- n_mels (int) – Number of Mel bands (default: 80).
- window_fn (str) – Type of window function (“hamming” or “hann”, default: “hamming”).
- mel_scale (str) – Type of Mel scale (“htk” or “slaney”, default: “htk”).
- normalize (Optional[str]) – Normalization method, e.g. “mn” for mean normalization (default: None).
Returns: A tuple containing the Mel-spectrogram tensor and the tensor of input lengths.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises:
- AssertionError – If the input tensor does not have 2 dimensions.
- NotImplementedError – If an unsupported normalization method is specified.
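
The f_min, f_max, n_mels, and mel_scale parameters together determine where the Mel bands sit on the frequency axis. The following is a minimal sketch, not the library's code: it assumes that the “htk” option follows the standard HTK formula and that band edges are spaced equally on the Mel scale using n_mels + 2 points, as in a typical triangular filterbank. The helper names are hypothetical.

```python
import math
from typing import List


def hz_to_mel_htk(freq_hz: float) -> float:
    """Standard HTK formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float = 20.0, f_max: float = 7600.0,
                   n_mels: int = 80) -> List[float]:
    """Return n_mels + 2 edge frequencies (Hz), equally spaced in Mel.

    Assumption: each band is a triangle spanning three consecutive edges.
    """
    m_lo, m_hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (m_hi - m_lo) / (n_mels + 1)
    return [mel_to_hz_htk(m_lo + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges()
print(len(edges))  # 82
# The spacing in Hz grows with frequency: low bands are narrow, high bands wide.
print(edges[1] - edges[0], edges[-1] - edges[-2])
```

With the default settings the edges run from 20 Hz to 7600 Hz, so the bands cover most of the spectrum below the Nyquist frequency of a 16 kHz signal.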

Examples

>>> import torch
>>> from espnet2.asr.frontend.melspec_torch import MelSpectrogramTorch
>>> mel_spectrogram = MelSpectrogramTorch()
>>> audio_tensor = torch.randn(1, 16000)  # Example audio tensor
>>> input_length = torch.tensor([16000])   # Length of the input audio
>>> mel_spec, mel_length = mel_spectrogram(audio_tensor, input_length)

NOTE

As a torch.nn.Module, this frontend runs on whichever device it is placed on, including a GPU if one is available. Ensure that the input tensor is on the same device as the module.

forward(input: Tensor, input_length: Tensor) → Tuple[Tensor, Tensor]

Compute the Mel-spectrogram of the input audio tensor.

This method applies a series of transformations to the input tensor, including optional pre-emphasis, Mel-spectrogram conversion, and logarithmic scaling, to produce a time-frequency representation.

Parameters:
- input (torch.Tensor) – A 2D tensor of shape (batch_size, num_samples) representing the input audio waveform.
- input_length (torch.Tensor) – A 1D tensor of shape (batch_size,) containing the lengths of the input audio samples.
Returns: A tuple containing:
- A 3D tensor of shape (batch_size, n_mels, num_frames) representing the Mel-spectrogram features.
- A 1D tensor of shape (batch_size,) containing the lengths of the Mel-spectrogram features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the input tensor does not have exactly 2 dimensions.
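
The num_frames dimension is not listed among the parameters, but it can be estimated from the waveform length and hop_length. The sketch below assumes a centered STFT, in which the signal is padded by n_fft // 2 samples on each side (Torchaudio's default behaviour, as far as is known here), and n_fft no larger than the signal. The helper name is hypothetical.

```python
def num_frames(num_samples: int, hop_length: int = 160,
               n_fft: int = 512, center: bool = True) -> int:
    """Estimate the number of STFT frames for a waveform of num_samples.

    Assumption: with center=True the signal is padded by n_fft // 2 on both
    sides, so the count depends only on hop_length; without centering, only
    frames that fit entirely inside the signal are produced.
    """
    if center:
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length


print(num_frames(16000))                # 101
print(num_frames(16000, center=False))  # 97
```

Under these assumptions, a 16000-sample input with the default hop_length of 160 yields 101 frames.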

Examples

>>> import torch
>>> from espnet2.asr.frontend.melspec_torch import MelSpectrogramTorch
>>> model = MelSpectrogramTorch()
>>> audio_input = torch.randn(2, 16000)  # batch of 2 audio signals
>>> input_length = torch.tensor([16000, 16000])  # lengths of audio
>>> mel_spectrogram, mel_length = model(audio_input, input_length)
>>> print(mel_spectrogram.shape)  # Output shape: (2, 80, num_frames)

NOTE

The pre-emphasis step can be enabled or disabled via the constructor parameter preemp. The logarithmic scaling can be controlled with the log parameter.
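
To make the optional steps concrete, here is a pure-Python sketch of pre-emphasis, logarithmic scaling, and “mn” mean normalization. It is an illustration under stated assumptions rather than the class's actual code: the pre-emphasis coefficient 0.97, the epsilon 1e-6, the handling of the first sample, and the choice to average over time within each Mel band are all assumed values and conventions. The function names are hypothetical.

```python
import math
from typing import List


def pre_emphasis(signal: List[float], coeff: float = 0.97) -> List[float]:
    """High-pass filter y[t] = x[t] - coeff * x[t-1].

    Assumption: the first sample is passed through unchanged.
    """
    if not signal:
        return []
    rest = [signal[t] - coeff * signal[t - 1] for t in range(1, len(signal))]
    return [signal[0]] + rest


def log_scale(mel: List[List[float]], eps: float = 1e-6) -> List[List[float]]:
    """Natural log of each value; eps (assumed) avoids log(0)."""
    return [[math.log(v + eps) for v in band] for band in mel]


def mean_normalize(mel: List[List[float]]) -> List[List[float]]:
    """Subtract each band's mean over time (assumed meaning of "mn")."""
    out = []
    for band in mel:
        mean = sum(band) / len(band)
        out.append([v - mean for v in band])
    return out


# A constant signal is almost entirely removed by pre-emphasis.
emphasized = pre_emphasis([1.0, 1.0, 1.0, 1.0])
print([round(v, 2) for v in emphasized])  # [1.0, 0.03, 0.03, 0.03]

# Toy "spectrogram" with 2 bands and 3 frames, laid out as (n_mels, num_frames).
toy_mel = [[1.0, 2.0, 4.0], [3.0, 3.0, 3.0]]
normalized = mean_normalize(log_scale(toy_mel))
```

After mean normalization each band sums to approximately zero over time, which removes constant per-band offsets such as a fixed channel gain.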

output_size() → int

Returns the feature dimension of the output, i.e. the number of Mel bands (n_mels) produced by the MelSpectrogramTorch instance. These are Mel-spectrogram bands, not Mel frequency cepstral coefficients (MFCCs).

This method provides the output size, which corresponds to the number of Mel bands specified during the initialization of the MelSpectrogramTorch class. It can be useful for understanding the shape of the output tensor produced by the forward method.

Returns: The number of Mel bands (n_mels) used in the spectrogram.
Return type: int

Examples

>>> from espnet2.asr.frontend.melspec_torch import MelSpectrogramTorch
>>> mel_spectrogram = MelSpectrogramTorch(n_mels=80)
>>> output_length = mel_spectrogram.output_size()
>>> print(output_length)  # Output: 80