espnet2.gan_tts.hifigan.loss.MelSpectrogramLoss

class espnet2.gan_tts.hifigan.loss.MelSpectrogramLoss(fs: int = 22050, n_fft: int = 1024, hop_length: int = 256, win_length: int | None = None, window: str = 'hann', n_mels: int = 80, fmin: int | None = 0, fmax: int | None = None, center: bool = True, normalized: bool = False, onesided: bool = True, log_base: float | None = 10.0)

Bases: Module

Mel-spectrogram loss module.

This module computes the loss between the Mel-spectrograms of the generated and ground-truth waveforms. It uses L1 loss by default and switches to MSE loss when forward is called with use_mse=True.

wav_to_mel

An instance of the LogMelFbank class used to convert waveforms to mel-spectrograms.

Type: LogMelFbank
Parameters:
- fs (int) – Sampling rate. Defaults to 22050.
- n_fft (int) – FFT points. Defaults to 1024.
- hop_length (int) – Hop length. Defaults to 256.
- win_length (Optional[int]) – Window length. If None, it is set to n_fft.
- window (str) – Window type. Defaults to “hann”.
- n_mels (int) – Number of Mel filterbank channels. Defaults to 80.
- fmin (Optional[int]) – Minimum frequency for Mel. Defaults to 0.
- fmax (Optional[int]) – Maximum frequency for Mel. If None, it is set to fs / 2.
- center (bool) – Whether to center each analysis window on its frame. Defaults to True.
- normalized (bool) – Whether to normalize the STFT output. Defaults to False.
- onesided (bool) – Whether to return a one-sided STFT. Defaults to True.
- log_base (Optional[float]) – Log base value. Defaults to 10.0.
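
As a rough guide to how these parameters shape the spectrogram, the sketch below estimates the number of frames and frequency bins for a waveform of a given length. It assumes the underlying STFT follows the torch.stft framing convention (1 + T // hop_length frames when center=True). The helper name stft_shape is hypothetical and not part of ESPnet.

```python
def stft_shape(num_samples, n_fft=1024, hop_length=256, center=True, onesided=True):
    """Estimate (num_frames, num_freq_bins) for an STFT.

    Hypothetical helper for illustration; assumes torch.stft-style framing.
    """
    if center:
        # The signal is padded by n_fft // 2 on both sides before framing.
        num_frames = 1 + num_samples // hop_length
    else:
        # Only frames that fit entirely inside the signal are kept.
        num_frames = 1 + (num_samples - n_fft) // hop_length
    # A one-sided STFT keeps only the non-negative frequencies.
    num_freq_bins = n_fft // 2 + 1 if onesided else n_fft
    return num_frames, num_freq_bins


frames, freq_bins = stft_shape(16000)
print(frames, freq_bins)  # 63 513
```

Under the same assumption, applying the Mel filterbank would replace the 513 frequency bins with n_mels channels (80 by default).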

####### Examples

Initialize the MelSpectrogramLoss

>>> loss_fn = MelSpectrogramLoss()

Calculate the loss

>>> y_hat = torch.randn(1, 1, 16000)  # Example generated waveform
>>> y = torch.randn(1, 1, 16000)      # Example groundtruth waveform
>>> loss = loss_fn(y_hat, y)

NOTE

This loss can be used for training generative models in tasks like text-to-speech or audio synthesis.

Initialize Mel-spectrogram loss.

Parameters:
- fs (int) – Sampling rate.
- n_fft (int) – FFT points.
- hop_length (int) – Hop length.
- win_length (Optional[int]) – Window length.
- window (str) – Window type.
- n_mels (int) – Number of Mel filterbank channels.
- fmin (Optional[int]) – Minimum frequency for Mel.
- fmax (Optional[int]) – Maximum frequency for Mel.
- center (bool) – Whether to center each analysis window on its frame.
- normalized (bool) – Whether to normalize the STFT output.
- onesided (bool) – Whether to return a one-sided STFT.
- log_base (Optional[float]) – Log base value.
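
The role of log_base can be illustrated with a small sketch of log compression applied to a single Mel energy value. This is not the library's code: the helper name log_compress, the 1e-10 floor, and the reading of None as "natural logarithm" are all assumptions made for illustration.

```python
import math


def log_compress(mel_energy, log_base=10.0, floor=1e-10):
    """Hypothetical helper: log-compress one Mel energy value."""
    # Clamp to a small floor to avoid log(0); the floor value is an assumption.
    value = math.log(max(mel_energy, floor))
    if log_base is None:
        # Assumed behavior: None selects the natural logarithm.
        return value
    # Change of base: log_b(x) = ln(x) / ln(b).
    return value / math.log(log_base)


print(round(log_compress(100.0), 6))  # 2.0
```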

forward(y_hat: Tensor, y: Tensor, spec: Tensor | None = None, use_mse: bool = False) → Tensor

Calculate Mel-spectrogram loss.

Parameters:
- y_hat (Tensor) – Generated waveform tensor (B, 1, T).
- y (Tensor) – Groundtruth waveform tensor (B, 1, T).
- spec (Optional[Tensor]) – Groundtruth linear amplitude spectrum tensor (B, T, n_fft // 2 + 1). If provided, it is used instead of the groundtruth waveform y.
- use_mse (bool) – Whether to use MSE loss instead of L1 loss. Defaults to False.
Returns: Mel-spectrogram loss value.
Return type: Tensor
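
The difference between the two loss modes can be sketched on plain Python lists standing in for two log-Mel matrices. This is a minimal illustration, assuming both modes average over all elements; the helper name mel_distance is hypothetical and the real module operates on tensors.

```python
def mel_distance(mel_hat, mel, use_mse=False):
    """Mean absolute error (default) or mean squared error between two
    equal-shape matrices given as nested lists. Hypothetical helper."""
    # Flatten the element-wise differences across all frames and channels.
    diffs = [a - b for row_hat, row in zip(mel_hat, mel) for a, b in zip(row_hat, row)]
    if use_mse:
        return sum(d * d for d in diffs) / len(diffs)
    return sum(abs(d) for d in diffs) / len(diffs)


mel_hat = [[0.0, 1.0], [2.0, 3.0]]
mel = [[0.5, 1.0], [1.0, 5.0]]
print(mel_distance(mel_hat, mel))                # 0.875
print(mel_distance(mel_hat, mel, use_mse=True))  # 1.3125
```

MSE weights the largest deviation (here 2.0) more heavily than L1 does, which is the usual reason to choose one mode over the other.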

####### Examples

>>> import torch
>>> from espnet2.gan_tts.hifigan.loss import MelSpectrogramLoss
>>> loss_fn = MelSpectrogramLoss()
>>> y_hat = torch.randn(2, 1, 16000)  # Generated waveform
>>> y = torch.randn(2, 1, 16000)      # Groundtruth waveform
>>> loss = loss_fn(y_hat, y)
>>> print(loss)

NOTE

This loss can be used in training neural networks for tasks such as speech synthesis, where the objective is to minimize the difference between the generated audio and the target audio in the Mel-spectrogram domain.