espnet2.tts.feats_extract.log_mel_fbank.LogMelFbank

class espnet2.tts.feats_extract.log_mel_fbank.LogMelFbank(fs: int | str = 16000, n_fft: int = 1024, win_length: int | None = None, hop_length: int = 256, window: str | None = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int | None = 80, fmax: int | None = 7600, htk: bool = False, log_base: float | None = 10.0)

Bases: AbsFeatsExtract

LogMelFbank is a conventional frontend structure for Text-to-Speech (TTS) systems. It processes audio signals through Short-Time Fourier Transform (STFT) to produce log-mel filter bank features.

The processing flow is as follows: STFT -> amplitude-spectrum -> Log-Mel-Fbank

fs

Sampling frequency. Can be an integer or a string representing a size (e.g., “16k”).

Type: Union[int, str]

n_fft

Number of FFT points.

Type: int

win_length

Window length for STFT. If None, defaults to n_fft.

Type: Optional[int]

hop_length

Hop length for STFT.

Type: int

window

Window function type (e.g., “hann”).

Type: Optional[str]

center

Whether to pad the signal on both sides so that the t-th frame is centered at sample t × hop_length.

Type: bool

normalized

Whether to normalize the STFT output.

Type: bool

onesided

Whether to return a one-sided spectrum.

Type: bool

n_mels

Number of Mel bands to generate.

Type: int

fmin

Minimum frequency (in Hz) to consider.

Type: Optional[int]

fmax

Maximum frequency (in Hz) to consider.

Type: Optional[int]

htk

Whether to use HTK formula for Mel scale.

Type: bool

log_base

Base of the logarithm for log-mel scaling.

Type: Optional[float]
Parameters:
- fs (Union[int, str]) – Sampling frequency. Default is 16000.
- n_fft (int) – Number of FFT points. Default is 1024.
- win_length (Optional[int]) – Window length. Default is None.
- hop_length (int) – Hop length. Default is 256.
- window (Optional[str]) – Type of window function. Default is “hann”.
- center (bool) – Centering of the window. Default is True.
- normalized (bool) – Normalization of the output. Default is False.
- onesided (bool) – One-sided spectrum output. Default is True.
- n_mels (int) – Number of Mel bands. Default is 80.
- fmin (Optional[int]) – Minimum frequency. Default is 80.
- fmax (Optional[int]) – Maximum frequency. Default is 7600.
- htk (bool) – HTK formula usage. Default is False.
- log_base (Optional[float]) – Logarithm base. Default is 10.0.
Returns: A tuple containing:
- output features (torch.Tensor): The log-mel features.
- feats_lens (torch.Tensor): The lengths of the features.
Return type: Tuple[torch.Tensor, torch.Tensor]
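
The fs parameter above accepts either an integer or a size-like string such as “16k”. The following is a minimal sketch of how such a string could be turned into an integer rate in Hz. The helper name parse_fs and its suffix table are hypothetical and only illustrate the idea; they are not ESPnet's own parsing code.

```python
def parse_fs(fs) -> int:
    """Hypothetical helper: turn 16000 or "16k" into an integer rate in Hz."""
    if isinstance(fs, int):
        return fs
    text = fs.strip().lower()
    # Assumed decimal suffixes: "k" = 1,000 and "m" = 1,000,000.
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


rate_from_string = parse_fs("16k")
rate_from_int = parse_fs(16000)
print(rate_from_string, rate_from_int)
```

Under these assumptions both calls yield the same sampling frequency, so configuration files can use whichever spelling is more readable.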

########### Examples

>>> import torch
>>> from espnet2.tts.feats_extract.log_mel_fbank import LogMelFbank
>>> logmel_fbank = LogMelFbank()
>>> audio_input = torch.randn(1, 16000)  # Simulated audio input
>>> features, lengths = logmel_fbank.forward(audio_input)

NOTE

The TTS definition for log-spectral features differs from ASR. TTS uses log_10(abs(stft)), while ASR uses log_e(power(stft)).
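
For a single STFT bin, the two definitions in the note differ only by a constant factor, because ln(|X|^2) = 2 ln(10) log_10(|X|). The sketch below checks this in plain Python. The function names tts_log_spec and asr_log_spec are hypothetical and are not part of ESPnet.

```python
import math


def tts_log_spec(re: float, im: float) -> float:
    """log_10 of the amplitude |X| (the TTS convention in the note)."""
    return math.log10(math.hypot(re, im))


def asr_log_spec(re: float, im: float) -> float:
    """Natural log of the power |X|^2 (the ASR convention in the note)."""
    return math.log(re * re + im * im)


# One complex STFT bin, given as (real, imaginary); here |X| = 5.
re, im = 3.0, 4.0
tts_value = tts_log_spec(re, im)
asr_value = asr_log_spec(re, im)

# For a single bin the two values differ by the factor 2 * ln(10).
ratio = asr_value / tts_value
print(tts_value, asr_value, ratio)
```

This proportionality holds per bin only. Once a mel filterbank sums several bins before the logarithm, amplitude-based and power-based features are no longer related by a single constant, so features computed under one convention should not be fed to a model trained on the other.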

Raises:
- AssertionError – If the input STFT tensor does not have the expected dimensions or shape.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input: Tensor, input_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]

Computes the Log-Mel filterbank features from the input audio tensor.

This method performs a sequence of operations on the input audio tensor, including Short-Time Fourier Transform (STFT), converting the complex spectrogram to amplitude, and then applying the Log-Mel filterbank to extract features suitable for speech synthesis.

Parameters:
- input (torch.Tensor) – The input audio tensor of shape (…, T), where T is the number of audio samples.
- input_lengths (torch.Tensor , optional) – A tensor containing the lengths of the input sequences. If not provided, it defaults to None.
Returns: A tuple containing:
- A tensor of shape (…, T’, n_mels) representing the extracted Log-Mel features, where T’ is the number of output time frames.
- A tensor containing the lengths of the output features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the STFT tensor computed from the input does not have at least 4 dimensions or if its last dimension (the real and imaginary parts) is not equal to 2.
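
The number of output frames T’ follows from the input length and the STFT settings. The sketch below estimates it, assuming the framing convention documented for torch.stft (with center=True the signal is padded by n_fft // 2 samples on each side). The helper num_frames is hypothetical and not part of ESPnet.

```python
def num_frames(num_samples: int, n_fft: int = 1024, hop_length: int = 256,
               center: bool = True) -> int:
    """Estimate the number of STFT frames T' for a waveform of num_samples."""
    if center:
        # Centering pads n_fft // 2 samples on each side of the signal.
        num_samples = num_samples + 2 * (n_fft // 2)
    return (num_samples - n_fft) // hop_length + 1


# One second of audio at 16 kHz with the default settings.
frames_centered = num_frames(16000)
frames_uncentered = num_frames(16000, center=False)
print(frames_centered, frames_uncentered)
```

With the defaults, centering yields a few more frames than the uncentered case because the padded edges contribute frames of their own.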

NOTE

The Log-Mel computation uses a different definition of log-spectra between Text-to-Speech (TTS) and Automatic Speech Recognition (ASR):

TTS: log_10(abs(stft))
ASR: log_e(power(stft))

########### Examples

>>> import torch
>>> from espnet2.tts.feats_extract.log_mel_fbank import LogMelFbank
>>> logmel_fbank = LogMelFbank()
>>> audio_input = torch.randn(1, 16000)  # Example audio input
>>> features, lengths = logmel_fbank.forward(audio_input)
>>> print(features.shape)  # Output shape will be (..., T', n_mels)

get_parameters() → Dict[str, Any]

Return the parameters required by Vocoder.

This method gathers and returns a dictionary of parameters that are essential for the vocoder to operate effectively. These parameters include sampling frequency, FFT size, hop length, window type, number of Mel filters, window length, minimum frequency, and maximum frequency.

Returns: A dictionary containing the following keys and their corresponding values:
- fs (int): Sampling frequency.
- n_fft (int): Number of FFT points.
- n_shift (int): Hop length (number of samples between frames).
- window (str): Type of windowing function used.
- n_mels (int): Number of Mel bands.
- win_length (Optional[int]): Length of the window.
- fmin (Optional[int]): Minimum frequency (in Hz).
- fmax (Optional[int]): Maximum frequency (in Hz).
Return type: Dict[str, Any]

########### Examples

>>> logmel_fbank = LogMelFbank()
>>> params = logmel_fbank.get_parameters()
>>> print(params)
{'fs': 16000, 'n_fft': 1024, 'n_shift': 256,
 'window': 'hann', 'n_mels': 80,
 'win_length': None, 'fmin': 80, 'fmax': 7600}
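
The returned values are expressed in samples, while vocoder configurations are often discussed in milliseconds. As a rough sketch, the dictionary printed above can be converted as follows. The variable names are illustrative only, and the fallback of win_length to n_fft follows the attribute description of win_length.

```python
# Dictionary as printed in the example above (default settings).
params = {
    "fs": 16000, "n_fft": 1024, "n_shift": 256,
    "window": "hann", "n_mels": 80,
    "win_length": None, "fmin": 80, "fmax": 7600,
}

# Hop between consecutive frames, in milliseconds.
frame_shift_ms = 1000.0 * params["n_shift"] / params["fs"]

# win_length=None means the window length falls back to n_fft.
effective_win_length = (
    params["n_fft"] if params["win_length"] is None else params["win_length"]
)
window_ms = 1000.0 * effective_win_length / params["fs"]

print(frame_shift_ms, window_ms)
```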

output_size() → int

Returns the number of Mel frequency bands.

This method returns the number of Mel frequency bands configured for the LogMelFbank instance. It is primarily used to determine the size of the output feature representations.

Returns: The number of Mel frequency bands.
Return type: int

########### Examples

>>> logmel_fbank = LogMelFbank(n_mels=80)
>>> output_size = logmel_fbank.output_size()
>>> print(output_size)
80