espnet2.tts.feats_extract.log_mel_fbank.LogMelFbank
class espnet2.tts.feats_extract.log_mel_fbank.LogMelFbank(fs: int | str = 16000, n_fft: int = 1024, win_length: int | None = None, hop_length: int = 256, window: str | None = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int | None = 80, fmax: int | None = 7600, htk: bool = False, log_base: float | None = 10.0)
Bases: AbsFeatsExtract
LogMelFbank is a conventional frontend structure for Text-to-Speech (TTS) systems. It processes audio signals through Short-Time Fourier Transform (STFT) to produce log-mel filter bank features.
The processing flow is as follows: STFT -> amplitude-spectrum -> Log-Mel-Fbank
fs
Sampling frequency in Hz. Can be an integer or a human-readable size string (e.g., “16k”).
- Type: Union[int, str]
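To illustrate what a size-style string such as “16k” resolves to, here is a minimal sketch. The helper `parse_fs` is hypothetical and written only for this illustration; it is not ESPnet's own parsing routine, which may accept more suffixes or behave differently.

```python
def parse_fs(fs):
    """Hypothetical helper: turn 16000 or "16k" into an integer rate in Hz."""
    if isinstance(fs, int):
        return fs
    text = fs.strip().lower()
    # Assumption: a trailing "k" is a decimal multiplier of 1000.
    if text.endswith("k"):
        return int(round(float(text[:-1]) * 1000))
    return int(text)


print(parse_fs("16k"))  # 16000
print(parse_fs(22050))  # 22050
```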
n_fft
Number of FFT points.
- Type: int
win_length
Window length for STFT. If None, defaults to n_fft.
- Type: Optional[int]
hop_length
Hop length for STFT.
- Type: int
window
Window function type (e.g., “hann”).
- Type: Optional[str]
center
Whether to pad the signal on both sides so that each frame is centered on its hop position (frame t is centered at sample t × hop_length).
- Type: bool
normalized
Whether to normalize the STFT output.
- Type: bool
onesided
Whether to return a one-sided spectrum.
- Type: bool
n_mels
Number of Mel bands to generate.
- Type: int
fmin
Minimum frequency (in Hz) to consider.
- Type: Optional[int]
fmax
Maximum frequency (in Hz) to consider.
- Type: Optional[int]
htk
Whether to use HTK formula for Mel scale.
- Type: bool
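For reference, the HTK formula maps a frequency f in Hz to mel as 2595 · log10(1 + f / 700). The sketch below evaluates it at the default fmin and fmax and spaces n_mels + 2 band edges evenly on the mel axis, which is the usual way triangular mel filters are laid out. The function names are illustrative and not part of ESPnet; with htk=False a different (Slaney-style) scale is typically used, which is not shown here.

```python
import math


def hz_to_mel_htk(f_hz):
    """HTK mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


n_mels, fmin, fmax = 80, 80, 7600
mel_lo, mel_hi = hz_to_mel_htk(fmin), hz_to_mel_htk(fmax)

# n_mels triangular filters need n_mels + 2 edge frequencies,
# equally spaced in mel and then mapped back to Hz.
step = (mel_hi - mel_lo) / (n_mels + 1)
edges_hz = [mel_to_hz_htk(mel_lo + i * step) for i in range(n_mels + 2)]

print(len(edges_hz))  # 82
```

The edges are dense at low frequencies and sparse at high frequencies, which is the point of the mel scale.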
log_base
Base of the logarithm for log-mel scaling.
- Type: Optional[float]
Parameters:
- fs (Union[int, str]) – Sampling frequency. Default is 16000.
- n_fft (int) – Number of FFT points. Default is 1024.
- win_length (Optional[int]) – Window length. Default is None.
- hop_length (int) – Hop length. Default is 256.
- window (Optional[str]) – Type of window function. Default is “hann”.
- center (bool) – Centering of the window. Default is True.
- normalized (bool) – Normalization of the output. Default is False.
- onesided (bool) – One-sided spectrum output. Default is True.
- n_mels (int) – Number of Mel bands. Default is 80.
- fmin (Optional[int]) – Minimum frequency. Default is 80.
- fmax (Optional[int]) – Maximum frequency. Default is 7600.
- htk (bool) – HTK formula usage. Default is False.
- log_base (Optional[float]) – Logarithm base. Default is 10.0.
Returns: A tuple containing:
- output features (torch.Tensor): The log-mel features.
- feats_lens (torch.Tensor): The lengths of the features.
Return type: Tuple[torch.Tensor, torch.Tensor]
########### Examples
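>>> import torch
>>> from espnet2.tts.feats_extract.log_mel_fbank import LogMelFbank
>>> logmel_fbank = LogMelFbank()
>>> audio_input = torch.randn(1, 16000) # Simulated audio input (1 second at 16 kHz)
>>> features, lengths = logmel_fbank.forward(audio_input)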
NOTE
The TTS definition for log-spectral features differs from ASR. TTS uses log_10(abs(stft)), while ASR uses log_e(power(stft)).
- Raises: AssertionError – If the input STFT tensor does not have the expected dimensions or shape.
forward(input: Tensor, input_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]
Computes the Log-Mel filterbank features from the input audio tensor.
This method performs a sequence of operations on the input audio tensor, including Short-Time Fourier Transform (STFT), converting the complex spectrogram to amplitude, and then applying the Log-Mel filterbank to extract features suitable for speech synthesis.
- Parameters:
- input (torch.Tensor) – The input audio tensor of shape (…, T), where T is the number of audio samples.
- input_lengths (torch.Tensor , optional) – A tensor containing the lengths of the input sequences. If not provided, it defaults to None.
- Returns: A tuple containing:
- A tensor of shape (…, T’, n_mels) representing the extracted Log-Mel features, where T’ is the number of output time frames.
- A tensor containing the lengths of the output features.
- Return type: Tuple[torch.Tensor, torch.Tensor]
- Raises: AssertionError – If the intermediate STFT tensor does not have at least 4 dimensions or if its last dimension (real and imaginary parts) is not equal to 2.
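The shapes involved can be worked out by hand. The sketch below assumes the framing convention of `torch.stft` with center=True (the signal is padded by n_fft // 2 on each side) and a frames-first layout for the intermediate STFT tensor. `stft_shape` is a hypothetical helper written for this illustration, not an ESPnet function.

```python
def stft_shape(batch, num_samples, n_fft=1024, hop_length=256,
               center=True, onesided=True):
    """Hypothetical helper: expected (batch, frames, freq_bins, 2) shape."""
    # With center=True the signal is padded by n_fft // 2 on both sides.
    padded = num_samples + 2 * (n_fft // 2) if center else num_samples
    frames = (padded - n_fft) // hop_length + 1
    # A one-sided spectrum keeps only the non-negative frequencies.
    freq_bins = n_fft // 2 + 1 if onesided else n_fft
    # The trailing 2 holds the real and imaginary parts.
    return (batch, frames, freq_bins, 2)


shape = stft_shape(1, 16000)
print(shape)  # (1, 63, 513, 2)

# The two conditions described under "Raises" hold for this shape.
assert len(shape) >= 4 and shape[-1] == 2
```

Under these assumptions, one second of 16 kHz audio yields 63 frames, and the mel projection replaces the 513 frequency bins with n_mels = 80 values per frame.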
NOTE
The Log-Mel computation uses a different definition of log-spectra between Text-to-Speech (TTS) and Automatic Speech Recognition (ASR):
- TTS: log_10(abs(stft))
- ASR: log_e(power(stft))
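For a single STFT bin the two conventions differ only by a constant factor, since log_e(|X|^2) = 2 · ln(10) · log_10(|X|). The following sketch checks this numerically for one complex value chosen purely for illustration. The identity holds per bin only: once a mel filterbank sums amplitudes (TTS) or powers (ASR) across bins before the logarithm, the two features are no longer related by a simple scale.

```python
import math

x = complex(3.0, 4.0)    # one illustrative STFT bin
amplitude = abs(x)       # |X| = 5.0
power = amplitude ** 2   # |X|^2 = 25.0

tts_style = math.log10(amplitude)  # log_10(abs(stft))
asr_style = math.log(power)        # log_e(power(stft))

print(round(tts_style, 5))  # 0.69897
print(round(asr_style, 5))  # 3.21888

# Per-bin relation between the two definitions.
assert math.isclose(asr_style, 2 * math.log(10) * tts_style)
```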
########### Examples
>>> logmel_fbank = LogMelFbank()
>>> audio_input = torch.randn(1, 16000) # Example audio input
>>> features, lengths = logmel_fbank.forward(audio_input)
>>> print(features.shape) # Shape is (..., T', n_mels)
get_parameters() → Dict[str, Any]
Return the parameters required by Vocoder.
This method gathers and returns a dictionary of parameters that are essential for the vocoder to operate effectively. These parameters include sampling frequency, FFT size, hop length, window type, number of Mel filters, window length, minimum frequency, and maximum frequency.
- Returns: A dictionary containing the following keys and their corresponding values:
- fs (int): Sampling frequency.
- n_fft (int): Number of FFT points.
- n_shift (int): Hop length (number of samples between frames).
- window (str): Type of windowing function used.
- n_mels (int): Number of Mel bands.
- win_length (Optional[int]): Length of the window.
- fmin (Optional[int]): Minimum frequency (in Hz).
- fmax (Optional[int]): Maximum frequency (in Hz).
- Return type: Dict[str, Any]
########### Examples
>>> logmel_fbank = LogMelFbank()
>>> params = logmel_fbank.get_parameters()
>>> print(params)
{'fs': 16000, 'n_fft': 1024, 'n_shift': 256,
'window': 'hann', 'n_mels': 80,
'win_length': None, 'fmin': 80, 'fmax': 7600}
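As an illustration of how these values might be consumed, the sketch below derives frame timing from the dictionary shown above. The dictionary is copied from the example output rather than obtained from ESPnet, and the fallback of win_length to n_fft follows the parameter description earlier on this page.

```python
params = {
    "fs": 16000, "n_fft": 1024, "n_shift": 256,
    "window": "hann", "n_mels": 80,
    "win_length": None, "fmin": 80, "fmax": 7600,
}

# win_length=None means the window spans the full FFT size.
win_length = params["win_length"]
if win_length is None:
    win_length = params["n_fft"]

frame_shift_ms = 1000.0 * params["n_shift"] / params["fs"]
frame_length_ms = 1000.0 * win_length / params["fs"]
frames_per_second = params["fs"] / params["n_shift"]

print(frame_shift_ms)     # 16.0
print(frame_length_ms)    # 64.0
print(frames_per_second)  # 62.5
```

A vocoder trained on these features has to use the same values, otherwise its frame rate and frequency range will not match the extracted features.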
output_size() → int
Returns the number of Mel frequency bands.
This method returns the number of Mel frequency bands (n_mels) configured for the LogMelFbank instance. It is primarily used to determine the size of the last dimension of the output features.
- Returns: The number of Mel frequency bands.
- Return type: int
########### Examples
>>> logmel_fbank = LogMelFbank(n_mels=80)
>>> output_size = logmel_fbank.output_size()
>>> print(output_size)
80