espnet2.s2st.tgt_feats_extract.log_mel_fbank.LogMelFbank


class espnet2.s2st.tgt_feats_extract.log_mel_fbank.LogMelFbank(fs: int | str = 16000, n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: str | None = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int | None = 80, fmax: int | None = 7600, htk: bool = False, log_base: float | None = 10.0)

Bases: AbsTgtFeatsExtract

LogMelFbank is a conventional frontend structure for Text-to-Speech (TTS) systems. It processes audio input through a series of transformations including Short-Time Fourier Transform (STFT), amplitude spectrum computation, and finally, conversion to Log-Mel filterbank features.

The sequence of operations is as follows:

- STFT: Converts the time-domain signal to the time-frequency domain.
- Amplitude-Spec: Computes the amplitude from the STFT.
- Log-Mel-Fbank: Applies a Log-Mel filterbank to the amplitude spectrum.
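
The three steps above can be sketched in plain NumPy. This is an illustrative approximation only, not the ESPnet implementation: it does not use the Stft or LogMel classes, the helper names (hz_to_mel, mel_to_hz, mel_filterbank, log_mel_fbank) are hypothetical, the HTK mel formula is used purely to keep the code short (the class default is htk=False), and the (frames, n_mels) output layout is a choice made for this sketch.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale, chosen here only for brevity (assumption).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(fs, n_fft, n_mels, fmin, fmax):
    # Triangular filters, shape (n_fft // 2 + 1, n_mels).
    fft_freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fbank = np.zeros((n_fft // 2 + 1, n_mels))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fbank[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank


def log_mel_fbank(signal, fs=16000, n_fft=1024, hop_length=256,
                  n_mels=80, fmin=80, fmax=7600, log_base=10.0):
    # 1) STFT: center-pad, split into frames, apply a Hann window, FFT.
    padded = np.pad(signal, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    frames = np.stack([
        padded[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    stft = np.fft.rfft(frames, axis=-1)  # (frames, n_fft // 2 + 1)
    # 2) Amplitude spectrum.
    amplitude = np.abs(stft)
    # 3) Mel filterbank followed by a clamped logarithm in the given base.
    mel = amplitude @ mel_filterbank(fs, n_fft, n_mels, fmin, fmax)
    return np.log(np.maximum(mel, 1e-10)) / np.log(log_base)


features = log_mel_fbank(np.random.randn(16000))
print(features.shape)  # (63, 80)
```

One second of 16 kHz audio yields 63 frames of 80 mel bins with the default settings used here.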

fs

Sampling frequency of the input audio.

Type: int

n_mels

Number of Mel bands to generate.

Type: int

n_fft

Size of the FFT window.

Type: int

hop_length

Number of samples between frames.

Type: int

win_length

Length of the windowed signal.

Type: Optional[int]

window

Type of window function to apply.

Type: Optional[str]

fmin

Minimum frequency (in Hz) to consider.

Type: Optional[int]

fmax

Maximum frequency (in Hz) to consider.

Type: Optional[int]

stft

Instance of the STFT class for time-frequency conversion.

Type: Stft

logmel

Instance of the LogMel class for Mel feature extraction.

Type: LogMel

Parameters:
- fs (Union[int, str]) – Sampling frequency (default is 16000).
- n_fft (int) – Size of the FFT window (default is 1024).
- win_length (Optional[int]) – Length of the windowed signal (default is None).
- hop_length (int) – Number of samples between frames (default is 256).
- window (Optional[str]) – Type of window function to apply (default is “hann”).
- center (bool) – If True, the signal is padded so that the window is centered at the current frame (default is True).
- normalized (bool) – If True, the output is normalized (default is False).
- onesided (bool) – If True, only the positive frequencies are returned (default is True).
- n_mels (int) – Number of Mel bands to generate (default is 80).
- fmin (Optional[int]) – Minimum frequency (in Hz) to consider (default is 80).
- fmax (Optional[int]) – Maximum frequency (in Hz) to consider (default is 7600).
- htk (bool) – If True, use HTK formula for Mel scale (default is False).
- log_base (Optional[float]) – Base of the logarithm (default is 10.0).
Returns: A tuple containing:
- torch.Tensor: Extracted Log-Mel features.
- torch.Tensor: Lengths of the features for each input.
Return type: Tuple[torch.Tensor, torch.Tensor]

Examples:

>>> import torch
>>> from espnet2.s2st.tgt_feats_extract.log_mel_fbank import LogMelFbank
>>> # Create an instance of LogMelFbank
>>> log_mel_fbank = LogMelFbank(fs=16000, n_mels=80)
>>> # Forward pass with a sample input tensor
>>> input_tensor = torch.randn(1, 16000)  # Example input
>>> features, lengths = log_mel_fbank(input_tensor)

Note: The implementation assumes that the input tensor has shape (batch_size, time) for single-channel audio.

Raises: AssertionError – If the input STFT does not have the expected dimensions.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input: Tensor, input_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]

Forward pass for the LogMelFbank module.

This method processes the input audio tensor and converts it into a log-mel spectrogram. It first applies Short-Time Fourier Transform (STFT) to convert the time-domain signal into the frequency domain, then computes the amplitude, and finally applies the log-mel filterbank to produce the output features.

Parameters:
- input (torch.Tensor) – Input audio tensor of shape (…, time).
- input_lengths (torch.Tensor , optional) – Lengths of the input sequences. Defaults to None.
Returns: A tuple containing:
- A tensor of shape (batch, frames, n_mels) representing the log-mel spectrogram.
- A tensor containing the lengths of the output features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the intermediate STFT output (input_stft) does not have at least 4 dimensions, or if its last dimension, which holds the real and imaginary parts, is not equal to 2.

Examples:

>>> logmel_fbank = LogMelFbank()
>>> audio_input = torch.randn(1, 16000)  # Simulated audio input
>>> features, lengths = logmel_fbank.forward(audio_input)
>>> print(features.shape)  # Shape is (batch, frames, n_mels)
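
The returned lengths count STFT frames rather than samples. A minimal sketch of that relationship follows, assuming the usual framing convention in which center=True pads n_fft // 2 samples on each side; the helper name num_frames is hypothetical and is not part of the LogMelFbank API.

```python
def num_frames(num_samples, n_fft=1024, hop_length=256, center=True):
    """Number of STFT frames for a signal of num_samples samples."""
    if center:
        # n_fft // 2 samples of padding are added on each side.
        num_samples += 2 * (n_fft // 2)
    return (num_samples - n_fft) // hop_length + 1


print(num_frames(16000))                # 63
print(num_frames(16000, center=False))  # 59
```

With the default settings, one second of 16 kHz audio therefore maps to 63 feature frames.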

Note: The log-mel computation is defined differently for TTS and ASR:

- TTS: log_10(abs(stft))
- ASR: log_e(power(stft))
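
To make the difference concrete, the sketch below evaluates both definitions on a single complex STFT value, before any mel filtering. The variable names are illustrative only. At this level the two conventions differ by the constant factor 2 * ln(10), since ln(|z|^2) = 2 * ln(10) * log_10(|z|).

```python
import math

stft_bin = complex(3.0, 4.0)  # one complex STFT value with |z| = 5

tts_style = math.log10(abs(stft_bin))     # log_10(abs(stft))
asr_style = math.log(abs(stft_bin) ** 2)  # log_e(power(stft))

# Constant factor relating the two conventions.
scale = 2.0 * math.log(10.0)
print(tts_style, asr_style, scale * tts_style)
```

Features produced under one convention should not be mixed with models trained on the other, because both the scale and the underlying spectrum (amplitude versus power) differ.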

get_parameters() → Dict[str, Any]

Return the parameters required by Vocoder.

This method retrieves and returns a dictionary containing the configuration parameters for the vocoder. These parameters are essential for the vocoder to process audio data correctly.

Returns: A dictionary with the following keys and their corresponding values:
- fs: Sampling frequency (int)
- n_fft: Number of FFT points (int)
- n_shift: Hop length for the STFT (int)
- window: Window type used for STFT (str)
- n_mels: Number of Mel bands (int)
- win_length: Window length for STFT (int or None)
- fmin: Minimum frequency (int or None)
- fmax: Maximum frequency (int or None)
Return type: Dict[str, Any]

Examples:

>>> logmel_fbank = LogMelFbank()
>>> parameters = logmel_fbank.get_parameters()
>>> print(parameters)
{'fs': 16000, 'n_fft': 1024, 'n_shift': 256,
 'window': 'hann', 'n_mels': 80, 'win_length': None,
 'fmin': 80, 'fmax': 7600}
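
The returned values are enough to derive the frame timing that a vocoder has to match. The sketch below works on a plain dictionary copied from the example output above. Treating a win_length of None as equal to n_fft is an assumption that follows the common STFT convention.

```python
params = {'fs': 16000, 'n_fft': 1024, 'n_shift': 256,
          'window': 'hann', 'n_mels': 80, 'win_length': None,
          'fmin': 80, 'fmax': 7600}

# Time between consecutive frames, in milliseconds.
frame_shift_ms = 1000.0 * params['n_shift'] / params['fs']

# Assumption: a win_length of None falls back to n_fft.
win_length = params['win_length'] or params['n_fft']
frame_length_ms = 1000.0 * win_length / params['fs']

frames_per_second = params['fs'] / params['n_shift']
print(frame_shift_ms, frame_length_ms, frames_per_second)  # 16.0 64.0 62.5
```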

output_size() → int

Get the output size of the LogMelFbank feature extractor.

This method returns the number of Mel frequency bins used in the Log-Mel-Fbank representation.

Returns: The number of Mel frequency bins (n_mels) configured in the LogMelFbank instance.
Return type: int

Examples:

>>> logmel_fbank = LogMelFbank(n_mels=80)
>>> output_size = logmel_fbank.output_size()  # output_size will be 80

Note: The output size corresponds to the n_mels parameter set during the initialization of the LogMelFbank class.

spectrogram() → bool

Conventional frontend structure for TTS.

Stft -> amplitude-spec -> Log-Mel-Fbank

fs

Sampling frequency of the audio signal.

Type: Union[int, str]

n_fft

Number of FFT points.

Type: int

win_length

Length of each windowed segment.

Type: int, optional

hop_length

Number of samples between adjacent frames.

Type: int

window

Window type to apply (default is “hann”).

Type: Optional[str]

center

Whether to pad the input signal on both sides.

Type: bool

normalized

Whether to normalize the output.

Type: bool

onesided

Whether to use a one-sided spectrum.

Type: bool

n_mels

Number of Mel bands to generate.

Type: int

fmin

Minimum frequency (default is 80 Hz).

Type: Optional[int]

fmax

Maximum frequency (default is 7600 Hz).

Type: Optional[int]

htk

Use HTK formula for Mel scale if True.

Type: bool

log_base

Base of the logarithm (default is 10.0).

Type: Optional[float]

Parameters:
- fs (Union[int, str]) – Sampling frequency (default is 16000).
- n_fft (int) – Number of FFT points (default is 1024).
- win_length (int, optional) – Length of each windowed segment.
- hop_length (int) – Number of samples between adjacent frames (default is 256).
- window (Optional[str]) – Window type (default is “hann”).
- center (bool) – Whether to pad the input signal on both sides (default is True).
- normalized (bool) – Whether to normalize the output (default is False).
- onesided (bool) – Whether to use a one-sided spectrum (default is True).
- n_mels (int) – Number of Mel bands (default is 80).
- fmin (Optional[int]) – Minimum frequency (default is 80).
- fmax (Optional[int]) – Maximum frequency (default is 7600).
- htk (bool) – Use HTK formula for Mel scale if True (default is False).
- log_base (Optional[float]) – Base of the logarithm (default is 10.0).
Returns: None
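
The htk flag listed above chooses between two common definitions of the Mel scale. The sketch below shows both formulas, assuming the flag has the same meaning as in librosa-style filterbank construction (HTK formula when True, Slaney-style scale otherwise). The function names are hypothetical and are not part of the class.

```python
import math


def hz_to_mel_htk(f):
    # HTK formula: purely logarithmic.
    return 2595.0 * math.log10(1.0 + f / 700.0)


def hz_to_mel_slaney(f):
    # Slaney-style scale: linear below 1 kHz, logarithmic above.
    f_sp = 200.0 / 3.0               # Hz per mel in the linear region
    min_log_hz = 1000.0              # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp  # mel value at 1 kHz (15)
    logstep = math.log(6.4) / 27.0   # step size in the logarithmic region
    if f < min_log_hz:
        return f / f_sp
    return min_log_mel + math.log(f / min_log_hz) / logstep


for f in (80, 1000, 7600):
    print(f, round(hz_to_mel_htk(f), 2), round(hz_to_mel_slaney(f), 2))
```

The two scales place the filter center frequencies differently, so features extracted with htk=True are not interchangeable with features extracted with htk=False.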

Examples:

>>> logmel = LogMelFbank(fs=16000, n_mels=80)
>>> print(logmel.output_size())  # Outputs: 80
>>> params = logmel.get_parameters()
>>> print(params)  # Outputs: parameter dictionary for the vocoder

Note: This class is designed to work within the ESPnet framework and follows the conventional TTS frontend processing pipeline.

Raises: ValueError – If any of the input parameters are invalid.