espnet2.tts.feats_extract.log_spectrogram.LogSpectrogram

class espnet2.tts.feats_extract.log_spectrogram.LogSpectrogram(n_fft: int = 1024, win_length: int | None = None, hop_length: int = 256, window: str | None = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)

Bases: AbsFeatsExtract

LogSpectrogram is a conventional frontend structure for Automatic Speech Recognition (ASR) that converts time-domain audio signals into log-amplitude spectrograms using Short-Time Fourier Transform (STFT).

The transformation pipeline consists of:

- STFT: converts the time-domain signal to a time-frequency representation.
- Log-amplitude spectrum: takes the logarithm of the amplitude of the resulting frequency bins.
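The two steps above can be illustrated on a single frame with nothing but the standard library. This is a minimal sketch, not the ESPnet implementation: it uses a naive DFT instead of an optimized STFT, and the helper names (`hann`, `log_amplitude_frame`) and the `1.0e-10` floor applied before the logarithm are assumptions made for illustration.

```python
import cmath
import math


def hann(n: int) -> list:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def log_amplitude_frame(frame: list, floor: float = 1.0e-10) -> list:
    """Return log10(|DFT(frame * hann)|) for the one-sided bins 0..n_fft // 2."""
    n_fft = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n_fft))]
    out = []
    for k in range(n_fft // 2 + 1):
        # Naive DFT of one bin: sum over n of x[n] * exp(-2j * pi * k * n / N)
        z = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_fft)
            for n, x in enumerate(windowed)
        )
        power = z.real ** 2 + z.imag ** 2
        # 0.5 * log10(power) equals log10(amplitude); the floor avoids log(0)
        out.append(0.5 * math.log10(max(power, floor)))
    return out


# One 16-sample frame holding a sinusoid that sits exactly on bin 2
n_fft = 16
frame = [math.cos(2.0 * math.pi * 2 * n / n_fft) for n in range(n_fft)]
log_amp = log_amplitude_frame(frame)
peak_bin = max(range(len(log_amp)), key=log_amp.__getitem__)
```

The result has `n_fft // 2 + 1` values, and the largest one falls on the bin that carries the sinusoid. A full STFT repeats this computation for every frame, moving `hop_length` samples between frames.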

n_fft

Number of FFT points.

Type: int

hop_length

Number of samples between frames.

Type: int

win_length

Length of the windowed signal.

Type: Optional[int]

window

Type of window to apply.

Type: Optional[str]

stft

Instance of the STFT class for performing STFT.

Type: Stft

Parameters:
- n_fft (int) – Number of FFT points (default is 1024).
- win_length (Optional[int]) – Length of the windowed signal (default is None).
- hop_length (int) – Number of samples between frames (default is 256).
- window (Optional[str]) – Type of window to apply (default is "hann").
- center (bool) – If True, the signal is padded so that the frame is centered at the original time index (default is True).
- normalized (bool) – If True, the output is normalized (default is False).
- onesided (bool) – If True, only the positive half of the spectrum is returned (default is True).
Returns: A tuple containing the log-amplitude spectrogram and the lengths of the features.
Return type: Tuple[torch.Tensor, torch.Tensor]

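Examples

>>> import torch
>>> from espnet2.tts.feats_extract.log_spectrogram import LogSpectrogram
>>> log_spectrogram = LogSpectrogram(n_fft=2048, hop_length=512)
>>> audio_input = torch.randn(1, 16000)  # Simulated audio input: 1 waveform, 16000 samples
>>> log_amp, feats_lens = log_spectrogram(audio_input)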

NOTE

The log-spectrogram is defined differently between TTS and ASR:

TTS: log_10(abs(stft))
ASR: log_e(power(stft))
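The two definitions differ only by a constant factor, because power is the squared magnitude and a change of logarithm base is a multiplication: log_e(power) = 2 * ln(10) * log_10(abs). The sketch below checks this on a single made-up STFT bin; the function names are illustrative and are not part of ESPnet.

```python
import math


def tts_log_spec(real: float, imag: float) -> float:
    """TTS-style value: log10 of the STFT magnitude."""
    return math.log10(math.hypot(real, imag))


def asr_log_spec(real: float, imag: float) -> float:
    """ASR-style value: natural log of the STFT power."""
    return math.log(real ** 2 + imag ** 2)


# A single hypothetical STFT bin with value 3 + 4j (magnitude 5, power 25)
tts_value = tts_log_spec(3.0, 4.0)
asr_value = asr_log_spec(3.0, 4.0)

# The two definitions are related by a constant factor of 2 * ln(10)
scale = 2.0 * math.log(10.0)
```

Features computed under one convention can therefore be rescaled to the other, but they should not be mixed without that rescaling.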

Raises:
- AssertionError – If the input STFT tensor does not have the expected dimensions or shape.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input: Tensor, input_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]

Computes the log-amplitude spectrogram from the input audio tensor.

This method takes an input tensor representing audio signals and converts it into a log-amplitude spectrogram using Short-Time Fourier Transform (STFT). The output consists of the log-amplitude features and their corresponding lengths.

Parameters:
- input (torch.Tensor) – The input audio tensor with shape (batch_size, num_samples).
- input_lengths (torch.Tensor, optional) – A tensor containing the lengths of the input sequences. If None, all sequences are assumed to have the same length.
Returns: A tuple containing:
- log_amp (torch.Tensor): The computed log-amplitude spectrogram with shape (batch_size, time_steps, num_features).
- feats_lens (torch.Tensor): The lengths of the output features after processing.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises:
- AssertionError – If the input STFT tensor does not have the expected dimensions or if the last dimension does not represent real/imaginary parts.

Examples

>>> model = LogSpectrogram(n_fft=1024)
>>> input_tensor = torch.randn(4, 16000)  # Batch of 4 simulated waveforms
>>> input_lengths = torch.full((4,), 16000)  # All waveforms are 16000 samples long
>>> log_amp, feats_lens = model(input_tensor, input_lengths)

NOTE

This class uses the TTS definition, log_10(abs(stft)). The ASR definition is log_e(power(stft)).
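The number of time steps in the output, and therefore the values in feats_lens, depends on the input length, n_fft, hop_length, and center. The sketch below assumes the framing convention documented for torch.stft (with center=True the signal is padded by n_fft // 2 samples on each side). The helper `num_frames` is hypothetical and is not part of ESPnet.

```python
def num_frames(
    num_samples: int,
    n_fft: int = 1024,
    hop_length: int = 256,
    center: bool = True,
) -> int:
    """Estimate the number of STFT frames for a waveform of num_samples."""
    if center:
        # n_fft // 2 samples of padding are added on each side of the signal
        num_samples = num_samples + 2 * (n_fft // 2)
    # One frame for the first window, then one more per full hop
    return (num_samples - n_fft) // hop_length + 1


frames_default = num_frames(16000)                  # n_fft=1024, hop_length=256
frames_no_center = num_frames(16000, center=False)  # no padding at the edges
frames_large = num_frames(16000, n_fft=2048, hop_length=512)
```

Under these assumptions, a 16000-sample waveform yields 63 frames with the default settings, 59 frames with center=False, and 32 frames with n_fft=2048 and hop_length=512.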

get_parameters() → Dict[str, Any]

Returns the parameters required by the vocoder.

This method gathers the essential parameters used for the vocoder, which include the number of FFT points, the hop length, the window length, and the window type. These parameters are crucial for generating the correct spectrogram representation needed by the vocoder.

Returns: A dictionary containing the following key-value pairs:
- n_fft (int): Number of FFT points.
- n_shift (int): Hop length (number of samples to shift).
- win_length (Optional[int]): Window length (if specified).
- window (Optional[str]): Type of window used for STFT.
Return type: Dict[str, Any]

Examples

>>> log_spectrogram = LogSpectrogram(n_fft=2048, hop_length=512)
>>> parameters = log_spectrogram.get_parameters()
>>> print(parameters)
{'n_fft': 2048, 'n_shift': 512, 'win_length': None, 'window': 'hann'}
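A consumer of this dictionary usually needs a concrete window length, while win_length may be None. The sketch below shows one way to resolve it, assuming the torch.stft convention that an unspecified window length defaults to n_fft. The helper `effective_stft_config` is hypothetical and is not part of ESPnet.

```python
from typing import Any, Dict


def effective_stft_config(params: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the parameter dict and replace a missing win_length by n_fft."""
    config = dict(params)  # shallow copy so the caller's dict is untouched
    if config["win_length"] is None:
        # Assumption: an unspecified window spans the whole FFT frame
        config["win_length"] = config["n_fft"]
    return config


# Same dictionary as printed in the example above
params = {"n_fft": 2048, "n_shift": 512, "win_length": None, "window": "hann"}
config = effective_stft_config(params)
```

Note that the hop length is stored under the key n_shift, not hop_length, so downstream code has to read it under that name.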

output_size() → int

Calculate the output size of the LogSpectrogram.

The output size is computed based on the number of FFT points (n_fft) used in the Short-Time Fourier Transform (STFT). The output size represents the number of frequency bins in the log-amplitude spectrogram, which is given by the formula n_fft // 2 + 1.

Returns: The number of frequency bins in the log-amplitude spectrogram.
Return type: int

Examples

>>> log_spectrogram = LogSpectrogram(n_fft=1024)
>>> log_spectrogram.output_size()
513
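Each of the n_fft // 2 + 1 output bins corresponds to an evenly spaced frequency between 0 Hz and half the sampling rate. The sketch below reproduces the formula and maps bins to frequencies. The 16000 Hz sampling rate and the helper names are assumptions for illustration, since LogSpectrogram itself takes no sampling-rate argument.

```python
def output_size(n_fft: int) -> int:
    """Number of one-sided frequency bins for an n_fft-point transform."""
    return n_fft // 2 + 1


def bin_frequencies(n_fft: int, sample_rate: float) -> list:
    """Center frequency in Hz of each one-sided bin: k * sample_rate / n_fft."""
    return [k * sample_rate / n_fft for k in range(output_size(n_fft))]


# Hypothetical 16 kHz audio analyzed with the default n_fft of 1024
freqs = bin_frequencies(1024, 16000.0)
bin_spacing = freqs[1] - freqs[0]
```

With these values the bins are 15.625 Hz apart and the last bin sits at 8000 Hz, which is half the assumed sampling rate.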