espnet2.s2st.tgt_feats_extract.log_spectrogram.LogSpectrogram

class espnet2.s2st.tgt_feats_extract.log_spectrogram.LogSpectrogram(n_fft: int = 1024, win_length: int = None, hop_length: int = 256, window: str | None = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True)

Bases: AbsTgtFeatsExtract

LogSpectrogram is a conventional frontend structure for Automatic Speech Recognition (ASR). It processes audio input to produce a log-amplitude spectrogram from the Short-Time Fourier Transform (STFT).

The main processing steps are as follows:

1. Apply STFT to convert time-domain signals to the time-frequency domain.
2. Compute the log-amplitude spectrum from the power spectrum of the STFT.
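
The two steps above can be sketched with the standard library alone, using a naive DFT on short frames. This is only an illustration under stated assumptions: the helper names (`hann`, `stft_frames`, `log_power`) are hypothetical, no center padding is applied, and the floor of 1e-10 is an assumed value. The class itself delegates step 1 to its `stft` attribute.

```python
import cmath
import math


def hann(length):
    """Periodic Hann window of the given length (hypothetical helper)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def stft_frames(signal, n_fft, hop_length):
    """Step 1: window each frame and take a one-sided naive DFT."""
    window = hann(n_fft)
    spectra = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        spectra.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return spectra


def log_power(spectra, floor=1.0e-10):
    """Step 2: power spectrum |X|^2, then natural log (assumed floor avoids log(0))."""
    return [[math.log(max(abs(c) ** 2, floor)) for c in frame] for frame in spectra]


# Sine with 2 cycles per 16-sample frame, so its energy sits in frequency bin 2.
n_fft, hop_length = 16, 8
signal = [math.sin(2.0 * math.pi * 2 * n / n_fft) for n in range(64)]
features = log_power(stft_frames(signal, n_fft, hop_length))
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
print(len(features), len(features[0]), peak_bin)
```

Each frame yields `n_fft // 2 + 1` values, and the largest value in a frame marks the dominant frequency bin of the test tone.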

n_fft

The number of FFT points.

Type: int

hop_length

The number of samples between each frame.

Type: int

win_length

The length of the windowed signal.

Type: Optional[int]

window

The type of window function to use.

Type: Optional[str]

stft

An instance of the STFT class for performing STFT.

Type: Stft
Parameters:
- n_fft (int) – Number of FFT points (default is 1024).
- win_length (Optional[int]) – Length of the windowed signal (default is None).
- hop_length (int) – Number of samples between each frame (default is 256).
- window (Optional[str]) – Type of window function (default is “hann”).
- center (bool) – Whether to pad the signal on both sides (default is True).
- normalized (bool) – Whether to normalize the output (default is False).
- onesided (bool) – Whether to return a one-sided spectrum (default is True).
Returns: A tuple containing:
- log_amplitude (torch.Tensor): The log-amplitude spectrogram.
- feats_lens (torch.Tensor): The lengths of the features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the input STFT tensor does not have the expected shape.

############# Examples

>>> import torch
>>> from espnet2.s2st.tgt_feats_extract.log_spectrogram import LogSpectrogram
>>> log_spec = LogSpectrogram(n_fft=2048, hop_length=512)
>>> input_tensor = torch.randn(1, 16000)  # Simulated audio: (batch_size, num_samples)
>>> log_amp, lengths = log_spec.forward(input_tensor)
>>> print(log_amp.shape)  # Output shape will depend on input length

####### NOTE

The log-amplitude spectrum is defined differently for Text-to-Speech (TTS) and ASR. In TTS, it is computed as log_10(abs(stft)), while in ASR, it is computed as log_e(power(stft)).
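
The two definitions in the note differ only by a constant factor, which the following sketch checks numerically for a single STFT bin. The function names and the floor of 1e-10 are assumptions made for illustration, not part of the class.

```python
import math


def log10_abs(real, imag, floor=1.0e-10):
    """TTS-style value: log_10(abs(stft)), computed as 0.5 * log_10(power)."""
    power = real * real + imag * imag
    return 0.5 * math.log10(max(power, floor))


def ln_power(real, imag, floor=1.0e-10):
    """ASR-style value: log_e(power(stft))."""
    power = real * real + imag * imag
    return math.log(max(power, floor))


# One STFT bin with real part 3 and imaginary part 4 (magnitude 5, power 25).
tts_value = log10_abs(3.0, 4.0)
asr_value = ln_power(3.0, 4.0)
scale = 2.0 * math.log(10.0)  # ln(power) = 2 * ln(10) * log10(abs)
print(tts_value, asr_value, asr_value / tts_value, scale)
```

Because the ratio is the constant `2 * ln(10)`, features produced under one definition can be rescaled to the other, as long as both use the same floor.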

forward(input: Tensor, input_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]

Forward pass to compute the log-amplitude spectrogram from the input tensor.

This method performs a Short-Time Fourier Transform (STFT) on the input audio tensor, calculates the power spectrum, and then computes the log-amplitude spectrogram. The output consists of the log-amplitude features and their corresponding lengths.

Parameters:
- input (torch.Tensor) – The input audio tensor of shape (batch_size, num_samples), or (batch_size, num_samples, num_channels) for multi-channel audio.
- input_lengths (torch.Tensor, optional) – A tensor containing the lengths of the input sequences. If provided, it should have the shape (batch_size,).
Returns: A tuple containing:
- log_amp (torch.Tensor): The computed log-amplitude spectrogram of shape (batch_size, time_steps, num_freq_bins).
- feats_lens (torch.Tensor): The lengths of the features for each input in the batch.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the STFT output does not have the expected dimensions or if its last dimension does not correspond to the real and imaginary parts.

############# Examples

>>> import torch
>>> from espnet2.s2st.tgt_feats_extract.log_spectrogram import LogSpectrogram
>>> log_spectrogram = LogSpectrogram()
>>> input_tensor = torch.randn(2, 16000)  # Example input: (batch_size, num_samples)
>>> output, lengths = log_spectrogram.forward(input_tensor)
>>> print(output.shape)  # Output shape: (2, time_steps, num_freq_bins)
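
The number of time steps, and hence the values in feats_lens, can be estimated from the input length. The sketch below assumes the framing convention documented for `torch.stft`. The helper `num_frames` is hypothetical and is not part of this class.

```python
def num_frames(num_samples, n_fft=1024, hop_length=256, center=True):
    """Estimate the STFT frame count for a signal of num_samples samples.

    Assumes the torch.stft convention: with center=True the signal is padded
    by n_fft // 2 on both sides, otherwise only full windows are counted.
    """
    if center:
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length


# One second of 16 kHz audio with the class defaults (n_fft=1024, hop_length=256).
frames_centered = num_frames(16000)
frames_uncentered = num_frames(16000, center=False)
print(frames_centered, frames_uncentered)
```

With center=True the count depends only on hop_length, which is why padded and unpadded inputs in a batch need per-item lengths.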

####### NOTE

The definition of the log-amplitude spectrogram differs between TTS and ASR:

- TTS: log_10(abs(stft))
- ASR: log_e(power(stft))

get_parameters() → Dict[str, Any]

Return the parameters required by Vocoder.

This method gathers the essential parameters used in the vocoder process, including the number of FFT points, hop length, window length, and window type. These parameters are crucial for configuring the vocoder’s behavior.

Returns: A dictionary containing the following parameters:
- n_fft (int): The number of FFT points.
- n_shift (int): The hop length for the STFT.
- win_length (Optional[int]): The window length for STFT.
- window (Optional[str]): The type of window used.
Return type: Dict[str, Any]

############# Examples

>>> log_spectrogram = LogSpectrogram(n_fft=2048, hop_length=512)
>>> params = log_spectrogram.get_parameters()
>>> print(params)
{'n_fft': 2048, 'n_shift': 512, 'win_length': None, 'window': 'hann'}
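
As an illustration of how a downstream component might read this dictionary, the sketch below converts the values into frame timing. The function `describe_stft_params` and the 16 kHz sample rate are hypothetical. The fallback of `win_length=None` to `n_fft` is an assumption that follows the `torch.stft` convention.

```python
def describe_stft_params(params, sample_rate):
    """Derive frame timing and bin count from a get_parameters()-style dict."""
    # Assumption: a missing window length means "use n_fft", as in torch.stft.
    win_length = params["win_length"] or params["n_fft"]
    return {
        "frame_shift_ms": 1000.0 * params["n_shift"] / sample_rate,
        "frame_length_ms": 1000.0 * win_length / sample_rate,
        "num_freq_bins": params["n_fft"] // 2 + 1,
    }


# Dictionary copied from the documented example output above.
params = {"n_fft": 2048, "n_shift": 512, "win_length": None, "window": "hann"}
summary = describe_stft_params(params, sample_rate=16000)
print(summary)
```

Note that the constructor argument `hop_length` is exposed under the key `n_shift` in the returned dictionary.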

output_size() → int

Returns the output size of the log spectrogram, which is calculated as half the FFT size plus one. This is useful for determining the dimensions of the output tensor after applying the Short-Time Fourier Transform (STFT).

The output size is computed as follows: output_size = n_fft // 2 + 1

n_fft

The number of FFT points used in the STFT.

Type: int
Returns: The output size of the log spectrogram.
Return type: int

############# Examples

>>> log_spectrogram = LogSpectrogram(n_fft=1024)
>>> log_spectrogram.output_size()
513
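
The formula follows from the symmetry of the DFT of a real signal: bins above n_fft // 2 mirror the lower ones, so a one-sided spectrum keeps only n_fft // 2 + 1 of them. A minimal sketch with a naive DFT, where `dft` and `onesided_size` are hypothetical helper names:

```python
import cmath
import math


def dft(x):
    """Naive full DFT of a real-valued sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def onesided_size(n_fft):
    """Number of non-redundant bins for a real input."""
    return n_fft // 2 + 1


signal = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, 0.9, -0.4]
spectrum = dft(signal)
n = len(signal)
# For real input, bin k is the complex conjugate of bin n - k.
mirrored = all(abs(spectrum[k] - spectrum[n - k].conjugate()) < 1e-9 for k in range(1, n))
print(mirrored, onesided_size(n), onesided_size(1024))
```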

spectrogram() → bool

Conventional frontend structure for Automatic Speech Recognition (ASR).

This class processes input audio signals through Short-Time Fourier Transform (STFT) to produce log-amplitude spectrograms, which are commonly used in speech processing tasks.

The processing flow is as follows:

1. Apply STFT to convert time-domain signal to time-frequency domain.
2. Compute the log-amplitude of the power spectrum.

n_fft

The number of FFT points.

Type: int

hop_length

The number of samples between successive frames.

Type: int

win_length

The length of each windowed signal segment.

Type: Optional[int]

window

The type of window function applied.

Type: Optional[str]

stft

An instance of the Stft class used for STFT processing.

Type: Stft
Parameters:
- n_fft (int) – Number of FFT points. Default is 1024.
- win_length (int, optional) – Length of the window. Default is None.
- hop_length (int) – Number of samples between frames. Default is 256.
- window (str, optional) – Window type. Default is “hann”.
- center (bool) – Whether to pad the signal to center the window. Default is True.
- normalized (bool) – Whether to normalize the output. Default is False.
- onesided (bool) – Whether to return a one-sided spectrum. Default is True.
Returns: A tuple containing the log-amplitude spectrogram and the lengths of the features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the input STFT does not have the expected dimensions.

############# Examples

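>>> import torch
>>> from espnet2.s2st.tgt_feats_extract.log_spectrogram import LogSpectrogram
>>> log_spectrogram = LogSpectrogram(n_fft=1024, hop_length=256)
>>> audio_input = torch.randn(1, 16000)  # Example audio tensor: (batch_size, num_samples)
>>> log_amp, feats_lens = log_spectrogram.forward(audio_input)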

####### NOTE

The log-amplitude is computed differently for Text-To-Speech (TTS) and ASR. For TTS, it is defined as log_10(abs(stft)), while for ASR, it is defined as log_e(power(stft)).