espnet2.tts.feats_extract.energy.Energy

class espnet2.tts.feats_extract.energy.Energy(fs: int | str = 22050, n_fft: int = 1024, win_length: int | None = None, hop_length: int = 256, window: str = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, use_token_averaged_energy: bool = True, reduction_factor: int | None = None)

Bases: AbsFeatsExtract

Energy extractor for audio features.

This class extracts energy features from audio signals. It computes the energy of the input audio from its Short-Time Fourier Transform (STFT) and can optionally adjust the result to a target number of frames and average it over token durations.
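To make the STFT-based computation concrete, here is a dependency-free sketch of per-frame energy. It assumes that "energy" means the square root of the summed squared STFT magnitudes in each frame, which is a common definition; the helper names (`hann_window`, `frame_energy`), the tiny `n_fft`, the clamping constant, and the lack of centering are illustrative choices, not the library's implementation.

```python
import cmath
import math


def hann_window(n_fft):
    # Periodic Hann window (an assumption; the library's window may differ in detail)
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_fft) for i in range(n_fft)]


def frame_energy(signal, n_fft=8, hop_length=4, eps=1e-10):
    """Return one energy value per frame: sqrt of the summed one-sided power."""
    window = hann_window(n_fft)
    energies = []
    # No centering here: frames start at 0, hop_length, 2 * hop_length, ...
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = [signal[start + i] * window[i] for i in range(n_fft)]
        power = 0.0
        for k in range(n_fft // 2 + 1):  # one-sided spectrum bins
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * i / n_fft)
                for i, x in enumerate(frame)
            )
            power += abs(coeff) ** 2
        # Clamp so that silent frames stay strictly positive
        energies.append(math.sqrt(max(power, eps)))
    return energies


tone = [math.sin(2.0 * math.pi * t / 4.0) for t in range(16)]
quiet = [0.1 * x for x in tone]
loud_energy = frame_energy(tone)
quiet_energy = frame_energy(quiet)
print(len(loud_energy))  # 3 frames for 16 samples with n_fft=8, hop_length=4
```

Scaling the waveform by 0.1 scales each frame's energy by the same factor, which is why energy is a useful loudness-like feature for TTS.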

fs

Sampling frequency of the input audio.

Type: int

n_fft

Number of FFT points.

Type: int

win_length

Length of the window for STFT.

Type: Optional[int]

hop_length

Hop length for STFT.

Type: int

window

Type of window to use for STFT.

Type: str

center

Whether to center the input signal.

Type: bool

normalized

Whether to normalize the output.

Type: bool

onesided

Whether to use one-sided STFT.

Type: bool

use_token_averaged_energy

Whether to use averaged energy per token.

Type: bool

reduction_factor

Factor for reducing the energy length.

Type: Optional[int]
Parameters:
- fs (Union[int, str]) – Sampling frequency of the audio. Can be an int or a human-friendly string (e.g., “22k”).
- n_fft (int) – Number of FFT points. Default is 1024.
- win_length (Optional[int]) – Length of the window. Default is None, which uses n_fft.
- hop_length (int) – Hop length for STFT. Default is 256.
- window (str) – Type of window function to use. Default is “hann”.
- center (bool) – Whether to center the input signal. Default is True.
- normalized (bool) – Whether to normalize the output. Default is False.
- onesided (bool) – Whether to use one-sided STFT. Default is True.
- use_token_averaged_energy (bool) – Whether to average energy per token. Default is True.
- reduction_factor (Optional[int]) – Factor for reducing the energy length. Must be >= 1 if use_token_averaged_energy is True.
Returns: A tuple containing:
- energy (torch.Tensor): Extracted energy of shape (B, T, 1).
- energy_lengths (torch.Tensor): Lengths of the energy sequences.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If reduction_factor is less than 1 when use_token_averaged_energy is True.
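The token-averaging option can be pictured with a short plain-Python sketch. It assumes that each token's duration counts the frames it covers, that durations are scaled by reduction_factor before slicing, and that only non-zero frames contribute to the mean; the helper name `average_by_duration` and these details are assumptions for illustration rather than a statement of the library's internals.

```python
def average_by_duration(frame_energy, durations, reduction_factor=1):
    """Average frame-level energy over the frames assigned to each token."""
    averaged = []
    start = 0
    for d in durations:
        n = d * reduction_factor  # frames covered by this token
        # Ignore zero-valued (padded or silent) frames when averaging
        span = [e for e in frame_energy[start:start + n] if e > 0.0]
        # Tokens without any usable frame get energy 0.0
        averaged.append(sum(span) / len(span) if span else 0.0)
        start += n
    return averaged


frames = [1.0, 3.0, 2.0, 2.0, 2.0, 5.0]
token_energy = average_by_duration(frames, durations=[2, 3, 0, 1])
print(token_energy)  # [2.0, 2.0, 0.0, 5.0]
```

With token averaging the output has one value per token instead of one per frame, so the returned lengths follow the duration sequence rather than the STFT frame count.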

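Examples

>>> import torch
>>> from espnet2.tts.feats_extract.energy import Energy
>>> # Token averaging is disabled here because it needs reduction_factor and durations
>>> energy_extractor = Energy(fs=22050, n_fft=1024, hop_length=256, use_token_averaged_energy=False)
>>> input_audio = torch.randn(10, 16000)  # Batch of 10 audio signals
>>> energy, lengths = energy_extractor(input_audio)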

NOTE

This class is a subclass of AbsFeatsExtract and must be used in accordance with its interface.
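The fs parameter accepts either an int or a human-friendly string such as “22k”. The following sketch shows one plausible way such a value could be normalized to an integer. The helper `parse_fs` is hypothetical and only handles "k" and "M" suffixes; whether the library interprets every string identically is an assumption.

```python
def parse_fs(fs):
    """Hypothetical helper: turn 22050, "16000", or "22k" into an int in Hz."""
    if isinstance(fs, int):
        return fs
    text = fs.strip().lower()
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against floating-point noise in values like "22.05k"
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


print(parse_fs("22k"))     # 22000
print(parse_fs("22.05k"))  # 22050
print(parse_fs(16000))     # 16000
```

Under this reading, “22k” means 22000 Hz, not 22050 Hz, so passing the exact integer is the safer choice when the sampling rate matters.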

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input: Tensor, input_lengths: Tensor | None = None, feats_lengths: Tensor | None = None, durations: Tensor | None = None, durations_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]

Extracts energy features from audio input.

This method computes energy features from the input waveform using the Short-Time Fourier Transform (STFT) configured at construction. The result can optionally be adjusted to the lengths given in feats_lengths and averaged over token durations. The computation is controlled by the following attributes.

fs

Sampling frequency of the audio signal.

Type: int

n_fft

Number of FFT points.

Type: int

hop_length

Number of samples between successive frames.

Type: int

win_length

Length of each windowed segment.

Type: Optional[int]

window

Type of window function to use (e.g., “hann”).

Type: str

use_token_averaged_energy

Whether to use token-averaged energy.

Type: bool

reduction_factor

Factor by which to reduce energy.

Type: Optional[int]
Parameters:
- input (torch.Tensor) – Input waveform of shape (B, T).
- input_lengths (Optional[torch.Tensor]) – Length of each waveform, shape (B,). If None, all inputs are assumed to have the same length.
- feats_lengths (Optional[torch.Tensor]) – Target number of frames for each item, shape (B,). If given, the energy is adjusted to match these lengths.
- durations (Optional[torch.Tensor]) – Token durations in frames. Needed when use_token_averaged_energy is True.
- durations_lengths (Optional[torch.Tensor]) – Length of each duration sequence, shape (B,).
Returns: A tuple containing:
- Energy features of shape (B, T, 1).
- Lengths of the energy features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the input tensor’s shape is not valid or if the reduction_factor is invalid when use_token_averaged_energy is True.
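The forward signature accepts feats_lengths, which suggests the energy sequence can be aligned with another feature stream such as a mel-spectrogram. A minimal sketch of that kind of adjustment, assuming it amounts to zero-padding short sequences and truncating long ones (the helper name `adjust_num_frames` is hypothetical):

```python
def adjust_num_frames(energy, num_frames):
    """Pad with zeros or truncate so that len(result) == num_frames."""
    if num_frames > len(energy):
        return energy + [0.0] * (num_frames - len(energy))
    return energy[:num_frames]


print(adjust_num_frames([1.0, 2.0, 3.0], 5))  # [1.0, 2.0, 3.0, 0.0, 0.0]
print(adjust_num_frames([1.0, 2.0, 3.0], 2))  # [1.0, 2.0]
```

Aligning lengths this way matters when two extractors use slightly different framing and their outputs must be combined frame by frame.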

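Examples

>>> import torch
>>> # Token averaging is disabled here because no durations are passed
>>> energy_extractor = Energy(fs=22050, n_fft=1024, use_token_averaged_energy=False)
>>> input_tensor = torch.randn(10, 16000)  # Batch of 10 audio signals
>>> energy, lengths = energy_extractor.forward(input_tensor)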

NOTE

The input tensor should have the shape (B, T), where B is the batch size and T is the number of time steps.
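For a (B, T) input, the number of output frames follows from the STFT settings. The sketch below uses the standard framing formula and assumes that centering pads n_fft // 2 samples on each side; the helper name `num_frames` is illustrative, and the extractor's exact length computation may differ.

```python
def num_frames(num_samples, n_fft=1024, hop_length=256, center=True):
    """Number of STFT frames for a waveform of num_samples samples."""
    if center:
        # Centering pads n_fft // 2 samples on both sides of the signal
        num_samples += 2 * (n_fft // 2)
    return (num_samples - n_fft) // hop_length + 1


print(num_frames(16000))                # 63
print(num_frames(16000, center=False))  # 59
```

With the default settings and token averaging disabled, a batch of 16000-sample waveforms would therefore be expected to yield energy of shape (B, 63, 1).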

get_parameters() → Dict[str, Any]

Retrieve the parameters of the Energy extractor.

This method returns a dictionary of the settings the extractor was configured with, such as the sampling frequency and the STFT configuration.

Returns: A dictionary containing the parameters of the Energy extractor, including sample rate, FFT size, hop length, window type, and other relevant configurations.
Return type: Dict[str, Any]

Examples

>>> energy_extractor = Energy(use_token_averaged_energy=False)
>>> parameters = energy_extractor.get_parameters()
>>> print(parameters)
{
    'fs': 22050,
    'n_fft': 1024,
    'hop_length': 256,
    'window': 'hann',
    'win_length': None,
    'center': True,
    'normalized': False,
    'use_token_averaged_energy': False,
    'reduction_factor': None
}
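A parameter dictionary like this is handy for deriving timing information. The sketch below works on a literal dictionary with the default values shown in this section, so it runs without the library; treating a win_length of None as n_fft follows the parameter description.

```python
params = {"fs": 22050, "n_fft": 1024, "hop_length": 256, "win_length": None}

# A win_length of None falls back to n_fft
win_length = params["win_length"] or params["n_fft"]

frame_shift_ms = 1000.0 * params["hop_length"] / params["fs"]
frame_length_ms = 1000.0 * win_length / params["fs"]
frames_per_second = params["fs"] / params["hop_length"]

print(round(frame_shift_ms, 2))     # 11.61
print(round(frame_length_ms, 2))    # 46.44
print(round(frames_per_second, 2))  # 86.13
```

These values help when checking that energy frames line up with other features extracted at the same hop length.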

output_size() → int

Returns the output size of the energy extractor.

The output size is always 1 because the extractor produces a single scalar energy value for each frame.

Returns: The output size of the energy extractor, which is always 1.
Return type: int

Examples

>>> energy_extractor = Energy(use_token_averaged_energy=False)
>>> output_size = energy_extractor.output_size()
>>> print(output_size)
1