espnet2.tts.feats_extract.energy.Energy
class espnet2.tts.feats_extract.energy.Energy(fs: int | str = 22050, n_fft: int = 1024, win_length: int | None = None, hop_length: int = 256, window: str = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, use_token_averaged_energy: bool = True, reduction_factor: int | None = None)
Bases: AbsFeatsExtract
Energy extractor for audio features.
This class implements an energy extraction mechanism from audio signals. It utilizes Short-Time Fourier Transform (STFT) to compute the energy of the input audio and offers functionalities to adjust and average the energy based on token durations.
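As a rough illustration of the computation described above, the following sketch derives one energy value per frame from a naive STFT. It assumes that energy is the L2 norm of each frame's one-sided magnitude spectrum, uses a periodic Hann window, and omits centering. `frame_energy` is a hypothetical helper written in plain Python for readability, not ESPnet's implementation.

```python
import cmath
import math


def frame_energy(signal, n_fft=8, hop_length=4):
    """Per-frame energy as the L2 norm of one-sided STFT magnitudes (no centering)."""
    # Periodic Hann window of length n_fft.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    energies = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        power = 0.0
        for k in range(n_fft // 2 + 1):  # one-sided spectrum: bins 0 .. n_fft/2
            x_k = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            power += abs(x_k) ** 2
        # Assumption: clamp before the square root to avoid exact zeros.
        energies.append(math.sqrt(max(power, 1.0e-10)))
    return energies


signal = [math.sin(2 * math.pi * n / 8) for n in range(16)]
energy = frame_energy(signal)  # one value per frame
```

Under these assumptions, scaling the waveform by a constant scales every frame's energy by the same constant.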
fs
Sampling frequency of the input audio.
- Type: int
n_fft
Number of FFT points.
- Type: int
win_length
Length of the window for STFT.
- Type: Optional[int]
hop_length
Hop length for STFT.
- Type: int
window
Type of window to use for STFT.
- Type: str
center
Whether to center the input signal.
- Type: bool
normalized
Whether to normalize the output.
- Type: bool
onesided
Whether to use one-sided STFT.
- Type: bool
use_token_averaged_energy
Whether to use averaged energy per token.
- Type: bool
reduction_factor
Factor for reducing the energy length.
- Type: Optional[int]
Parameters:
- fs (Union[int, str]) – Sampling frequency of the audio. Can be an int or a human-friendly string (e.g., “22k”).
- n_fft (int) – Number of FFT points. Default is 1024.
- win_length (Optional[int]) – Length of the window. Default is None, which uses n_fft.
- hop_length (int) – Hop length for STFT. Default is 256.
- window (str) – Type of window function to use. Default is “hann”.
- center (bool) – Whether to center the input signal. Default is True.
- normalized (bool) – Whether to normalize the output. Default is False.
- onesided (bool) – Whether to use one-sided STFT. Default is True.
- use_token_averaged_energy (bool) – Whether to average energy per token. Default is True.
- reduction_factor (Optional[int]) – Factor for reducing the energy length. Default is None, so a value must be given explicitly when use_token_averaged_energy is True, in which case it must be >= 1.
Returns: A tuple containing:
- energy (torch.Tensor): Extracted energy of shape (B, T, 1).
- energy_lengths (torch.Tensor): Lengths of the energy sequences.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If reduction_factor is less than 1 when use_token_averaged_energy is True.
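To make the "human-friendly string" form of fs concrete, here is a minimal sketch of how such a value could be turned into an integer number of Hz. `parse_fs` and its suffix table are hypothetical and written only for illustration; the library may rely on a different parser with different rules.

```python
def parse_fs(fs):
    """Return a sampling frequency in Hz from an int or a string such as "22k"."""
    if isinstance(fs, int):
        return fs
    text = fs.strip().lower()
    # Assumption: decimal suffixes, so "k" means 1000 and "m" means 1000000.
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


print(parse_fs("22k"))  # 22000
print(parse_fs(22050))  # 22050
```

Note that under this decimal reading "22k" means 22000 Hz rather than 22050 Hz, so passing the integer is the safer choice when the exact rate matters.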
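Examples:
>>> import torch
>>> from espnet2.tts.feats_extract.energy import Energy
>>> # Token averaging is disabled here because no durations are passed.
>>> energy_extractor = Energy(fs=22050, n_fft=1024, hop_length=256, use_token_averaged_energy=False)
>>> input_audio = torch.randn(10, 16000)  # Batch of 10 audio signals
>>> energy, lengths = energy_extractor(input_audio)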
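To anticipate the time dimension of frame-level energy, the frame count can be worked out from the STFT settings. The sketch below assumes the usual STFT framing rule, with n_fft // 2 samples of padding on each side when center is True. `num_frames` is a hypothetical helper, and the result describes frame-level energy only, before any token averaging.

```python
def num_frames(num_samples, n_fft=1024, hop_length=256, center=True):
    """Number of STFT frames for a waveform with num_samples samples."""
    if center:
        # Centering pads n_fft // 2 samples on each side before framing.
        num_samples = num_samples + 2 * (n_fft // 2)
    return (num_samples - n_fft) // hop_length + 1


print(num_frames(16000))                # 63
print(num_frames(16000, center=False))  # 59
```

For a batch of ten 16000-sample signals with the default settings, this would correspond to frame-level energy of shape (10, 63, 1).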
NOTE
This class is a subclass of AbsFeatsExtract and must be used in accordance with its interface.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input: Tensor, input_lengths: Tensor | None = None, feats_lengths: Tensor | None = None, durations: Tensor | None = None, durations_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]
Extracts energy features from audio input.
Computes frame-level energy from a batch of waveforms using the Short-Time Fourier Transform (STFT). If feats_lengths is given, each energy sequence is adjusted to that number of frames. If use_token_averaged_energy is True, the frame-level energy is then averaged over each token's duration. The attributes that control this computation are listed below.
fs
Sampling frequency of the audio signal.
- Type: int
n_fft
Number of FFT points.
- Type: int
hop_length
Number of samples between successive frames.
- Type: int
win_length
Length of each windowed segment.
- Type: Optional[int]
window
Type of window function to use (e.g., “hann”).
- Type: str
use_token_averaged_energy
Whether to use token-averaged energy.
- Type: bool
reduction_factor
Factor by which to reduce energy.
- Type: Optional[int]
Parameters:
- input (Tensor) – Batch of input waveforms of shape (B, T).
- input_lengths (Optional[Tensor]) – Length of each waveform, shape (B,). If None, all waveforms are assumed to have length T.
- feats_lengths (Optional[Tensor]) – Target number of frames for each utterance, shape (B,). If given, the energy sequences are adjusted to these lengths.
- durations (Optional[Tensor]) – Token durations for each utterance. Used only when use_token_averaged_energy is True.
- durations_lengths (Optional[Tensor]) – Number of tokens in each utterance, shape (B,). Used only when use_token_averaged_energy is True.
Returns: A tuple containing:
- Energy features of shape (B, T, 1).
- Lengths of the energy features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the input tensor’s shape is not valid or if the reduction_factor is invalid when use_token_averaged_energy is True.
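Examples:
>>> import torch
>>> from espnet2.tts.feats_extract.energy import Energy
>>> # Token averaging is disabled here because no durations are passed.
>>> energy_extractor = Energy(fs=22050, n_fft=1024, use_token_averaged_energy=False)
>>> input_tensor = torch.randn(10, 16000)  # Batch of 10 audio signals
>>> energy, lengths = energy_extractor.forward(input_tensor)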
NOTE
The input tensor should have the shape (B, T), where B is the batch size and T is the number of time steps.
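The two optional steps that follow the STFT, length adjustment via feats_lengths and token averaging via durations, can be sketched on plain Python lists. Both helpers are hypothetical illustrations: they assume that adjustment means zero-padding or truncation, that each duration counts groups of reduction_factor frames, and that averaging is a simple mean. The library's handling of edge cases may differ.

```python
def adjust_num_frames(energy, num_frames):
    """Zero-pad or truncate a frame-level energy sequence to num_frames."""
    if num_frames > len(energy):
        return energy + [0.0] * (num_frames - len(energy))
    return energy[:num_frames]


def average_by_duration(energy, durations, reduction_factor=1):
    """Average frame-level energy over each token's span of frames."""
    averaged = []
    start = 0
    for duration in durations:
        # Assumption: each duration counts groups of reduction_factor frames.
        end = start + duration * reduction_factor
        span = energy[start:end]
        # Tokens with zero duration receive zero energy.
        averaged.append(sum(span) / len(span) if span else 0.0)
        start = end
    return averaged


frames = adjust_num_frames([1.0, 2.0, 3.0, 4.0, 5.0], 6)
print(frames)                                  # [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]
print(average_by_duration(frames, [2, 0, 4]))  # [1.5, 0.0, 3.0]
```

The output then has one energy value per token rather than per frame, which is why the returned lengths would follow durations_lengths in that mode.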
get_parameters() → Dict[str, Any]
Retrieve the parameters of the Energy extractor.
This method returns a dictionary containing the key parameters used in the Energy extractor, which are essential for understanding the configuration and setup of the feature extraction process.
- Returns: A dictionary containing the parameters of the Energy extractor, including sample rate, FFT size, hop length, window type, and other relevant configurations.
- Return type: Dict[str, Any]
Examples:
>>> from espnet2.tts.feats_extract.energy import Energy
>>> energy_extractor = Energy(use_token_averaged_energy=False)
>>> parameters = energy_extractor.get_parameters()
>>> print(parameters)
{
'fs': 22050,
'n_fft': 1024,
'hop_length': 256,
'window': 'hann',
'win_length': None,
'center': True,
'normalized': False,
'use_token_averaged_energy': False,
'reduction_factor': None
}
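One practical use of these parameters is to derive the time and frequency resolution of the analysis. The sketch below works from a plain dictionary holding the default values shown above, so it does not require ESPnet; the variable names are illustrative only.

```python
params = {"fs": 22050, "n_fft": 1024, "hop_length": 256}

# Time between successive frames, in milliseconds.
frame_shift_ms = 1000.0 * params["hop_length"] / params["fs"]
# Spacing between adjacent FFT bins, in Hz.
bin_width_hz = params["fs"] / params["n_fft"]
# Number of bins in a one-sided spectrum.
num_bins = params["n_fft"] // 2 + 1

print(f"{frame_shift_ms:.2f} ms")  # 11.61 ms
print(f"{bin_width_hz:.2f} Hz")    # 21.53 Hz
print(num_bins)                    # 513
```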
output_size() → int
Returns the output size of the energy extractor.
This method returns a fixed output size of 1, which corresponds to the energy feature extracted from the input signal. This is a property of the energy extraction process, as the energy feature is a scalar value for each input frame.
- Returns: The output size of the energy extractor, which is always 1.
- Return type: int
Examples:
>>> from espnet2.tts.feats_extract.energy import Energy
>>> energy_extractor = Energy(use_token_averaged_energy=False)
>>> output_size = energy_extractor.output_size()
>>> print(output_size)
1