espnet2.tts.feats_extract.dio.Dio
class espnet2.tts.feats_extract.dio.Dio(fs: int | str = 22050, n_fft: int = 1024, hop_length: int = 256, f0min: int = 80, f0max: int = 400, use_token_averaged_f0: bool = True, use_continuous_f0: bool = True, use_log_f0: bool = True, reduction_factor: int_or_none = None)
Bases: AbsFeatsExtract
F0 estimation with dio + stonemask algorithm.
This class implements an F0 extractor based on the DIO (Distributed Inline-filter Operation) and StoneMask algorithms of the WORLD vocoder, described in "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications".
fs
Sampling frequency in Hz.
- Type: int
n_fft
Number of FFT points.
- Type: int
hop_length
Hop length for the analysis.
- Type: int
frame_period
Frame period in milliseconds, calculated from hop length and fs as 1000 * hop_length / fs.
- Type: float
f0min
Minimum frequency for F0 extraction.
- Type: int
f0max
Maximum frequency for F0 extraction.
- Type: int
use_token_averaged_f0
Flag to use token-averaged F0.
- Type: bool
use_continuous_f0
Flag to use continuous F0.
- Type: bool
use_log_f0
Flag to use logarithmic F0.
- Type: bool
reduction_factor
Reduction factor applied to the durations when averaging F0 per token. Must be set when use_token_averaged_f0 is True.
- Type: int or None
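The effect of the use_continuous_f0 and use_log_f0 flags can be illustrated with a minimal NumPy sketch. It assumes that "continuous" means filling unvoiced (zero) frames by linear interpolation between voiced neighbours, and that the logarithm is applied only to nonzero values. The helper names `to_continuous_f0` and `to_log_f0` are hypothetical, and `np.interp` stands in for whatever interpolation routine the class uses internally.

```python
import numpy as np


def to_continuous_f0(f0):
    """Fill unvoiced (zero) frames by linear interpolation (illustrative helper)."""
    f0 = np.asarray(f0, dtype=np.float64).copy()
    voiced = np.flatnonzero(f0 > 0)
    if voiced.size == 0:
        # No voiced frame to interpolate from: return the contour unchanged.
        return f0
    frames = np.arange(f0.size)
    # np.interp holds the first/last voiced value constant outside the voiced range.
    return np.interp(frames, voiced, f0[voiced])


def to_log_f0(f0):
    """Take the natural log of nonzero F0 values, leaving zeros untouched."""
    f0 = np.asarray(f0, dtype=np.float64).copy()
    nonzero = f0 != 0
    f0[nonzero] = np.log(f0[nonzero])
    return f0


raw_f0 = np.array([0.0, 100.0, 0.0, 0.0, 130.0, 0.0])
continuous_f0 = to_continuous_f0(raw_f0)   # [100, 100, 110, 120, 130, 130]
log_f0 = to_log_f0(continuous_f0)
```

Interpolating before taking the logarithm keeps every frame finite, which is one reason the two flags are commonly enabled together.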
NOTE
This module is based on a NumPy implementation, so the computational graph is not connected: gradients cannot be backpropagated through the extracted F0.
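What a disconnected computational graph means in practice can be shown with a small PyTorch sketch. It does not call Dio itself. It only assumes that any feature computed via a NumPy round trip behaves like the tensor below, and the doubling operation is a stand-in for the real F0 extraction.

```python
import torch

# A waveform-like tensor that tracks gradients.
waveform = torch.randn(8, requires_grad=True)

# Leaving PyTorch for NumPy requires detaching from the autograd graph.
as_numpy = waveform.detach().cpu().numpy()

# Stand-in for a NumPy-based feature computation, converted back to a tensor.
feature = torch.from_numpy(as_numpy * 2.0)

# The result carries no autograd history, so no gradient reaches `waveform`.
print(feature.requires_grad)  # False
print(feature.grad_fn)        # None
```

The extracted F0 is therefore best treated as a fixed target or conditioning feature rather than as a differentiable function of the waveform.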
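Examples
>>> # Token averaging is disabled here because no durations are passed.
>>> dio = Dio(fs=22050, n_fft=1024, hop_length=256, use_token_averaged_f0=False)
>>> input_tensor = torch.randn(5, 1024)  # Batch of 5 waveforms, 1024 samples each
>>> pitch, pitch_lengths = dio.forward(input_tensor)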
forward(input: Tensor, input_lengths: Tensor | None = None, feats_lengths: Tensor | None = None, durations: Tensor | None = None, durations_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]
Extract F0 features from the input tensor using the DIO + Stonemask algorithm.
This method processes the input tensor to extract fundamental frequency (F0) features, optionally adjusting the output based on provided lengths and averaging the results based on durations.
- Parameters:
- input (torch.Tensor) – Input waveform tensor of shape (B, T), where B is the batch size and T is the number of waveform samples.
- input_lengths (torch.Tensor , optional) – Lengths of each input in the batch. If None, assumes all inputs have the same length.
- feats_lengths (torch.Tensor , optional) – Target lengths for the output F0 features. If provided, the output will be adjusted accordingly.
- durations (torch.Tensor , optional) – Durations for averaging F0 values when use_token_averaged_f0 is True.
- durations_lengths (torch.Tensor , optional) – Lengths of the durations tensor.
- Returns: A tuple containing:
- pitch (torch.Tensor): Extracted F0 features of shape (B, T, 1).
- pitch_lengths (torch.Tensor): Lengths of the extracted F0 features for each input in the batch.
- Return type: Tuple[torch.Tensor, torch.Tensor]
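The duration-based averaging mentioned for the durations parameter can be sketched as follows. This is an illustrative NumPy version, not the class's own code. It assumes each token's F0 is the mean of the voiced (positive) frames inside that token's span, and returning 0.0 for a token with no voiced frames is a choice made for this sketch. The helper name `average_by_duration` is hypothetical.

```python
import numpy as np


def average_by_duration(f0, durations):
    """Average frame-level F0 over each token's span of frames (illustrative)."""
    f0 = np.asarray(f0, dtype=np.float64)
    # Cumulative durations give the frame boundaries of each token.
    bounds = np.concatenate(([0], np.cumsum(durations)))
    averaged = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = f0[start:end]
        voiced = segment[segment > 0]
        # Sketch-specific choice: tokens without voiced frames get 0.0.
        averaged.append(voiced.mean() if voiced.size else 0.0)
    return np.array(averaged)


frame_f0 = np.array([100.0, 110.0, 0.0, 0.0, 200.0, 220.0, 240.0])
durations = np.array([2, 2, 3])  # frames per token; they sum to len(frame_f0)
token_f0 = average_by_duration(frame_f0, durations)  # [105.0, 0.0, 220.0]
```

With averaging of this kind, the second axis of the output corresponds to tokens rather than frames, which is why a separate durations_lengths argument accompanies durations.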
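Examples
>>> # Token averaging is disabled here because no durations are passed.
>>> dio = Dio(use_token_averaged_f0=False)
>>> input_tensor = torch.randn(2, 16000)  # Two waveforms of 16000 samples each
>>> output, lengths = dio.forward(input_tensor)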
NOTE
The output shape will be (B, T, 1), where B is the batch size and T is the number of time frames.
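How T relates to the waveform length can be estimated with a little arithmetic. The sketch below assumes the frame period is 1000 * hop_length / fs milliseconds and that the underlying DIO implementation emits int(duration / frame_period) + 1 frames, as the WORLD reference code does. Both helper names are hypothetical, and the estimate applies only when neither feats_lengths nor token averaging changes the length.

```python
def frame_period_ms(fs, hop_length):
    """Frame period in milliseconds for a given sampling rate and hop length."""
    return 1000 * hop_length / fs


def expected_num_frames(num_samples, fs, hop_length):
    """Estimated number of F0 frames (assumes WORLD-style frame counting)."""
    duration_ms = 1000 * num_samples / fs
    return int(duration_ms / frame_period_ms(fs, hop_length)) + 1


period = frame_period_ms(22050, 256)             # about 11.61 ms
frames = expected_num_frames(16000, 22050, 256)  # 63 frames for 16000 samples
```

For the default settings, a 16000-sample input therefore yields roughly 63 frames per utterance under these assumptions.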
- Raises:
- AssertionError – If reduction_factor is not set correctly when use_token_averaged_f0 is True.
get_parameters() → Dict[str, Any]
Returns the parameters of the Dio instance as a dictionary.
This method gathers the configuration parameters used in the Dio instance and returns them in a dictionary format. This is useful for inspecting the current settings of the F0 extractor.
- Returns: A dictionary containing the following keys and their values:
- fs: The sampling frequency.
- n_fft: The FFT size.
- hop_length: The hop length.
- f0min: The minimum F0 value.
- f0max: The maximum F0 value.
- use_token_averaged_f0: Whether to use token-averaged F0.
- use_continuous_f0: Whether to use continuous F0.
- use_log_f0: Whether to use logarithmic F0.
- reduction_factor: The reduction factor for averaging.
- Return type: Dict[str, Any]
Examples
>>> dio = Dio()
>>> params = dio.get_parameters()
>>> print(params)
{'fs': 22050, 'n_fft': 1024, 'hop_length': 256, 'f0min': 80,
'f0max': 400, 'use_token_averaged_f0': True,
'use_continuous_f0': True, 'use_log_f0': True,
'reduction_factor': None}
output_size() → int
Returns the output size of the Dio F0 extractor.
This method returns a fixed output size of 1, which represents the dimensionality of the output features produced by the Dio algorithm.
- Returns: The output size, which is always 1.
- Return type: int
Examples
>>> dio_extractor = Dio()
>>> output_size = dio_extractor.output_size()
>>> print(output_size)
1