espnet2.tts.feats_extract.dio.Dio


class espnet2.tts.feats_extract.dio.Dio(fs: int | str = 22050, n_fft: int = 1024, hop_length: int = 256, f0min: int = 80, f0max: int = 400, use_token_averaged_f0: bool = True, use_continuous_f0: bool = True, use_log_f0: bool = True, reduction_factor: int_or_none = None)

Bases: AbsFeatsExtract

F0 estimation with dio + stonemask algorithm.

This class implements an F0 extractor based on the DIO (Distributed Inline-filter Operation) and StoneMask algorithms introduced in "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications".

fs

Sampling frequency in Hz.

Type: int

n_fft

Number of FFT points.

Type: int

hop_length

Hop length for the analysis.

Type: int

frame_period

Frame period in milliseconds, computed as 1000 * hop_length / fs.

Type: float

f0min

Minimum frequency for F0 extraction.

Type: int

f0max

Maximum frequency for F0 extraction.

Type: int

use_token_averaged_f0

Flag to use token-averaged F0.

Type: bool

use_continuous_f0

Flag to use continuous F0.

Type: bool

use_log_f0

Flag to use logarithmic F0.

Type: bool

reduction_factor

Reduction factor used to scale durations when averaging F0 per token.

Type: int or None
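The use_continuous_f0 and use_log_f0 flags describe post-processing of the raw F0 contour. The following is a minimal sketch of what such post-processing could look like, not the module's own code. It uses plain Python lists to stay dependency-free, the helper names `to_continuous_f0` and `to_log_f0` are hypothetical, and the interpolation scheme (linear across unvoiced gaps, edges filled from the nearest voiced frame) is an assumption.

```python
import math
from typing import List


def to_continuous_f0(f0: List[float]) -> List[float]:
    """Fill unvoiced (zero) frames by interpolating between voiced frames."""
    voiced = [i for i, v in enumerate(f0) if v > 0.0]
    if not voiced:
        # Entirely unvoiced: there is nothing to interpolate from.
        return list(f0)
    out = list(f0)
    first, last = voiced[0], voiced[-1]
    # Leading and trailing unvoiced frames copy the nearest voiced value.
    for i in range(first):
        out[i] = f0[first]
    for i in range(last + 1, len(f0)):
        out[i] = f0[last]
    # Interior gaps are interpolated linearly between neighbouring voiced frames.
    for left, right in zip(voiced[:-1], voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            out[i] = (1.0 - w) * f0[left] + w * f0[right]
    return out


def to_log_f0(f0: List[float]) -> List[float]:
    """Take the natural log of positive values; zeros stay zero."""
    return [math.log(v) if v > 0.0 else 0.0 for v in f0]


raw = [0.0, 100.0, 0.0, 0.0, 130.0, 0.0]
continuous = to_continuous_f0(raw)   # unvoiced frames filled in
log_f0 = to_log_f0(continuous)       # log applied after interpolation
```

In this sketch the interpolation runs before the logarithm, so the log is never taken of a zero-valued frame once the contour is continuous.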

NOTE

This module is based on a NumPy implementation, so the computational graph is not connected and gradients do not flow through the extracted F0.

Examples

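>>> import torch
>>> from espnet2.tts.feats_extract.dio import Dio
>>> # Token averaging needs durations, so it is disabled for this waveform-only call.
>>> dio = Dio(fs=22050, n_fft=1024, hop_length=256, use_token_averaged_f0=False)
>>> input_tensor = torch.randn(5, 1024)  # batch of 5 waveforms, 1024 samples each
>>> pitch, pitch_lengths = dio.forward(input_tensor)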

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input: Tensor, input_lengths: Tensor | None = None, feats_lengths: Tensor | None = None, durations: Tensor | None = None, durations_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]

Extract F0 features from the input tensor using the DIO + Stonemask algorithm.

This method processes the input tensor to extract fundamental frequency (F0) features, optionally adjusting the output based on provided lengths and averaging the results based on durations.

Parameters:
- input (torch.Tensor) – Input waveform tensor of shape (B, T), where B is the batch size and T is the number of samples.
- input_lengths (torch.Tensor , optional) – Lengths of each input in the batch. If None, assumes all inputs have the same length.
- feats_lengths (torch.Tensor , optional) – Target lengths for the output F0 features. If provided, the output will be adjusted accordingly.
- durations (torch.Tensor , optional) – Durations for averaging F0 values when use_token_averaged_f0 is True.
- durations_lengths (torch.Tensor , optional) – Lengths of the durations tensor.
Returns: A tuple containing:
- pitch (torch.Tensor): Extracted F0 features of shape (B, T, 1).
- pitch_lengths (torch.Tensor): Lengths of the extracted F0 features for each input in the batch.
Return type: Tuple[torch.Tensor, torch.Tensor]
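How feats_lengths and durations shape the output can be illustrated with a small sketch. This is an illustration under assumptions rather than the module's code: the helpers `adjust_num_frames` and `average_by_duration` are hypothetical names, plain lists stand in for tensors, and both zero-padding and averaging over voiced (positive) frames only are assumed behaviours.

```python
from typing import List


def adjust_num_frames(f0: List[float], num_frames: int) -> List[float]:
    """Zero-pad or truncate a frame-level contour to exactly num_frames."""
    if num_frames > len(f0):
        return f0 + [0.0] * (num_frames - len(f0))
    return f0[:num_frames]


def average_by_duration(f0: List[float], durations: List[int]) -> List[float]:
    """Average frame-level F0 over each token's span of frames.

    Only voiced (positive) frames contribute to the mean; a token whose
    frames are all unvoiced gets 0.0.
    """
    averaged = []
    start = 0
    for d in durations:
        voiced = [v for v in f0[start:start + d] if v > 0.0]
        averaged.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += d
    return averaged


# Five extracted frames are padded to a target length of six,
# then collapsed to three token-level values.
frames = adjust_num_frames([100.0, 110.0, 0.0, 0.0, 200.0], 6)
tokens = average_by_duration(frames, [2, 2, 2])
```

With durations of [2, 2, 2], six frames collapse to three values, so under token averaging the time axis of the output counts tokens rather than frames.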

Examples

>>> dio = Dio(use_token_averaged_f0=False)  # token averaging would require durations
>>> input_tensor = torch.randn(2, 16000)  # two waveforms of 16000 samples each
>>> output, lengths = dio.forward(input_tensor)

NOTE

The output shape is (B, T, 1), where B is the batch size and T is the number of F0 frames (or the number of tokens when use_token_averaged_f0 is True).
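To relate the frame count in the output to the waveform length, the sketch below computes the frame period from hop_length and fs and estimates the number of frames. Both helper names are hypothetical, and the frame-count formula is an assumption based on the convention of the WORLD reference implementation; the actual count may differ, for example when feats_lengths is supplied.

```python
def frame_period_ms(hop_length: int, fs: int) -> float:
    """Frame period in milliseconds for a given hop length and sampling rate."""
    return 1000.0 * hop_length / fs


def estimate_num_frames(num_samples: int, hop_length: int, fs: int) -> int:
    """Approximate number of F0 frames (assumed WORLD convention)."""
    duration_ms = 1000.0 * num_samples / fs
    return int(duration_ms / frame_period_ms(hop_length, fs)) + 1


period = frame_period_ms(256, 22050)                 # about 11.61 ms
num_frames = estimate_num_frames(16000, 256, 22050)  # frames for 16000 samples
```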

Raises:
- AssertionError – If reduction_factor is not set correctly when use_token_averaged_f0 is True.

get_parameters() → Dict[str, Any]

Returns the parameters of the Dio instance as a dictionary.

This method gathers the configuration parameters used in the Dio instance and returns them in a dictionary format. This is useful for inspecting the current settings of the F0 extractor.

Returns: A dictionary containing the following keys and their values:
- fs: The sampling frequency.
- n_fft: The FFT size.
- hop_length: The hop length.
- f0min: The minimum F0 value.
- f0max: The maximum F0 value.
- use_token_averaged_f0: Whether to use token-averaged F0.
- use_continuous_f0: Whether to use continuous F0.
- use_log_f0: Whether to use logarithmic F0.
- reduction_factor: The reduction factor for averaging.
Return type: Dict[str, Any]

Examples

>>> dio = Dio(use_token_averaged_f0=False)
>>> params = dio.get_parameters()
>>> print(params)
{'fs': 22050, 'n_fft': 1024, 'hop_length': 256, 'f0min': 80,
 'f0max': 400, 'use_token_averaged_f0': False,
 'use_continuous_f0': True, 'use_log_f0': True,
 'reduction_factor': None}

output_size() → int

Returns the output size of the Dio F0 extractor.

This method returns a fixed output size of 1, which represents the dimensionality of the output features produced by the Dio algorithm.

Returns: The output size, which is always 1.
Return type: int

Examples

>>> dio_extractor = Dio(use_token_averaged_f0=False)
>>> output_size = dio_extractor.output_size()
>>> print(output_size)
1