espnet2.tts.feats_extract.ying.Ying
class espnet2.tts.feats_extract.ying.Ying(fs: int = 22050, w_step: int = 256, W: int = 2048, tau_max: int = 2048, midi_start: int = -5, midi_end: int = 75, octave_range: int = 24, use_token_averaged_ying: bool = False)
Bases: AbsFeatsExtract
Extract Ying-based Features.
This class computes Ying features from raw audio input using methods derived from the NANSY implementation. The features are calculated through a series of transformations applied to the audio signal, including computing the cumulative mean normalized difference function (cMNDF) and converting MIDI note numbers to time lags.
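For intuition, here is a minimal, illustrative sketch of the cMNDF computation for a single one-dimensional frame (the class itself operates on batched, framed tensors; the function name is hypothetical):

import torch

def cmndf_sketch(frame: torch.Tensor, tau_max: int) -> torch.Tensor:
    # Difference function: d(tau) = sum_t (x[t] - x[t + tau])^2
    d = torch.stack([((frame[:-tau] - frame[tau:]) ** 2).sum()
                     for tau in range(1, tau_max)])
    # Cumulative mean normalization: d'(tau) = d(tau) * tau / sum_{j <= tau} d(j)
    taus = torch.arange(1, tau_max, dtype=frame.dtype)
    return d * taus / torch.cumsum(d, dim=0).clamp(min=1e-8)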
fs
Sample rate of the audio.
- Type: int
w_step
Step size for the window in frames.
- Type: int
W
Window size for the analysis.
- Type: int
tau_max
Maximum time lag to consider.
- Type: int
midi_start
Starting MIDI note number.
- Type: int
midi_end
Ending MIDI note number.
- Type: int
octave_range
Number of MIDI notes per octave.
- Type: int
use_token_averaged_ying
Flag to indicate if token-averaged Ying features should be used.
- Type: bool
Parameters:
- fs (int) – Sample rate (default: 22050).
- w_step (int) – Window step size (default: 256).
- W (int) – Window size (default: 2048).
- tau_max (int) – Maximum time lag (default: 2048).
- midi_start (int) – Starting MIDI note (default: -5).
- midi_end (int) – Ending MIDI note (default: 75).
- octave_range (int) – Range of octaves (default: 24).
- use_token_averaged_ying (bool) – Use token averaged Ying features (default: False).
Returns: A tuple containing the Ying features and their lengths.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: AssertionError – If the input duration is invalid.
Examples
>>> import torch
>>> ying_extractor = Ying()
>>> audio_input = torch.randn(1, 4096)  # Simulated raw audio (at least W=2048 samples)
>>> ying_features, lengths = ying_extractor(audio_input)
>>> print(ying_features.shape)
torch.Size([1, 80, T']) # Shape will depend on input length
NOTE: This implementation is designed to be fully differentiable, allowing for integration into neural network training pipelines.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
crop_scope(x, yin_start, scope_shift)
Crop a specified scope from the input tensor based on the YIN start and scope shift.
This method extracts a segment from the input tensor x, starting at the yin_start index and shifting that start by scope_shift for each batch item. It returns a new tensor containing the cropped segment for each item in the batch.
- Parameters:
- x (torch.Tensor) – Input tensor of shape [B, C, T], where B is the batch size, C is the number of channels, and T is the sequence length.
- yin_start (int) – The starting index for cropping from each sequence in the input tensor.
- scope_shift (torch.Tensor) – A tensor of shape [B] representing the shift to apply to the starting index for each batch.
- Returns: A tensor containing the cropped segments from the input tensor, of shape [B, C, scope_length], where scope_length is determined by the difference between the ending index and the starting index.
- Return type: torch.Tensor
Examples
>>> import torch
>>> ying = Ying()
>>> x = torch.rand(2, 3, 10)  # Example input tensor [B, C, T]
>>> yin_start = 2
>>> scope_shift = torch.tensor([1, 2])
>>> cropped = ying.crop_scope(x, yin_start, scope_shift)
>>> print(cropped.shape)
torch.Size([2, 3, scope_length])  # Output shape depends on the scope
# length and the scope_shift values.
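Conceptually, this amounts to a per-item slice along the last axis. A rough equivalent sketch (scope_len stands in for the instance's internal scope length, an assumption here):

import torch

def crop_scope_sketch(x, yin_start, scope_shift, scope_len):
    # For each batch item b, keep [yin_start + shift_b, yin_start + shift_b + scope_len).
    return torch.stack(
        [x[b, :, yin_start + int(s): yin_start + int(s) + scope_len]
         for b, s in enumerate(scope_shift)],
        dim=0)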
forward(input: Tensor, input_lengths: Tensor | None = None, feats_lengths: Tensor | None = None, durations: Tensor | None = None, durations_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]
Computes the forward pass of the Ying feature extractor.
This method processes the input audio tensor to extract Ying-based features. It optionally handles input lengths, feature lengths, and duration information for the audio segments.
- Parameters:
- input (torch.Tensor) – A tensor of shape (B, T) representing the input audio signals, where B is the batch size and T is the number of time steps.
- input_lengths (Optional[torch.Tensor]) – A tensor of shape (B,) containing the lengths of each input sequence. If None, all inputs are assumed to be of maximum length.
- feats_lengths (Optional[torch.Tensor]) – A tensor of shape (B,) containing the lengths of the desired output features. This is used for optional length adjustment.
- durations (Optional[torch.Tensor]) – A tensor of shape (B,) containing the duration information for each input segment. Used for averaging when use_token_averaged_ying is True.
- durations_lengths (Optional[torch.Tensor]) – A tensor of shape (B,) containing the lengths of the duration sequences.
- Returns: A tuple containing:
  - A tensor of shape (B, F, T') representing the extracted Ying features, where F is the number of features and T' is the number of time steps after processing.
  - A tensor of shape (B,) containing the lengths of the output features.
- Return type: Tuple[torch.Tensor, torch.Tensor]
NOTE: The output tensor is converted to float type before returning.
Examples
>>> ying_extractor = Ying()
>>> audio_input = torch.randn(2, 4096) # Example input for batch size 2
>>> input_lengths = torch.tensor([4096, 4096])
>>> features, feature_lengths = ying_extractor.forward(audio_input,
... input_lengths)
>>> print(features.shape) # Expected shape: (2, F, T')
- Raises: ValueError – If the input tensor dimensions are not as expected or if input lengths exceed the tensor dimensions.
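When use_token_averaged_ying is enabled, durations are required so that frame-level values can be averaged per token. A hedged usage sketch (the duration shape and values here are illustrative assumptions, not taken from the implementation):

import torch
from espnet2.tts.feats_extract.ying import Ying

extractor = Ying(use_token_averaged_ying=True)
wav = torch.randn(1, 22050)               # one second of audio at fs=22050
wav_lengths = torch.tensor([22050])
durations = torch.tensor([[20, 30, 36]])  # per-token frame counts (illustrative)
durations_lengths = torch.tensor([3])
feats, feats_lengths = extractor(
    wav, wav_lengths, durations=durations, durations_lengths=durations_lengths)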
get_parameters() → Dict[str, Any]
Retrieve the parameters of the Ying feature extraction instance.
This method returns a dictionary containing the key parameters used in the Ying feature extraction process. The parameters include the sample rate, window step size, window size, maximum time lag, and whether token-averaged Ying is used.
- Parameters: None
- Returns: A dictionary containing the following keys:
  - fs (int): Sample rate.
  - w_step (int): Step size for the window.
  - W (int): Size of the window.
  - tau_max (int): Maximum time lag.
  - use_token_averaged_ying (bool): Indicates if token-averaged Ying is used.
- Return type: Dict[str, Any]
Examples
>>> ying = Ying()
>>> parameters = ying.get_parameters()
>>> print(parameters)
{'fs': 22050, 'w_step': 256, 'W': 2048, 'tau_max': 2048,
'use_token_averaged_ying': False}
NOTE: This method is useful for understanding the configuration of the Ying feature extraction instance and for debugging purposes.
midi_to_lag(m: int, octave_range: float = 12)
Converts MIDI note number to time lag.
This function calculates the time lag (tau, c(m)) corresponding to a given MIDI note number using the formula provided in the associated reference. The time lag is computed based on the frequency derived from the MIDI number.
- Parameters:
- m (int) – MIDI note number (typically ranging from 0 to 127).
- octave_range (float , optional) – The range of octaves for frequency calculation. Default is 12.
- Returns: The calculated time lag in seconds corresponding to the given MIDI note number.
- Return type: float
Examples
>>> midi_to_lag(69)  # A4, 440 Hz
0.0022727272727272726
>>> midi_to_lag(60)  # C4, ~261.63 Hz
0.0038222567...
>>> midi_to_lag(72)  # C5, ~523.25 Hz
0.0019111283...
NOTE: The standard reference frequency for MIDI note 69 (A4) is 440 Hz.
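The conversion itself is short. A minimal sketch, assuming the seconds-based return documented above (the 440 Hz reference at MIDI 69 and the 2**((m - 69) / octave_range) scaling are standard; the function name is illustrative):

import math

def midi_to_lag_sketch(m: int, octave_range: float = 12) -> float:
    # Frequency from the MIDI number, referenced to A4 = 440 Hz (MIDI 69).
    f = 440.0 * math.pow(2.0, (m - 69) / octave_range)
    # The time lag is the period of that frequency, in seconds.
    return 1.0 / f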
output_size() → int
Returns the output size of the Ying feature extractor.
This method returns a fixed output size of 1, which is used in the context of feature extraction from audio signals. The output size remains constant regardless of the input data.
- Returns: The output size of the Ying feature extractor, which is always 1.
- Return type: int
Examples
>>> ying = Ying()
>>> output_size = ying.output_size()
>>> print(output_size)
1
yingram(x: Tensor)
Extract Ying-based features (the Yingram) from raw audio.

This method computes the Yingram, a frame-level pitch representation, from the input audio tensor. The signal is framed, the cumulative mean normalized difference function (cMNDF) is computed for each frame, and the result is sampled at the time lags corresponding to the configured MIDI range (midi_start to midi_end). The computation is fully differentiable and suitable for use in neural network training pipelines.

- Parameters: x (torch.Tensor) – Input audio tensor.
- Returns: The calculated Yingram. As in the example below, the feature dimension spans the configured MIDI bins (80 with the default midi_start=-5 and midi_end=75) and the trailing dimension is the number of analysis frames t'.
- Return type: torch.Tensor
Examples
>>> import torch
>>> ying_extractor = Ying()  # Initialize the Ying feature extractor
>>> audio_batch = torch.randn(1, 4096)  # Simulated audio input
>>> ying_features = ying_extractor.yingram(audio_batch)
>>> print(ying_features.shape)  # Expected shape: (80, t')
NOTE: The Ying class inherits from AbsFeatsExtract and requires the espnet2 library for audio processing utilities.
yingram_from_cmndf(cmndfs: Tensor) → Tensor
Calculate the Yingram from cumulative mean normalized difference functions.

This method computes the Yingram from the provided cumulative mean normalized difference functions (cMNDFs). The Yingram is a representation that captures pitch information based on the input cMNDFs.
- Parameters: cmndfs (torch.Tensor) – A tensor containing the calculated cumulative mean normalized difference function. For details, refer to models/yin.py or equations (1) and (2) in the associated documentation.
- Returns: The calculated batch Yingram, which is a tensor containing the pitch representation derived from the input cMNDFs.
- Return type: torch.Tensor
Examples
>>> cmndfs = torch.randn(10, 2048) # Example input
>>> yingram = self.yingram_from_cmndf(cmndfs)
>>> print(yingram.shape)
torch.Size([10, <num_midis>]) # Output shape depends on midi range
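Conceptually, each Yingram bin reads the cMNDF at the (generally fractional) time lag of one MIDI note, using linear interpolation between the two neighboring integer lags. A minimal sketch, assuming precomputed fractional lags (the function and argument names are illustrative):

import torch

def yingram_bins_sketch(cmndfs: torch.Tensor, lags: torch.Tensor) -> torch.Tensor:
    # cmndfs: [B, tau_max]; lags: [M] fractional lags c(m), one per MIDI bin.
    lo = torch.floor(lags).long()
    hi = torch.ceil(lags).long()
    frac = (lags - lo.float()).unsqueeze(0)      # [1, M]
    left, right = cmndfs[:, lo], cmndfs[:, hi]   # [B, M] each
    # Linear interpolation between the two neighboring integer lags.
    return left + frac * (right - left)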