espnet2.svs.feats_extract.score_feats_extract.SyllableScoreFeats


class espnet2.svs.feats_extract.score_feats_extract.SyllableScoreFeats(fs: int | str = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)

Bases: AbsFeatsExtract

SyllableScoreFeats class for extracting syllable-level score features (labels, MIDI notes, and durations).

This class extends AbsFeatsExtract and handles syllable-level features, particularly for singing voice synthesis (SVS) tasks. It provides methods for segmenting input data into syllables and aggregating features based on the specified parameters.

fs

Sampling frequency (default: 22050).

Type: Union[int, str]

n_fft

Number of FFT points (default: 1024).

Type: int

win_length

Window length for STFT (default: 512).

Type: int

hop_length

Hop length for STFT (default: 128).

Type: int

window

Type of window function to use (default: “hann”).

Type: str

center

Whether to center the window (default: True).

Type: bool
Parameters:
- fs (Union[int, str]) – Sampling frequency (default: 22050).
- n_fft (int) – Number of FFT points (default: 1024).
- win_length (int) – Window length for STFT (default: 512).
- hop_length (int) – Hop length for STFT (default: 128).
- window (str) – Type of window function to use (default: “hann”).
- center (bool) – Whether to center the window (default: True).
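
The sample-based settings above are easier to reason about in time units. The following is a small arithmetic sketch using the default values, assuming generic STFT framing; the helper name `samples_to_ms` is hypothetical and is not part of ESPnet, and the sketch does not show what the class itself computes.

```python
def samples_to_ms(n_samples: int, fs: int) -> float:
    """Convert a length in samples to milliseconds at sampling rate fs (Hz)."""
    return 1000.0 * n_samples / fs


# Default values taken from the parameter list above.
fs, win_length, hop_length = 22050, 512, 128

win_ms = samples_to_ms(win_length, fs)   # duration covered by one window
hop_ms = samples_to_ms(hop_length, fs)   # time step between consecutive frames
frames_per_second = fs / hop_length      # frame rate implied by the hop length

print(f"window: {win_ms:.2f} ms, hop: {hop_ms:.2f} ms, "
      f"rate: {frames_per_second:.1f} frames/s")
```

With the defaults, each window spans roughly 23 ms and consecutive frames are spaced roughly 5.8 ms apart.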

Examples

>>> # Create an instance of SyllableScoreFeats
>>> syllable_feats = SyllableScoreFeats(fs=16000, n_fft=2048)
>>> # Forward pass with sample inputs
>>> output = syllable_feats.forward(
...     label=torch.tensor([[1, 2, 1, 3]]),
...     label_lengths=torch.tensor([4]),
...     midi=torch.tensor([[60, 62, 64, 65]]),
...     midi_lengths=torch.tensor([4]),
...     duration=torch.tensor([[0.5, 0.5, 0.5, 0.5]]),
...     duration_lengths=torch.tensor([4]),
... )

Returns: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: A tuple containing:
- seg_label: (Batch, Frames) extracted labels.
- seg_label_lengths: (Batch) lengths of the extracted labels.
- seg_midi: (Batch, Frames) extracted MIDI notes.
- seg_midi_lengths: (Batch) lengths of the extracted MIDI notes.
- seg_duration: (Batch, Frames) extracted durations.
- seg_duration_lengths: (Batch) lengths of the extracted durations.
Raises: AssertionError – If the shapes of the inputs do not match as expected.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

extra_repr()

Returns a string representation of the SyllableScoreFeats instance.

This method summarizes the key parameters of the SyllableScoreFeats instance, which is useful for debugging and logging. The summary includes the window length, the hop length, and whether centering is applied.

win_length

The length of the window used for the STFT.

Type: int

hop_length

The number of samples to hop between frames.

Type: int

center

Indicates if the input signal is centered.

Type: bool
Returns: A formatted string containing the key parameters of the instance.
Return type: str

Examples

>>> syllable_score_feats = SyllableScoreFeats(win_length=256, hop_length=128)
>>> print(syllable_score_feats.extra_repr())
win_length=256, hop_length=128, center=True,

SyllableScoreFeats forward function.

This method processes the input tensors for labels, midi, and duration by aggregating them into frame-level representations. It ensures that the input tensors have compatible shapes and returns the aggregated outputs along with their respective lengths.

Parameters:
- label – Tensor of shape (Batch, Nsamples) representing the labels.
- label_lengths – Tensor of shape (Batch) representing the lengths of each label sequence.
- midi – Tensor of shape (Batch, Nsamples) representing the MIDI information.
- midi_lengths – Tensor of shape (Batch) representing the lengths of each MIDI sequence.
- duration – Tensor of shape (Batch, Nsamples) representing the duration information.
- duration_lengths – Tensor of shape (Batch) representing the lengths of each duration sequence.
Returns: A tuple containing:
- label: Aggregated label tensor of shape (Batch, Frames).
- label_lengths: Aggregated lengths tensor of shape (Batch).
- midi: Aggregated MIDI tensor of shape (Batch, Frames).
- midi_lengths: Aggregated lengths tensor of shape (Batch).
- duration: Aggregated duration tensor of shape (Batch, Frames).
- duration_lengths: Aggregated lengths tensor of shape (Batch).
Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
Raises: AssertionError – If the shapes of the input tensors are not compatible.

Examples

>>> label = torch.tensor([[1, 2, 3], [4, 5, 6]])
>>> label_lengths = torch.tensor([3, 3])
>>> midi = torch.tensor([[60, 61, 62], [63, 64, 65]])
>>> midi_lengths = torch.tensor([3, 3])
>>> duration = torch.tensor([[100, 200, 300], [400, 500, 600]])
>>> duration_lengths = torch.tensor([3, 3])
>>> model = SyllableScoreFeats()
>>> output = model.forward(label, label_lengths, midi, midi_lengths,
...                         duration, duration_lengths)
>>> print(output)
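
Because each item in a batch can yield a different number of segments, returning tensors of shape (Batch, Frames) together with per-item lengths implies some form of padding. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the helper name `pad_batch` and the pad value of 0 are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad variable-length sequences to the length of the longest one.

    Returns the padded rows plus the original length of every row, so that
    downstream code can tell real values apart from padding.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [
        list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences
    ]
    return padded, lengths


# Two batch items with 4 and 2 segments respectively.
padded, lengths = pad_batch([[1, 2, 1, 3], [4, 5]])
print(padded)   # [[1, 2, 1, 3], [4, 5, 0, 0]]
print(lengths)  # [4, 2]
```

The returned lengths play the same role as the `*_lengths` outputs described above: they mark how many entries in each padded row are valid.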

get_parameters() → Dict[str, Any]

Retrieve the parameters of the SyllableScoreFeats instance.

This method returns a dictionary containing the parameters of the SyllableScoreFeats instance, which are used for feature extraction in syllable scoring tasks. The parameters include sampling rate, FFT size, hop length, window type, window length, and whether the STFT is centered.

Returns: A dictionary containing the parameters:
- fs (Union[int, str]): The sampling rate.
- n_fft (int): The size of the FFT.
- hop_length (int): The number of samples between each frame.
- window (str): The type of window applied to each frame.
- win_length (int): The length of each window.
- center (bool): Whether the STFT is centered.
Return type: Dict[str, Any]

Examples

>>> syllable_score_feats = SyllableScoreFeats()
>>> params = syllable_score_feats.get_parameters()
>>> print(params)
{
    'fs': 22050,
    'n_fft': 1024,
    'hop_length': 128,
    'window': 'hann',
    'win_length': 512,
    'center': True
}

Extracts segments from the provided label, midi, and duration tensors.

This method identifies the segments based on changes in the label and midi tensors, extracting corresponding values from the label, midi, and duration inputs. It returns the segmented values along with their lengths.

Parameters:
- label – A tensor of shape (Nsamples,) representing the label data.
- label_lengths – A tensor indicating the lengths of each sample in the label.
- midi – A tensor of shape (Nsamples,) representing the midi data.
- midi_lengths – A tensor indicating the lengths of each sample in the midi.
- duration – A tensor of shape (Nsamples,) representing the duration data.
- duration_lengths – A tensor indicating the lengths of each sample in the duration.
Returns: A tuple containing:
- seg_label: List of segmented labels.
- lengths: Number of segments for the labels.
- seg_midi: List of segmented midi values.
- lengths: Number of segments for the midi.
- seg_duration: List of segmented durations.
- lengths: Number of segments for the duration.
Return type: tuple

Examples

>>> model = SyllableScoreFeats()
>>> label = torch.tensor([0, 0, 1, 1, 0])
>>> label_lengths = torch.tensor(5)
>>> midi = torch.tensor([60, 60, 62, 62, 60])
>>> midi_lengths = torch.tensor(5)
>>> duration = torch.tensor([0.5, 0.5, 0.5, 0.5, 0.5])
>>> duration_lengths = torch.tensor(5)
>>> segments = model.get_segments(label, label_lengths, midi, midi_lengths,
...                               duration, duration_lengths)
>>> print(segments)
([0, 1, 0], 3, [60, 62, 60], 3, [0.5, 0.5, 0.5], 3)

NOTE

The input tensors should have matching lengths, and the function assumes the data is structured correctly. The function will raise an error if any of the input tensors do not match the expected shape or if any input is None.
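
The change-based segmentation described above can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the ESPnet implementation: the function name `segment_by_change` is hypothetical, it works on lists instead of tensors, and it assumes that each segment is represented by the value at its first position.

```python
from typing import List, Sequence, Tuple


def segment_by_change(
    label: Sequence[int],
    midi: Sequence[int],
    duration: Sequence[float],
) -> Tuple[List[int], int, List[int], int, List[float], int]:
    """Split aligned sequences wherever the label or the MIDI value changes."""
    assert len(label) == len(midi) == len(duration), "inputs must be aligned"
    n = len(label)
    if n == 0:
        return [], 0, [], 0, [], 0

    # A segment starts at index 0 and at every index where either
    # the label or the MIDI value differs from the previous position.
    starts = [0] + [
        i
        for i in range(1, n)
        if label[i] != label[i - 1] or midi[i] != midi[i - 1]
    ]

    # Assumption: one representative value per segment, taken at its start.
    seg_label = [label[s] for s in starts]
    seg_midi = [midi[s] for s in starts]
    seg_duration = [duration[s] for s in starts]
    count = len(starts)
    return seg_label, count, seg_midi, count, seg_duration, count


segments = segment_by_change(
    [0, 0, 1, 1, 0],
    [60, 60, 62, 62, 60],
    [0.5, 0.5, 0.5, 0.5, 0.5],
)
print(segments)
```

In this sketch the three returned counts are always equal, because the same segment boundaries are applied to all three sequences.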

output_size() → int

Returns the output size of the feature extraction process.

SyllableScoreFeats reports a fixed output size of 1. This can be useful for understanding the dimensionality of the output when using this feature extraction class.

Returns: The output size, which is always 1.
Return type: int

Examples

>>> syllable_score_feats = SyllableScoreFeats()
>>> output_size = syllable_score_feats.output_size()
>>> print(output_size)
1