espnet2.svs.feats_extract.score_feats_extract.FrameScoreFeats
class espnet2.svs.feats_extract.score_feats_extract.FrameScoreFeats(fs: int | str = 22050, n_fft: int = 1024, win_length: int = 512, hop_length: int = 128, window: str = 'hann', center: bool = True)
Bases: AbsFeatsExtract
FrameScoreFeats is a feature extraction class for frame-level scoring of audio.
This class inherits from AbsFeatsExtract and is designed to perform feature extraction on audio signals by applying Short-Time Fourier Transform (STFT) techniques. It allows for configuration of various parameters such as sample rate, FFT size, window length, hop length, and window type.
fs
The sampling frequency of the audio signal.
- Type: Union[int, str]
n_fft
The number of FFT points.
- Type: int
win_length
The length of the window for STFT.
- Type: int
hop_length
The number of samples between adjacent frames.
- Type: int
window
The type of window to use for STFT.
- Type: str
center
Whether to center the input for STFT.
- Type: bool
Parameters:
- fs (Union[int, str], optional) – The sampling frequency. Defaults to 22050.
- n_fft (int, optional) – The number of FFT points. Defaults to 1024.
- win_length (int, optional) – The length of the window. Defaults to 512.
- hop_length (int, optional) – The hop length. Defaults to 128.
- window (str, optional) – The type of window. Defaults to "hann".
- center (bool, optional) – Whether to center the input. Defaults to True.
Returns: The output size of the feature extraction.
Return type: int
############### Examples
>>> frame_score_feats = FrameScoreFeats()
>>> input_tensor = torch.randn(10, 100, 20) # (Batch, Nsamples, Label_dim)
>>> output, olens = frame_score_feats.label_aggregate(input_tensor)
- Raises: ValueError – If input lengths are not consistent with the expected shapes.
NOTE
The default behavior of label aggregation is compatible with torch.stft regarding framing and padding.
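Since aggregation follows torch.stft framing with center=True, the number of output frames can be sketched with simple arithmetic (the values below are illustrative; this is an assumption about the framing convention, not espnet's exact code):

```python
# Sketch of torch.stft-style frame counting with center=True.
hop_length = 128          # FrameScoreFeats default hop_length
n_samples = 1000          # arbitrary example signal length

# With center=True the signal is padded on both sides, so every
# hop position up to and including n_samples yields a frame:
n_frames = n_samples // hop_length + 1
print(n_frames)  # → 8
```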
extra_repr()
Returns a string representation of the FrameScoreFeats parameters.
This method provides a detailed representation of the important parameters of the FrameScoreFeats class, which can be useful for debugging and logging purposes. It includes the window length, hop length, and whether centering is applied.
win_length
The length of the window used in the Short-Time Fourier Transform (STFT).
- Type: int
hop_length
The number of samples to skip between successive frames.
- Type: int
center
Whether the signal is padded such that frames are centered at the original time step.
- Type: bool
Returns: A string representation of the FrameScoreFeats parameters.
Return type: str
############### Examples
>>> frame_score_feats = FrameScoreFeats(win_length=512,
... hop_length=128,
... center=True)
>>> print(frame_score_feats.extra_repr())
win_length=512, hop_length=128, center=True,
forward(label: Tensor | None = None, label_lengths: Tensor | None = None, midi: Tensor | None = None, midi_lengths: Tensor | None = None, duration: Tensor | None = None, duration_lengths: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
FrameScoreFeats forward function.
This method processes the input tensors representing labels, midi, and duration by aggregating them into frames. It handles padding and aggregation of label data based on specified lengths.
- Parameters:
- label – A tensor of shape (Batch, Nsamples) representing the labels.
- label_lengths – A tensor of shape (Batch) containing the lengths of each label sequence.
- midi – A tensor of shape (Batch, Nsamples) representing the MIDI data.
- midi_lengths – A tensor of shape (Batch) containing the lengths of each MIDI sequence.
- duration – A tensor of shape (Batch, Nsamples) representing the duration data.
- duration_lengths – A tensor of shape (Batch) containing the lengths of each duration sequence.
- Returns:
- label: A tensor of shape (Batch, Frames) for aggregated labels.
- label_lengths: A tensor of shape (Batch) for aggregated label lengths.
- midi: A tensor of shape (Batch, Frames) for aggregated MIDI data.
- midi_lengths: A tensor of shape (Batch) for aggregated MIDI lengths.
- duration: A tensor of shape (Batch, Frames) for aggregated duration data.
- duration_lengths: A tensor of shape (Batch) for aggregated duration lengths.
- Return type: Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
############### Examples
>>> frame_score_feats = FrameScoreFeats()
>>> label_tensor = torch.rand(2, 100)
>>> label_lengths_tensor = torch.tensor([100, 80])
>>> midi_tensor = torch.rand(2, 100)
>>> midi_lengths_tensor = torch.tensor([100, 80])
>>> duration_tensor = torch.rand(2, 100)
>>> duration_lengths_tensor = torch.tensor([100, 80])
>>> outputs = frame_score_feats.forward(
... label=label_tensor,
... label_lengths=label_lengths_tensor,
... midi=midi_tensor,
... midi_lengths=midi_lengths_tensor,
... duration=duration_tensor,
... duration_lengths=duration_lengths_tensor
... )
get_parameters() → Dict[str, Any]
Retrieves the parameters of the FrameScoreFeats instance.
This method returns a dictionary containing the parameters used for feature extraction in the FrameScoreFeats class. The parameters include the sampling frequency, FFT size, hop length, window type, window length, and whether to center the frames.
- Returns: A dictionary with the following keys:
- fs: Sampling frequency.
- n_fft: Number of FFT points.
- hop_length: Number of samples between frames.
- window: Window type used for STFT.
- win_length: Length of each window.
- center: Whether the frames are centered.
- Return type: dict
############### Examples
>>> frame_score_feats = FrameScoreFeats(fs=44100, n_fft=2048)
>>> params = frame_score_feats.get_parameters()
>>> print(params)
{'fs': 44100, 'n_fft': 2048, 'hop_length': 128,
'window': 'hann', 'win_length': 512, 'center': True}
label_aggregate(input: Tensor, input_lengths: Tensor | None = None) → Tuple[Tensor, Tensor]
Aggregates labels over frames by summing across the label dimension.
This method takes an input tensor representing labels and aggregates them over frames, optionally considering input lengths for masking. The aggregation is performed by summing the values in the label dimension for each frame.
Parameters:
- input – A tensor of shape (Batch, Nsamples, Label_dim) representing the input labels to be aggregated.
- input_lengths – A tensor of shape (Batch) representing the lengths of each input sequence. This is used for masking during aggregation.
Returns:
- output: A tensor of shape (Batch, Frames, Label_dim) representing the aggregated labels for each frame.
- olens: A tensor representing the lengths of the output sequences after aggregation, or None if input_lengths is not provided.
Return type: Tuple[Tensor, Optional[Tensor]]
NOTE
The default behavior of label aggregation is compatible with torch.stft regarding framing and padding.
############### Examples
>>> frame_score_feats = FrameScoreFeats()
>>> input_tensor = torch.randn(2, 10, 5)  # (Batch, Nsamples, Label_dim)
>>> input_lengths = torch.tensor([10, 8])
>>> output, olens = frame_score_feats.label_aggregate(input_tensor, input_lengths)
>>> print(output.shape)  # (Batch, Frames, Label_dim)
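For intuition, the sum-based aggregation described above can be sketched in plain Python. This is a simplified 1-D model assuming torch.stft-style centered framing; the `label_aggregate` function below is a hypothetical stand-in, not the actual espnet implementation, which operates on (Batch, Nsamples, Label_dim) tensors:

```python
def label_aggregate(samples, win_length=4, hop_length=2):
    """Sum each centered window of `samples`, producing one value per frame.

    Simplified 1-D sketch of sum-based frame aggregation.
    """
    pad = win_length // 2
    # center=True: pad both ends so frames are centered on the signal
    padded = [0.0] * pad + list(samples) + [0.0] * pad
    # one frame per hop position, inclusive of the final position
    n_frames = len(samples) // hop_length + 1
    return [
        sum(padded[i * hop_length : i * hop_length + win_length])
        for i in range(n_frames)
    ]

print(label_aggregate([1.0] * 10))  # → [2.0, 4.0, 4.0, 4.0, 4.0, 2.0]
```

Edge frames sum fewer nonzero samples because of the zero padding, which mirrors how centered STFT framing treats signal boundaries.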
output_size() → int
Returns the output size of the feature extraction.
This method provides the output size for the feature extraction process, which is typically used to determine the dimensions of the resulting tensors after the forward pass.
- Returns: The output size, which is always 1 for this implementation.
- Return type: int
############### Examples
>>> frame_score_feats = FrameScoreFeats()
>>> size = frame_score_feats.output_size()
>>> print(size)
1