espnet2.gan_svs.post_frontend.s3prl.S3prlPostFrontend


class espnet2.gan_svs.post_frontend.s3prl.S3prlPostFrontend(fs: int | str = 16000, input_fs: int | str = 24000, postfrontend_conf: dict | None = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: str | None = None, multilayer_feature: bool = False, layer: int = -1)

Bases: AbsFrontend

S3prlPostFrontend is a pretrained SSL model for VISinger2 Plus. It is based on S3prlFrontend and adds a resampler that resamples the input audio to the sample rate of the pretrained model.

fs

The target sample rate for the model (default: 16000).

Type: int

input_fs

The input audio sample rate (default: 24000).

Type: int

multilayer_feature

Flag to indicate if multilayer features are used (default: False).

Type: bool

layer

The specific layer to extract features from (default: -1).

Type: int

upstream

The upstream model used for feature extraction.

Type: S3PRLUpstream

featurizer

The featurizer that processes the upstream model’s output.

Type: Featurizer

pretrained_params

A copy of the upstream model’s state dictionary.

Type: dict

frontend_type

Type of the frontend, set to “s3prl”.

Type: str

hop_length

The hop length used in the feature extraction.

Type: int

tile_factor

The factor by which to tile the representations.

Type: int

resampler

Resampler for audio input.

Type: torchaudio.transforms.Resample
Parameters:
- fs (Union[int, str]) – Target sample rate (default: 16000).
- input_fs (Union[int, str]) – Input audio sample rate (default: 24000).
- postfrontend_conf (Optional[dict]) – Configuration dictionary for the postfrontend (default: the dictionary shown in the class signature).
- download_dir (Optional[str]) – Directory to download pretrained models to (default: None).
- multilayer_feature (bool) – Flag to indicate if multilayer features are extracted (default: False).
- layer (int) – Specific layer to extract features from (default: -1).
Raises: ImportError – If S3PRL is not properly installed.
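
The `tile_factor` attribute is easiest to picture on a toy sequence. The sketch below is a minimal illustration in plain Python, assuming that tiling means repeating each feature frame `tile_factor` times in place, so that the time axis grows by that factor. The helper name `tile_frames` is hypothetical and is not part of ESPnet.

```python
from typing import List, Sequence


def tile_frames(frames: Sequence[Sequence[float]], tile_factor: int) -> List[List[float]]:
    """Repeat every frame `tile_factor` times along the time axis.

    Hypothetical helper for illustration only; assumes frame-wise repetition.
    """
    if tile_factor < 1:
        raise ValueError("tile_factor must be a positive integer")
    tiled: List[List[float]] = []
    for frame in frames:
        # Copy the frame so the repeated entries do not share one list object.
        tiled.extend(list(frame) for _ in range(tile_factor))
    return tiled


frames = [[0.1, 0.2], [0.3, 0.4]]           # 2 frames, feature dimension 2
tiled = tile_frames(frames, tile_factor=2)  # 4 frames, feature dimension 2
```

Under this assumption the feature dimension is unchanged and only the number of frames is multiplied.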

Example

>>> # Initialize S3prlPostFrontend with default parameters
>>> s3prl_frontend = S3prlPostFrontend()
>>> # Initialize with custom parameters
>>> custom_frontend = S3prlPostFrontend(fs=22050, input_fs=44100,
...                                     multilayer_feature=True)

NOTE: The upstream models in S3PRL currently only support 16 kHz audio.

forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]

Processes input audio through the S3PRL model to extract features.

This method takes audio input and its corresponding lengths, resamples the audio if necessary, and retrieves features from the S3PRL upstream model. The features can be extracted from a specific layer or as a multi-layer representation depending on the class configuration.

Parameters:
- input (torch.Tensor) – A tensor containing the input audio data, typically of shape (batch_size, num_samples).
- input_lengths (torch.Tensor) – A tensor containing the lengths of the input audio for each batch element, typically of shape (batch_size,).
Returns: A tuple containing:
- feats (torch.Tensor): The extracted features from the S3PRL model; the shape depends on the configuration.
- feats_lens (torch.Tensor): The lengths of the extracted features for each batch element, shape (batch_size,).
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises:
- AssertionError – If the input audio does not meet the expected shape, or if the layer selection conflicts with the multilayer feature setting.
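
The interaction between `layer` and `multilayer_feature` can be sketched without S3PRL. The following is a minimal illustration in plain Python, assuming that `layer=-1` means no specific layer was requested, that a multilayer representation is a weighted sum over all layers, and that otherwise only the last layer is used. The helper `select_features` and its uniform weights are hypothetical stand-ins, not the ESPnet or S3PRL implementation.

```python
from typing import List

Frames = List[List[float]]  # (time, feature_dim) for one utterance


def select_features(hidden: List[Frames], layer: int = -1,
                    multilayer_feature: bool = False) -> Frames:
    """Pick features from per-layer outputs (hypothetical helper)."""
    if layer != -1:
        # A fixed layer and a multilayer combination are mutually exclusive.
        assert not multilayer_feature, "layer conflicts with multilayer_feature"
        return hidden[layer]
    if not multilayer_feature:
        return hidden[-1]  # assumed fallback: last layer only
    # Assumed combination: uniform weighted sum across layers.
    weight = 1.0 / len(hidden)
    return [
        [weight * sum(values) for values in zip(*frames_at_t)]
        for frames_at_t in zip(*hidden)
    ]


hidden = [[[1.0, 2.0]], [[3.0, 4.0]]]  # 2 layers, 1 frame, dimension 2
single = select_features(hidden, layer=0)
last = select_features(hidden)
mixed = select_features(hidden, multilayer_feature=True)
```

Requesting both a specific layer and `multilayer_feature=True` trips the assertion in this sketch, which mirrors the conflict described in the Raises entry.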

Example

>>> model = S3prlPostFrontend()
>>> audio_input = torch.randn(2, 24000)  # Example batch of audio
>>> input_lengths = torch.tensor([24000, 24000])  # Lengths of inputs
>>> features, lengths = model.forward(audio_input, input_lengths)

NOTE: The input audio must be at the sample rate given by input_fs. If fs (the sample rate required by the model) differs from input_fs, the input audio is resampled to fs.
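
The effect of resampling on sequence lengths can be estimated with simple arithmetic. The sketch below assumes that the resampled length is the input length scaled by `fs / input_fs` and rounded up. The rounding rule and the helper name `resampled_length` are assumptions for illustration, and the actual frontend may round differently.

```python
def resampled_length(num_samples: int, input_fs: int = 24000, fs: int = 16000) -> int:
    """Estimate the number of samples after resampling from input_fs to fs."""
    if input_fs == fs:
        return num_samples  # no resampling needed
    # Ceiling division in integer arithmetic avoids floating-point error.
    return (num_samples * fs + input_fs - 1) // input_fs


# One second and roughly half a second of 24 kHz audio, mapped to 16 kHz.
lengths = [resampled_length(n) for n in (24000, 12001)]
```

With the default rates, one second of 24 kHz audio (24000 samples) maps to 16000 samples at 16 kHz.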

output_size() → int

Get the output size of the feature extractor.

This method retrieves the output size from the featurizer component of the S3prlPostFrontend. The output size is determined by the configuration of the upstream model used for feature extraction.

Returns: The size of the output features produced by the featurizer.
Return type: int

Example

>>> s3prl_frontend = S3prlPostFrontend()
>>> output_size = s3prl_frontend.output_size()
>>> print(output_size)
512  # Example output size based on the upstream model configuration.

reload_pretrained_parameters()

Reloads the pretrained parameters of the S3PRL frontend model.

This method is useful when you want to reset the model to its original pretrained state. It loads the parameters stored in self.pretrained_params back into the upstream model, allowing for experimentation with different initializations or restoring the model after fine-tuning.

Example

>>> model = S3prlPostFrontend()
>>> model.reload_pretrained_parameters()  # Reloads the original parameters

NOTE: Ensure that the model is properly initialized before calling this method to avoid loading errors.

Raises:
- RuntimeError – If the model’s state_dict does not match the expected structure during loading.
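
The snapshot-and-restore pattern behind this method can be sketched with a plain dictionary in place of real model weights. This is a minimal illustration, assuming that the pretrained parameters are deep-copied once at construction time and later loaded back; `TinyUpstream` is a hypothetical stand-in and not the S3PRL upstream class.

```python
import copy


class TinyUpstream:
    """Stand-in for a model whose parameters live in a plain dict."""

    def __init__(self):
        self.params = {"weight": [0.5, -0.5], "bias": [0.0]}

    def state_dict(self):
        return self.params

    def load_state_dict(self, state):
        # Mirror a strict load: reject mismatched parameter names.
        if set(state) != set(self.params):
            raise RuntimeError("state_dict keys do not match the model")
        self.params = copy.deepcopy(state)


upstream = TinyUpstream()
pretrained_params = copy.deepcopy(upstream.state_dict())  # snapshot at init

upstream.params["weight"][0] = 9.9           # simulate fine-tuning drift
upstream.load_state_dict(pretrained_params)  # restore the snapshot
```

The deep copy matters in this sketch: without it the snapshot would alias the live parameters and drift along with them.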