espnet2.gan_svs.post_frontend.s3prl.S3prlPostFrontend
class espnet2.gan_svs.post_frontend.s3prl.S3prlPostFrontend(fs: int | str = 16000, input_fs: int | str = 24000, postfrontend_conf: dict | None = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: str | None = None, multilayer_feature: bool = False, layer: int = -1)
Bases: AbsFrontend
S3prlPostFrontend wraps a pretrained self-supervised learning (SSL) model for VISinger2 Plus. It is based on
S3prlFrontend and adds a resampler that converts the input audio from input_fs to the sample rate expected by the pretrained model (fs).
fs
The target sample rate for the model (default: 16000).
- Type: int
input_fs
The input audio sample rate (default: 24000).
- Type: int
multilayer_feature
Flag to indicate if multilayer features are used (default: False).
- Type: bool
layer
The specific layer to extract features from (default: -1).
- Type: int
upstream
The upstream model used for feature extraction.
- Type: S3PRLUpstream
featurizer
The featurizer that processes the upstream model’s output.
- Type: Featurizer
pretrained_params
A copy of the upstream model’s state dictionary.
- Type: dict
frontend_type
Type of the frontend, set to “s3prl”.
- Type: str
hop_length
The hop length used in the feature extraction.
- Type: int
tile_factor
The factor by which to tile the representations.
- Type: int
resampler
Resampler for audio input.
- Type: torchaudio.transforms.Resample
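To make the tile_factor attribute more concrete, here is a minimal plain-Python sketch of frame tiling. It assumes that tiling means repeating each feature frame tile_factor times along the time axis; the helper name `tile_frames` is hypothetical and not part of ESPnet.

```python
def tile_frames(frames, tile_factor):
    """Repeat each frame `tile_factor` times along the time axis.

    Hypothetical helper for illustration only; `frames` is a list of
    per-frame feature vectors (time-major).
    """
    if tile_factor < 1:
        raise ValueError("tile_factor must be a positive integer")
    # Each frame is copied so the tiled frames do not share list objects.
    return [list(frame) for frame in frames for _ in range(tile_factor)]


frames = [[1.0, 2.0], [3.0, 4.0]]  # 2 frames, 2 features each
tiled = tile_frames(frames, tile_factor=2)
print(len(tiled))  # 4 frames after tiling
```

With tile_factor=1 the sequence is returned unchanged, so tiling only matters when the feature frame rate has to be raised to match another stream.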
Parameters:
- fs (Union[int, str]) – Target sample rate (default: 16000).
- input_fs (Union[int, str]) – Input audio sample rate (default: 24000).
- postfrontend_conf (Optional[dict]) – Configuration dictionary for the postfrontend (default: the dictionary shown in the class signature).
- download_dir (Optional[str]) – Directory to download pretrained models (default: None).
- multilayer_feature (bool) – Flag to indicate if multilayer features are extracted (default: False).
- layer (int) – Specific layer to extract features from (default: -1).
Raises: ImportError – If S3PRL is not properly installed.
Example
>>> from espnet2.gan_svs.post_frontend.s3prl import S3prlPostFrontend
>>> # Initialize S3prlPostFrontend with default parameters
>>> s3prl_frontend = S3prlPostFrontend()
>>> # Initialize with custom parameters
>>> custom_frontend = S3prlPostFrontend(fs=22050, input_fs=44100,
...                                     multilayer_feature=True)
NOTE: The upstream models in S3PRL currently support only 16 kHz audio.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]
Processes input audio through the S3PRL model to extract features.
This method takes audio input and its corresponding lengths, resamples the audio if necessary, and retrieves features from the S3PRL upstream model. The features can be extracted from a specific layer or as a multi-layer representation depending on the class configuration.
- Parameters:
- input (torch.Tensor) – A tensor containing the input audio data, typically of shape (batch_size, num_samples).
- input_lengths (torch.Tensor) – A tensor containing the lengths of the input audio for each batch element, typically of shape (batch_size,).
- Returns: A tuple containing:
- feats (torch.Tensor): The extracted features from the S3PRL model; the shape depends on the configuration.
- feats_lens (torch.Tensor): The lengths of the extracted features for each batch element, shape (batch_size,).
- Return type: Tuple[torch.Tensor, torch.Tensor]
- Raises: AssertionError – If the input audio does not have the expected shape, or if the layer selection conflicts with the multilayer feature setting.
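The difference between extracting a specific layer and building a multi-layer representation can be illustrated with a small plain-Python sketch. It rests on several assumptions that the text above does not confirm: that a specific layer and multilayer_feature are mutually exclusive, that the last layer is used when neither is requested, and that the multi-layer representation is a softmax-weighted sum across layers. The helper `combine_layers` and its `weights` argument are hypothetical.

```python
import math


def combine_layers(layer_feats, weights, multilayer_feature=False, layer=-1):
    """Pick one layer or mix all layers into one feature sequence.

    Hypothetical helper: `layer_feats[l][t]` is the feature vector of
    layer `l` at frame `t`; `weights` holds one scalar per layer.
    """
    if layer != -1:
        # Assumed rule: a specific layer and multilayer mixing are exclusive.
        assert not multilayer_feature, "layer and multilayer_feature conflict"
        return layer_feats[layer]
    if not multilayer_feature:
        return layer_feats[-1]  # Assumed default: last layer only.
    # Softmax-normalize the per-layer weights, then take the weighted sum.
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    norm = [e / total for e in exps]
    mixed = []
    for vectors in zip(*layer_feats):  # One tuple of per-layer vectors per frame.
        dim = len(vectors[0])
        mixed.append(
            [sum(n * vec[d] for n, vec in zip(norm, vectors)) for d in range(dim)]
        )
    return mixed


feats = [[[1.0, 2.0]], [[3.0, 4.0]]]  # 2 layers, 1 frame, 2 features
print(combine_layers(feats, [0.0, 0.0], multilayer_feature=True))
```

With equal weights the mix is the plain average of the layers; in a trainable setup the weights would be learned parameters.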
Example
>>> import torch
>>> model = S3prlPostFrontend()
>>> audio_input = torch.randn(2, 24000)  # Batch of two 1-second clips at 24 kHz
>>> input_lengths = torch.tensor([24000, 24000])  # Length of each clip in samples
>>> features, lengths = model(audio_input, input_lengths)  # Calls forward()
NOTE: The input audio must be at the sample rate defined by input_fs. If fs (the model’s required sample rate) differs from input_fs, the input audio is resampled accordingly.
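As a rough illustration of what resampling does to sequence lengths, the sketch below estimates the number of samples after converting from input_fs to fs. It assumes the length scales by fs / input_fs and is rounded up to a whole sample; the helper `resampled_length` is hypothetical, and the exact length produced by the resampler may differ.

```python
def resampled_length(num_samples, input_fs=24000, fs=16000):
    """Estimate the number of samples after resampling input_fs -> fs.

    Hypothetical helper; assumes the length scales by fs / input_fs and
    is rounded up to a whole sample.
    """
    if input_fs == fs:
        return num_samples  # No resampling needed.
    # Integer ceiling division avoids floating-point rounding issues.
    return (num_samples * fs + input_fs - 1) // input_fs


print(resampled_length(24000))  # 16000: one second of audio at 16 kHz
```

This is why a batch with 24000 samples per clip at the default input_fs corresponds to roughly 16000 samples per clip at the default fs.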
output_size() → int
Get the output size of the feature extractor.
This method retrieves the output size from the featurizer component of the S3prlPostFrontend. The output size is determined by the configuration of the upstream model used for feature extraction.
- Returns: The size of the output features produced by the featurizer.
- Return type: int
Example
>>> s3prl_frontend = S3prlPostFrontend()
>>> output_size = s3prl_frontend.output_size()
>>> print(output_size)  # e.g. 512; the actual value depends on the upstream model
reload_pretrained_parameters()
Reloads the pretrained parameters of the S3PRL frontend model.
This method is useful when you want to reset the model to its original pretrained state. It loads the parameters stored in self.pretrained_params back into the upstream model, allowing for experimentation with different initializations or restoring the model after fine-tuning.
Example
>>> model = S3prlPostFrontend()
>>> model.reload_pretrained_parameters() # Reloads the original parameters
NOTE: Ensure that the model is properly initialized before calling this method to avoid loading errors.
- Raises: RuntimeError – If the model’s state_dict does not match the expected structure during loading.
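The snapshot-and-restore pattern behind reload_pretrained_parameters can be sketched without any deep learning library. The class `ToyUpstream` below is a hypothetical stand-in that mimics only the state_dict() / load_state_dict() interface; it is not the S3PRL upstream, and the key check is a simplified assumption about when loading fails.

```python
import copy


class ToyUpstream:
    """Hypothetical stand-in exposing state_dict() and load_state_dict()."""

    def __init__(self):
        self.params = {"weight": [0.1, 0.2], "bias": [0.0]}

    def state_dict(self):
        return self.params

    def load_state_dict(self, state):
        # Simplified assumption: loading fails when the keys do not match.
        if state.keys() != self.params.keys():
            raise RuntimeError("state dict keys do not match the model")
        self.params = copy.deepcopy(state)


upstream = ToyUpstream()
# Snapshot the pretrained parameters right after construction.
pretrained_params = copy.deepcopy(upstream.state_dict())

upstream.params["weight"][0] = 9.9  # Simulate a fine-tuning update.
upstream.load_state_dict(pretrained_params)  # Restore the snapshot.
print(upstream.state_dict()["weight"])  # [0.1, 0.2]
```

The deep copy matters: without it the snapshot would share storage with the live parameters and would change along with them during fine-tuning.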