espnet2.gan_svs.visinger2.visinger2_vocoder.VISinger2Discriminator
class espnet2.gan_svs.visinger2.visinger2_vocoder.VISinger2Discriminator(scales: int = 1, scale_downsample_pooling: str = 'AvgPool1d', scale_downsample_pooling_params: Dict[str, Any] = {'kernel_size': 4, 'padding': 2, 'stride': 2}, scale_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, follow_official_norm: bool = True, periods: List[int] = [2, 3, 5, 7, 11], period_discriminator_params: Dict[str, Any] = {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, multi_freq_disc_params: Dict[str, Any] = {'divisors': [32, 16, 8, 4, 2, 1, 1], 'domain': 'double', 'hidden_channels': [256, 512, 512], 'hop_length_factors': [4, 8, 16], 'mel_scale': True, 'sample_rate': 22050, 'strides': [1, 2, 1, 2, 1, 2, 1]})
Bases: Module
Discriminator module for VISinger2, including MSD, MPD, and MFD.
This class implements a multi-scale, multi-period, and multi-frequency discriminator for the VISinger2 vocoder architecture. It combines the outputs of these sub-discriminators to evaluate the quality of generated audio signals.
- Parameters:
- scales (int) – Number of scales to be used in the multi-scale discriminator.
- scale_downsample_pooling (str) – Type of pooling used for downsampling.
- scale_downsample_pooling_params (Dict[str, Any]) – Parameters for the downsampling pooling layer.
- scale_discriminator_params (Dict[str, Any]) – Parameters for the scale discriminator.
- follow_official_norm (bool) – Whether to follow the official normalization.
- periods (List[int]) – List of periods to be used in the multi-period discriminator.
- period_discriminator_params (Dict[str, Any]) – Parameters for the period discriminator.
- multi_freq_disc_params (Dict[str, Any]) – Parameters for the multi-frequency discriminator.
- use_spectral_norm (bool) – Whether to use spectral normalization or not.
- Returns: The outputs from the various discriminators.
- Return type: List[Tensor]
####### Examples
>>> discriminator = VISinger2Discriminator(scales=2)
>>> input_tensor = torch.randn(1, 1, 22050) # Example input
>>> outputs = discriminator(input_tensor)
>>> print(len(outputs)) # Outputs from different discriminators
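The defaults above target 22.05 kHz audio. As a hedged sketch of adapting the multi-frequency discriminator to another sampling rate (the concrete values below are illustrative, not recommended settings):
>>> discriminator = VISinger2Discriminator(
...     multi_freq_disc_params={
...         "hidden_channels": [256, 512, 512],
...         "divisors": [32, 16, 8, 4, 2, 1, 1],
...         "strides": [1, 2, 1, 2, 1, 2, 1],
...         "sample_rate": 44100,  # match the training data
...         "hop_length_factors": [4, 8, 16],
...         "mel_scale": True,
...         "domain": "double",
...     },
... )
>>> x = torch.randn(2, 1, 44100)  # (B, 1, T) batch of 1-second waveforms
>>> outputs = discriminator(x)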
NOTE
The multi-scale discriminator is implemented using HiFiGAN’s HiFiGANMultiScaleDiscriminator, the multi-period discriminator with HiFiGANMultiPeriodDiscriminator, and the multi-frequency discriminator with MultiFrequencyDiscriminator.
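As a rough, self-contained illustration of this composition (not the actual ESPnet implementation; the attribute names msd, mpd, and mfd and the way outputs are concatenated are assumptions for clarity):
>>> import torch
>>> class ToyCompositeDiscriminator(torch.nn.Module):
...     # Simplified stand-in for the MSD + MPD + MFD layout described above.
...     def __init__(self, msd, mpd, mfd):
...         super().__init__()
...         self.msd = msd  # e.g. HiFiGANMultiScaleDiscriminator
...         self.mpd = mpd  # e.g. HiFiGANMultiPeriodDiscriminator
...         self.mfd = mfd  # e.g. MultiFrequencyDiscriminator
...     def forward(self, x):
...         # Each sub-discriminator returns a list of outputs; the composite
...         # simply concatenates the three lists.
...         return list(self.msd(x)) + list(self.mpd(x)) + list(self.mfd(x))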
forward(x)
Calculate forward propagation.
This method runs the input waveform through the multi-scale, multi-period, and multi-frequency discriminators and collects their outputs.
- Parameters: x (Tensor) – Input waveform tensor (B, 1, T), where B is the batch size and T is the number of time steps.
- Returns: The outputs from the various discriminators.
- Return type: List[Tensor]
####### Examples
>>> discriminator = VISinger2Discriminator()
>>> x = torch.randn(1, 1, 22050)  # Example waveform input (B, 1, T)
>>> outputs = discriminator.forward(x)
>>> print(len(outputs))  # Number of outputs from the sub-discriminators
NOTE
Ensure that the input tensor is shaped (B, 1, T) as expected; shape mismatches can lead to runtime errors.
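For context, the list returned by forward is normally consumed by an adversarial loss (ESPnet provides its own loss modules for this). A minimal least-squares (LSGAN-style) sketch, assuming each list element is either a score tensor or a list of feature maps whose last entry is the score tensor:
>>> import torch
>>> def lsgan_discriminator_loss(real_outs, fake_outs):
...     # Take the final tensor of each sub-discriminator output as its score.
...     def _score(out):
...         return out[-1] if isinstance(out, (list, tuple)) else out
...     loss = 0.0
...     for real, fake in zip(real_outs, fake_outs):
...         r, f = _score(real), _score(fake)
...         loss = loss + torch.mean((r - 1.0) ** 2) + torch.mean(f ** 2)
...     return loss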