espnet2.gan_svs.visinger2.visinger2_vocoder.VISinger2VocoderGenerator
class espnet2.gan_svs.visinger2.visinger2_vocoder.VISinger2VocoderGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], n_harmonic: int = 64, use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True)
Bases: Module
VISinger2 Vocoder Generator Module.
This class implements the VISinger2 vocoder generator, which builds on the HiFi-GAN generator architecture. It synthesizes audio from input features using residual blocks and transposed-convolution upsampling, driven by a DDSP harmonic excitation input.
This implementation is based on the VISinger2 project, available at: https://github.com/zhangyongmao/VISinger2.
upsample_factor
Total upsample factor calculated from the upsample scales.
- Type: int
num_upsamples
Number of upsampling layers.
- Type: int
num_blocks
Number of residual blocks.
- Type: int
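These derived attributes follow directly from the constructor arguments; with the defaults above (a quick sanity check, not library code):

>>> import math
>>> upsample_scales = [8, 8, 2, 2]
>>> resblock_kernel_sizes = [3, 7, 11]
>>> math.prod(upsample_scales)  # upsample_factor
256
>>> len(upsample_scales)  # num_upsamples
4
>>> len(resblock_kernel_sizes)  # num_blocks
3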
Parameters:
- in_channels (int) – Number of input channels (default: 80).
- out_channels (int) – Number of output channels (default: 1).
- channels (int) – Number of hidden representation channels (default: 512).
- global_channels (int) – Number of global conditioning channels (default: -1).
- kernel_size (int) – Kernel size of initial and final conv layer (default: 7).
- upsample_scales (List[int]) – List of upsampling scales (default: [8, 8, 2, 2]).
- upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers (default: [16, 16, 4, 4]).
- resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks (default: [3, 7, 11]).
- resblock_dilations (List[List[int]]) – List of lists of dilations for residual blocks (default: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]).
- n_harmonic (int) – Number of harmonics used to synthesize a sound signal (default: 64).
- use_additional_convs (bool) – Whether to use additional conv layers in residual blocks (default: True).
- bias (bool) – Whether to add bias parameter in convolution layers (default: True).
- nonlinear_activation (str) – Activation function module name (default: “LeakyReLU”).
- nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for the activation function (default: {"negative_slope": 0.1}).
- use_weight_norm (bool) – Whether to use weight norm (default: True). If set to true, it will be applied to all of the conv layers.
############# Examples
>>> # Initialize the generator.
>>> generator = VISinger2VocoderGenerator()
>>> # Forward pass with input tensors.
>>> output = generator(c, ddsp, g)
- Raises: AssertionError – If the kernel size is not an odd number or if the lengths of upsample_scales and upsample_kernel_sizes do not match.
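In code, these checks amount to something like the following (a paraphrase of the usual HiFi-GAN-style assertions, not the verbatim source):

>>> kernel_size = 7
>>> upsample_scales, upsample_kernel_sizes = [8, 8, 2, 2], [16, 16, 4, 4]
>>> assert kernel_size % 2 == 1, "Kernel size must be odd."
>>> assert len(upsample_scales) == len(upsample_kernel_sizes)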
######## NOTE This implementation follows the official HiFi-GAN generator structure while integrating VISinger2-specific features.
apply_weight_norm()
Apply weight normalization module to all convolutional layers.
This method iterates through all the layers of the module and applies weight normalization to each layer that is an instance of either torch.nn.Conv1d or torch.nn.ConvTranspose1d. This normalization can help stabilize training by reducing the risk of exploding or vanishing gradients.
The weight normalization is implemented using torch.nn.utils.weight_norm.
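In sketch form, the traversal looks like this (assuming torch.nn.utils.weight_norm, the classic API, which recent PyTorch versions deprecate in favor of torch.nn.utils.parametrizations.weight_norm):

>>> import torch
>>> def _apply_weight_norm(m):
...     # Only conv layers are wrapped; everything else is left untouched.
...     if isinstance(m, (torch.nn.Conv1d, torch.nn.ConvTranspose1d)):
...         torch.nn.utils.weight_norm(m)
>>> model = VISinger2VocoderGenerator(use_weight_norm=False)
>>> _ = model.apply(_apply_weight_norm)  # .apply() recurses over submodules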
############# Examples
>>> model = VISinger2VocoderGenerator()
>>> model.apply_weight_norm()
# This applies weight normalization to all applicable layers in the model.
######## NOTE Weight normalization is generally recommended for stabilizing the training of GANs and similar architectures.
forward(c, ddsp, g: Tensor | None = None) → Tensor
Calculate forward propagation.
This method computes the forward pass for the VISinger2 Vocoder Generator, processing input tensors through several layers of convolution and residual blocks to produce an output tensor.
- Parameters:
- c (Tensor) – Input tensor of shape (B, in_channels, T).
- ddsp (Tensor) – Input tensor of shape (B, n_harmonic + 2, T * hop_length).
- g (Optional[Tensor]) – Global conditioning tensor of shape (B, global_channels, 1). Defaults to None.
- Returns: Output tensor of shape (B, out_channels, T * upsample_factor).
- Return type: Tensor
############# Examples
>>> generator = VISinger2VocoderGenerator()
>>> c = torch.randn(1, 80, 4)  # (B, in_channels, T)
>>> ddsp = torch.randn(1, 66, 1024)  # (B, n_harmonic + 2, T * 256)
>>> output = generator(c, ddsp)
>>> print(output.shape)  # T upsampled by prod([8, 8, 2, 2]) = 256
torch.Size([1, 1, 1024])
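To use global conditioning, enable it at construction time and pass g; a hedged sketch reusing c and ddsp from above (the value 256 for global_channels is an arbitrary illustration):

>>> generator = VISinger2VocoderGenerator(global_channels=256)
>>> g = torch.randn(1, 256, 1)  # (B, global_channels, 1)
>>> output = generator(c, ddsp, g)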
remove_weight_norm()
Remove weight normalization module from all of the layers.
This method iterates through all the layers of the model and removes weight normalization if it has been applied. It is useful when you want to switch from weight-normalized layers back to standard layers, typically before saving the model or for inference purposes.
- Raises: None – If a layer does not have weight normalization applied, the resulting ValueError is caught and logged, but no exception is raised.
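The catch-and-log behavior can be pictured as follows (a minimal sketch, assuming torch.nn.utils.remove_weight_norm, the counterpart of the classic weight_norm API):

>>> import logging, torch
>>> def _remove_weight_norm(m):
...     try:
...         torch.nn.utils.remove_weight_norm(m)
...     except ValueError:
...         # Module had no weight norm applied; log and move on.
...         logging.debug("Skipping %s: no weight norm to remove.", m)
>>> model = VISinger2VocoderGenerator()
>>> _ = model.apply(_remove_weight_norm)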
############# Examples
>>> model = VISinger2VocoderGenerator(use_weight_norm=True)
>>> model.remove_weight_norm() # Removes weight normalization from all layers
######## NOTE The removal of weight normalization can affect the performance of the model, so it should be used with caution. Make sure to validate the model after this operation.
reset_parameters()
Reset parameters.
This initialization follows the official implementation: the weights of convolutional layers are re-drawn from a normal distribution with mean 0 and standard deviation 0.01.
The method iterates through all modules of the model and applies the parameter reset to each convolutional layer.
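In essence, the reset is (a sketch of the pattern, assuming model is an instance):

>>> import torch
>>> def _reset_parameters(m):
...     if isinstance(m, (torch.nn.Conv1d, torch.nn.ConvTranspose1d)):
...         m.weight.data.normal_(0.0, 0.01)  # mean 0, std 0.01
>>> model = VISinger2VocoderGenerator()
>>> _ = model.apply(_reset_parameters)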
############# Examples
>>> model = VISinger2VocoderGenerator()
>>> model.reset_parameters() # Reset parameters to their initial values
######## NOTE This method is typically called during the initialization of the model to ensure that the parameters start from a well-defined state.
- Raises: None