espnet2.gan_tts.melgan.melgan.MelGANGenerator
class espnet2.gan_tts.melgan.melgan.MelGANGenerator(in_channels: int = 80, out_channels: int = 1, kernel_size: int = 7, channels: int = 512, bias: bool = True, upsample_scales: List[int] = [8, 8, 2, 2], stack_kernel_size: int = 3, stacks: int = 3, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.2}, pad: str = 'ReflectionPad1d', pad_params: Dict[str, Any] = {}, use_final_nonlinear_activation: bool = True, use_weight_norm: bool = True)
Bases: Module
MelGAN generator module.
This module implements the MelGAN generator architecture, which is designed for generating audio waveforms from Mel spectrograms. It utilizes a series of convolutional layers, upsampling layers, and residual connections to produce high-quality audio outputs.
melgan
The sequential model containing the convolutional, upsampling, and residual-stack layers of the generator.
Type: torch.nn.Sequential
Parameters:
- in_channels (int) – Number of input channels (default: 80).
- out_channels (int) – Number of output channels (default: 1).
- kernel_size (int) – Kernel size of initial and final conv layer (default: 7).
- channels (int) – Initial number of channels for conv layer (default: 512).
- bias (bool) – Whether to add bias parameter in convolution layers (default: True).
- upsample_scales (List[int]) – List of upsampling scales (default: [8, 8, 2, 2]).
- stack_kernel_size (int) – Kernel size of dilated conv layers in residual stack (default: 3).
- stacks (int) – Number of stacks in a single residual stack (default: 3).
- nonlinear_activation (str) – Activation function module name (default: “LeakyReLU”).
- nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function (default: {“negative_slope”: 0.2}).
- pad (str) – Padding function module name before dilated convolution layer (default: “ReflectionPad1d”).
- pad_params (Dict[str, Any]) – Hyperparameters for padding function (default: {}).
- use_final_nonlinear_activation (bool) – Whether to use final activation function (default: True).
- use_weight_norm (bool) – Whether to use weight normalization (default: True).
Raises: AssertionError – If hyperparameters are invalid, e.g., an even kernel size or a number of channels incompatible with the upsampling scales.
############### Examples
>>> generator = MelGANGenerator(in_channels=80, out_channels=1)
>>> input_tensor = torch.randn(1, 80, 100) # Batch size of 1, 80 channels, length 100
>>> output_tensor = generator(input_tensor)
>>> print(output_tensor.shape) # Should output: torch.Size([1, 1, 25600]), since prod([8, 8, 2, 2]) = 256
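The total upsampling factor is the product of upsample_scales. A minimal sketch of the length arithmetic, assuming the default scales [8, 8, 2, 2]:
>>> import math
>>> upsample_scales = [8, 8, 2, 2]  # default configuration
>>> math.prod(upsample_scales)  # total upsampling factor
256
>>> 100 * math.prod(upsample_scales)  # output samples for 100 input frames
25600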
Initialize MelGANGenerator module.
- Parameters:
- in_channels (int) – Number of input channels.
- out_channels (int) – Number of output channels.
- kernel_size (int) – Kernel size of initial and final conv layer.
- channels (int) – Initial number of channels for conv layer.
- bias (bool) – Whether to add bias parameter in convolution layers.
- upsample_scales (List[int]) – List of upsampling scales.
- stack_kernel_size (int) – Kernel size of dilated conv layers in residual stack.
- stacks (int) – Number of stacks in a single residual stack.
- nonlinear_activation (str) – Activation function module name.
- nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.
- pad (str) – Padding function module name before dilated convolution layer.
- pad_params (Dict[str, Any]) – Hyperparameters for padding function.
- use_final_nonlinear_activation (bool) – Whether to use the nonlinear activation function for the final layer.
- use_weight_norm (bool) – Whether to use weight norm. If set to True, it will be applied to all of the conv layers.
apply_weight_norm()
Apply weight normalization to all of the layers.
This method iterates through all the layers of the model and applies weight normalization to each convolutional layer. Weight normalization can help in stabilizing the training of deep networks by improving the conditioning of the optimization problem.
NOTE
This method is called during the initialization of the model if use_weight_norm is set to True.
############### Examples
>>> model = MelGANGenerator(use_weight_norm=True)
>>> model.apply_weight_norm()
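A minimal sketch of the underlying pattern (not the module's exact code): weight normalization is applied recursively to every (transposed) 1D convolution via torch.nn.Module.apply:
>>> import torch
>>> def _apply_weight_norm(m):
...     # Only (transposed) 1D convolutions receive weight normalization.
...     if isinstance(m, (torch.nn.Conv1d, torch.nn.ConvTranspose1d)):
...         torch.nn.utils.weight_norm(m)
>>> net = torch.nn.Sequential(torch.nn.Conv1d(80, 512, 7))
>>> _ = net.apply(_apply_weight_norm)  # applies weight norm to the Conv1d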
forward(c: Tensor) → Tensor
Calculate forward propagation.
This method computes the forward pass of the MelGAN generator. It takes an input tensor and processes it through the generator network to produce an output tensor.
Parameters: c (Tensor) – Input tensor of shape (B, in_channels, T), where:
- B is the batch size.
- in_channels is the number of input channels (Mel bands).
- T is the sequence length.
Returns: Output tensor of shape (B, 1, T * prod(upsample_scales)); the output has a single channel, and its length is the input length multiplied by the total upsampling factor.
Return type: Tensor
############### Examples
>>> generator = MelGANGenerator()
>>> input_tensor = torch.randn(8, 80, 100) # Example input
>>> output_tensor = generator(input_tensor)
>>> print(output_tensor.shape)
torch.Size([8, 1, 25600]) # Example output shape with the default upsample_scales
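The output length can be checked against the configured scales. A short sketch, assuming the default upsample_scales of [8, 8, 2, 2]:
>>> import math
>>> generator = MelGANGenerator()
>>> c = torch.randn(8, 80, 100)
>>> generator(c).shape[2] == c.shape[2] * math.prod([8, 8, 2, 2])
True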
inference(c: Tensor) → Tensor
Perform inference.
This method processes the input tensor through the MelGAN generator to produce an output tensor, which can be used for audio synthesis.
- Parameters:c (Tensor) – Input tensor of shape (T, in_channels) where T is the length of the input sequence and in_channels is the number of input channels (typically the number of Mel frequency bands).
- Returns: Output tensor of shape (T * prod(upsample_scales), out_channels) where out_channels is the number of output channels (typically 1 for mono audio).
- Return type: Tensor
############### Examples
>>> generator = MelGANGenerator()
>>> mel_input = torch.randn(100, 80) # Example input tensor
>>> output = generator.inference(mel_input)
>>> print(output.shape) # Should print torch.Size([25600, 1]) if upsample_scales = [8, 8, 2, 2]
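Note that inference expects a time-first layout while forward expects a batch-first one. A sketch of converting between the two conventions (the tensor shapes are illustrative):
>>> mel = torch.randn(100, 80)  # (T, in_channels), as expected by inference
>>> batched = mel.transpose(1, 0).unsqueeze(0)  # (1, in_channels, T), as expected by forward
>>> batched.shape
torch.Size([1, 80, 100])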
remove_weight_norm()
Remove weight normalization module from all of the layers.
This method iterates through all layers of the MelGAN generator and removes weight normalization from each layer that has it applied. Weight normalization is a technique used to stabilize the training of deep networks, but there may be cases where it is desirable to remove it.
It utilizes the torch.nn.utils.remove_weight_norm function, which raises a ValueError if the module does not have weight normalization applied. This method handles that exception and logs the removal of weight normalization.
############### Examples
>>> generator = MelGANGenerator()
>>> generator.apply_weight_norm() # Apply weight normalization first
>>> generator.remove_weight_norm() # Now remove weight normalization
NOTE
This function modifies the internal state of the model. Ensure that the model is in the appropriate state before calling this method.
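A minimal sketch of the removal pattern described above (not the module's exact code):
>>> import torch
>>> def _remove_weight_norm(m):
...     try:
...         torch.nn.utils.remove_weight_norm(m)
...     except ValueError:
...         pass  # this layer has no weight normalization applied
>>> conv = torch.nn.utils.weight_norm(torch.nn.Conv1d(80, 512, 7))
>>> _remove_weight_norm(conv)  # removes the weight norm reparametrization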
reset_parameters()
Reset parameters.
This method reinitializes the weights of the convolutional layers in the MelGAN generator according to the official implementation. It uses a normal distribution with a mean of 0 and a standard deviation of 0.02 for the weights of the Conv1d and ConvTranspose1d layers. This can be useful for ensuring that the model starts training with reasonable weight values.
This initialization follows the manner of the official implementation: https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py
############### Examples
>>> generator = MelGANGenerator()
>>> generator.reset_parameters() # Reinitialize weights of layers
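A minimal sketch of the initialization pattern described above, assuming the normal distribution with mean 0 and standard deviation 0.02 from the official implementation (not the module's exact code):
>>> import torch
>>> def _reset_parameters(m):
...     if isinstance(m, (torch.nn.Conv1d, torch.nn.ConvTranspose1d)):
...         m.weight.data.normal_(0.0, 0.02)  # N(0, 0.02) weight init
>>> conv = torch.nn.Conv1d(80, 512, 7)
>>> _reset_parameters(conv)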