espnet2.gan_tts.hifigan.hifigan.HiFiGANGenerator
class espnet2.gan_tts.hifigan.hifigan.HiFiGANGenerator(in_channels: int = 80, out_channels: int = 1, channels: int = 512, global_channels: int = -1, kernel_size: int = 7, upsample_scales: List[int] = [8, 8, 2, 2], upsample_kernel_sizes: List[int] = [16, 16, 4, 4], resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_additional_convs: bool = True, bias: bool = True, nonlinear_activation: str = 'LeakyReLU', nonlinear_activation_params: Dict[str, Any] = {'negative_slope': 0.1}, use_weight_norm: bool = True)
Bases: Module
HiFiGAN generator module for high-fidelity audio synthesis.
This module implements the HiFi-GAN generator architecture, which generates high-quality audio waveforms from mel-spectrograms. The implementation is based on the ParallelWaveGAN project.
upsample_factor
The total upsampling factor applied to the input.
- Type: int
num_upsamples
The number of upsampling layers.
- Type: int
num_blocks
The number of residual blocks per upsampling layer.
- Type: int
input_conv
The initial convolution layer.
- Type: torch.nn.Conv1d
upsamples
List of upsampling layers.
- Type: torch.nn.ModuleList
blocks
List of residual blocks.
- Type: torch.nn.ModuleList
output_conv
The final convolution and activation layers.
- Type: torch.nn.Sequential
global_conv
Global conditioning convolution layer.
- Type: Optional[torch.nn.Conv1d]
Parameters:
- in_channels (int) – Number of input channels (default: 80).
- out_channels (int) – Number of output channels (default: 1).
- channels (int) – Number of hidden representation channels (default: 512).
- global_channels (int) – Number of global conditioning channels (default: -1).
- kernel_size (int) – Kernel size of initial and final conv layer (default: 7).
- upsample_scales (List[int]) – List of upsampling scales (default: [8, 8, 2, 2]).
- upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers (default: [16, 16, 4, 4]).
- resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks (default: [3, 7, 11]).
- resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks (default: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]).
- use_additional_convs (bool) – Whether to use additional conv layers in residual blocks (default: True).
- bias (bool) – Whether to add bias parameter in convolution layers (default: True).
- nonlinear_activation (str) – Activation function module name (default: "LeakyReLU").
- nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function (default: {"negative_slope": 0.1}).
- use_weight_norm (bool) – Whether to use weight norm (default: True).
Raises:
- AssertionError – If kernel_size is not odd, or if the lengths of upsample_scales, upsample_kernel_sizes, resblock_dilations, and resblock_kernel_sizes do not match.
#### Examples

>>> import torch
>>> generator = HiFiGANGenerator()
>>> mel_spectrogram = torch.randn(1, 80, 100)  # (B, in_channels, T)
>>> output_waveform = generator(mel_spectrogram)
>>> print(output_waveform.shape)  # (B, out_channels, T * upsample_factor)
torch.Size([1, 1, 25600])
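The constructor can also be customized. The values below are illustrative and respect the length-matching constraints noted under Raises above (and the common convention of choosing each upsample kernel size as twice its scale):

>>> generator = HiFiGANGenerator(
...     upsample_scales=[8, 8, 4],          # product = 256
...     upsample_kernel_sizes=[16, 16, 8],  # commonly 2 * scale
...     resblock_kernel_sizes=[3, 7, 11],
...     resblock_dilations=[[1, 3, 5]] * 3,
... )
>>> generator.upsample_factor
256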
#### NOTE

The HiFi-GAN architecture is designed to synthesize high-quality audio from low-dimensional features such as mel-spectrograms. It utilizes residual blocks and upsampling techniques to achieve high fidelity.
Initialize HiFiGANGenerator module.
- Parameters:
- in_channels (int) – Number of input channels.
- out_channels (int) – Number of output channels.
- channels (int) – Number of hidden representation channels.
- global_channels (int) – Number of global conditioning channels.
- kernel_size (int) – Kernel size of initial and final conv layer.
- upsample_scales (List[int]) – List of upsampling scales.
- upsample_kernel_sizes (List[int]) – List of kernel sizes for upsample layers.
- resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks.
- resblock_dilations (List[List[int]]) – List of list of dilations for residual blocks.
- use_additional_convs (bool) – Whether to use additional conv layers in residual blocks.
- bias (bool) – Whether to add bias parameter in convolution layers.
- nonlinear_activation (str) – Activation function module name.
- nonlinear_activation_params (Dict[str, Any]) – Hyperparameters for activation function.
- use_weight_norm (bool) – Whether to use weight norm. If set to True, it will be applied to all of the conv layers.
apply_weight_norm()
Apply weight normalization to all of the layers.
This method applies weight normalization to all convolutional layers in the HiFiGAN generator. Weight normalization can improve training speed and stability by reparameterizing the layer weights.
Note that if the use_weight_norm parameter is set to True, weight normalization is already applied during model initialization, so calling this method manually is only necessary when it was disabled at construction time.
#### Examples
>>> generator = HiFiGANGenerator(use_weight_norm=True)
>>> generator.apply_weight_norm()
#### NOTE

This method logs a debug message for each layer that weight normalization is applied to, aiding in tracking the model's structure during development and debugging.
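As a quick sanity check, one can confirm that the reparameterization took effect. This sketch assumes the standard torch.nn.utils.weight_norm behavior of splitting each weight into weight_g and weight_v parameters:

>>> generator = HiFiGANGenerator(use_weight_norm=True)
>>> any(n.endswith("weight_g") for n, _ in generator.named_parameters())
True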
forward(c: Tensor, g: Tensor | None = None) → Tensor
Calculate forward propagation.
This method computes the forward pass of the HiFiGAN generator by processing the input tensor through several convolutional layers, upsampling layers, and residual blocks. If a global conditioning tensor is provided, it will be added to the processed input before proceeding through the network.
- Parameters:
- c (torch.Tensor) – Input tensor of shape (B, in_channels, T), where B is the batch size, in_channels is the number of input channels, and T is the length of the input sequence.
- g (Optional[torch.Tensor]) – Global conditioning tensor of shape (B, global_channels, 1). This tensor is optional; if provided, it is projected by the global conditioning convolution and added to the hidden representation after the initial convolution.
- Returns: Output tensor of shape (B, out_channels, T * upsample_factor), where out_channels is the number of output channels.
- Return type: torch.Tensor
#### Examples

>>> import torch
>>> generator = HiFiGANGenerator()
>>> input_tensor = torch.randn(1, 80, 100)  # (B, in_channels, T)
>>> output_tensor = generator(input_tensor)
>>> print(output_tensor.shape)  # (B, out_channels, T * upsample_factor)
torch.Size([1, 1, 25600])
#### NOTE

The input tensor must have the number of channels specified during the initialization of the HiFiGANGenerator. If provided, the global conditioning tensor must have the same batch size as the input tensor.
- Raises: AssertionError – If the input tensor does not match the expected shape, or if the global conditioning tensor has an incompatible shape.
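For completeness, a sketch of forward propagation with global conditioning; global_channels=8 is an arbitrary illustrative value, and the generator must be constructed with global_channels > 0 so that the global conditioning branch exists:

>>> import torch
>>> generator = HiFiGANGenerator(global_channels=8)
>>> c = torch.randn(2, 80, 50)  # (B, in_channels, T)
>>> g = torch.randn(2, 8, 1)    # (B, global_channels, 1)
>>> y = generator(c, g)
>>> print(y.shape)              # (B, out_channels, T * upsample_factor)
torch.Size([2, 1, 12800])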
inference(c: Tensor, g: Tensor | None = None) → Tensor
Perform inference using the HiFiGAN generator.
This method processes the input tensor and optionally incorporates global conditioning to produce an output tensor. The input tensor should be in the format (T, in_channels), where T is the time dimension. If a global conditioning tensor is provided, it should have the shape (global_channels, 1).
- Parameters:
- c (torch.Tensor) – Input tensor with shape (T, in_channels).
- g (Optional[torch.Tensor]) – Global conditioning tensor with shape (global_channels, 1). This tensor is optional and can be set to None.
- Returns: Output tensor with shape (T * upsample_factor, out_channels), where upsample_factor is the product of the upsampling scales.
- Return type: torch.Tensor
#### Examples

>>> import torch
>>> generator = HiFiGANGenerator()
>>> input_tensor = torch.randn(100, 80)  # (T, in_channels)
>>> output_tensor = generator.inference(input_tensor)
>>> print(output_tensor.shape)  # (T * upsample_factor, out_channels)
torch.Size([25600, 1])
>>> gc_generator = HiFiGANGenerator(global_channels=8)
>>> global_conditioning = torch.randn(8, 1)  # (global_channels, 1)
>>> output_with_gc = gc_generator.inference(input_tensor, global_conditioning)
>>> print(output_with_gc.shape)
torch.Size([25600, 1])
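Conceptually, inference is a thin wrapper around forward that handles the unbatched, time-first layout. Assuming that wrapper behavior, the following equivalence should hold:

>>> import torch
>>> generator = HiFiGANGenerator()
>>> c = torch.randn(100, 80)  # (T, in_channels)
>>> y1 = generator.inference(c)
>>> y2 = generator(c.transpose(1, 0).unsqueeze(0)).squeeze(0).transpose(1, 0)
>>> torch.allclose(y1, y2)
True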
remove_weight_norm()
Remove weight normalization from all of the layers.
This method traverses all layers of the model and removes the weight normalization applied to convolutional layers. If a layer does not have weight normalization applied, the resulting ValueError is caught and traversal continues without raising an exception.
#### NOTE

This method is useful for models that were trained with weight normalization and need to revert to the standard weight parameters for compatibility or performance reasons.
#### Examples
>>> generator = HiFiGANGenerator()
>>> generator.apply_weight_norm() # Apply weight normalization
>>> generator.remove_weight_norm() # Remove weight normalization
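A typical usage pattern, sketched below, is to strip weight normalization once training is finished, since the reparameterization is only needed during optimization:

>>> generator = HiFiGANGenerator(use_weight_norm=True)
>>> # ... train the generator ...
>>> generator.remove_weight_norm()  # fold weight_g/weight_v back into weight
>>> generator.eval()                # then run inference as usual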
reset_parameters()
Reset parameters of the HiFiGANGenerator.
This initialization follows the official HiFi-GAN implementation. The weights of Conv1d and ConvTranspose1d modules are initialized from a normal distribution with mean 0 and standard deviation 0.01, resetting the model's parameters to a known state, which is useful for experimentation or retraining.
#### Examples
>>> generator = HiFiGANGenerator()
>>> generator.reset_parameters() # Resets parameters to default
#### NOTE

This method is typically called during the initialization of the generator, but it can also be called manually to reset the parameters at any time during the model's lifecycle.
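The documented initialization can be checked empirically. This sketch disables weight norm so that the weight tensor of input_conv (a plain Conv1d, as listed in the attributes above) is directly inspectable:

>>> generator = HiFiGANGenerator(use_weight_norm=False)
>>> generator.reset_parameters()
>>> print(round(generator.input_conv.weight.std().item(), 3))
0.01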