espnet2.gan_tts.vits.residual_coupling.ResidualAffineCouplingLayer


class espnet2.gan_tts.vits.residual_coupling.ResidualAffineCouplingLayer(in_channels: int = 192, hidden_channels: int = 192, kernel_size: int = 5, base_dilation: int = 1, layers: int = 5, stacks: int = 1, global_channels: int = -1, dropout_rate: float = 0.0, use_weight_norm: bool = True, bias: bool = True, use_only_mean: bool = True)

Bases: Module

Residual affine coupling layer for use in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

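This layer is a component of the VITS model and implements the residual affine coupling mechanism. It splits the input channels into two halves, passes the first half through unchanged, and applies an affine transform to the second half using statistics predicted from the first half by a WaveNet. Because the first half is left untouched, the same statistics can be recomputed from the output, so the transformation can be inverted exactly.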
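The coupling mechanism can be illustrated with a small, dependency-free sketch. It assumes the standard affine-coupling form (first half passed through, second half shifted and scaled by statistics predicted from the first half). `predict_stats` and `coupling` are hypothetical names, `predict_stats` is a toy stand-in for the WaveNet, and plain Python lists stand in for tensors, so this is not the ESPnet implementation.

```python
import math


def predict_stats(xa, use_only_mean=True):
    """Hypothetical stand-in for the WaveNet: map xa to (mean, log_scale)."""
    mean = [0.5 * v for v in xa]
    # With use_only_mean, the log-scale is fixed to zero (pure shift).
    log_scale = [0.0] * len(xa) if use_only_mean else [0.1 * v for v in xa]
    return mean, log_scale


def coupling(x, inverse=False, use_only_mean=True):
    """Toy affine coupling over a flat list with an even number of elements."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]  # xa passes through unchanged
    mean, log_scale = predict_stats(xa, use_only_mean)
    if not inverse:
        yb = [m + b * math.exp(s) for m, b, s in zip(mean, xb, log_scale)]
        logdet = sum(log_scale)  # log|det J| of the forward transform
        return xa + yb, logdet
    # Inverse: xa is unchanged, so the same statistics can be recomputed.
    yb = [(b - m) * math.exp(-s) for m, b, s in zip(mean, xb, log_scale)]
    return xa + yb


x = [1.0, -2.0, 0.3, 4.0]
y, logdet = coupling(x, use_only_mean=False)
x_rec = coupling(y, inverse=True, use_only_mean=False)
y_shift, logdet_shift = coupling(x, use_only_mean=True)
```

In this sketch, `x_rec` matches `x` up to floating-point error. With `use_only_mean=True` the transform reduces to a shift, so the log-determinant is zero.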

half_channels

Half of the input channels.

Type: int

use_only_mean

Flag to determine if only the mean is estimated.

Type: bool
Parameters:
- in_channels (int) – Number of input channels (must be divisible by 2).
- hidden_channels (int) – Number of hidden channels for the WaveNet.
- kernel_size (int) – Kernel size for the WaveNet convolution.
- base_dilation (int) – Base dilation factor for the WaveNet.
- layers (int) – Number of layers in the WaveNet.
- stacks (int) – Number of stacks in the WaveNet.
- global_channels (int) – Number of global conditioning channels.
- dropout_rate (float) – Dropout rate applied to the WaveNet.
- use_weight_norm (bool) – Flag to use weight normalization in the WaveNet.
- bias (bool) – Flag to include bias parameters in the WaveNet.
- use_only_mean (bool) – Flag to determine if only the mean is estimated.
Returns: forward() returns the output tensor together with the log-determinant tensor for the negative log-likelihood (NLL) when inverse is False, and the output tensor alone when inverse is True.
Return type: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Examples

>>> import torch
>>> from espnet2.gan_tts.vits.residual_coupling import ResidualAffineCouplingLayer
>>> layer = ResidualAffineCouplingLayer(in_channels=192)
>>> x = torch.randn(1, 192, 10)  # Batch size 1, 192 channels, length 10
>>> x_mask = torch.ones(1, 1, 10)  # No masking
>>> output, logdet = layer(x, x_mask, inverse=False)  # Forward pass
>>> inverse_output = layer(output, x_mask, inverse=True)  # Inverse pass

NOTE

This implementation is based on the VITS model and follows the architecture outlined in the paper: https://arxiv.org/abs/2106.06103.

Initialize ResidualAffineCouplingLayer module.

Parameters:
- in_channels (int) – Number of input channels.
- hidden_channels (int) – Number of hidden channels.
- kernel_size (int) – Kernel size for WaveNet.
- base_dilation (int) – Base dilation factor for WaveNet.
- layers (int) – Number of layers of WaveNet.
- stacks (int) – Number of stacks of WaveNet.
- global_channels (int) – Number of global channels.
- dropout_rate (float) – Dropout rate.
- use_weight_norm (bool) – Whether to use weight normalization in WaveNet.
- bias (bool) – Whether to use bias parameters in WaveNet.
- use_only_mean (bool) – Whether to estimate only mean.

forward(x: Tensor, x_mask: Tensor, g: Tensor | None = None, inverse: bool = False) → Tensor | Tuple[Tensor, Tensor]

Calculate forward propagation.

This method performs the forward pass of the residual affine coupling layer. Depending on the inverse flag, it can either apply the flow or its inverse.

Parameters:
- x (Tensor) – Input tensor of shape (B, in_channels, T).
- x_mask (Tensor) – Mask tensor of shape (B, 1, T), with 1 for valid frames and 0 for padding.
- g (Optional[Tensor]) – Global conditioning tensor of shape (B, global_channels, 1). Default is None.
- inverse (bool) – If True, apply the inverse flow. Default is False.
Returns:
- If inverse is False: a tuple of the output tensor of shape (B, in_channels, T) and the log-determinant tensor for NLL of shape (B,).
- If inverse is True: Output tensor of shape (B, in_channels, T).
Return type: Union[Tensor, Tuple[Tensor, Tensor]]
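The examples on this page pass an all-ones `x_mask` of shape (B, 1, T). For padded batches, a mask of that shape is typically derived from per-utterance lengths. The sketch below shows the idea with nested lists so that it stays dependency-free. `make_mask` is a hypothetical helper, not part of ESPnet; in PyTorch the same comparison would usually be written as a broadcast over a frame index.

```python
def make_mask(lengths, max_len=None):
    """Build a nested-list mask of shape (B, 1, T).

    Entries are 1.0 for valid frames and 0.0 for padded frames.
    """
    if max_len is None:
        max_len = max(lengths)
    return [[[1.0 if t < n else 0.0 for t in range(max_len)]] for n in lengths]


# Two utterances with 3 and 1 valid frames, padded to length 4.
mask = make_mask([3, 1], max_len=4)
```

Multiplying the layer's intermediate tensors by such a mask keeps padded frames at zero, so they do not contribute to the output.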

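Examples

>>> import torch
>>> from espnet2.gan_tts.vits.residual_coupling import ResidualAffineCouplingLayer
>>> layer = ResidualAffineCouplingLayer()
>>> x = torch.randn(8, 192, 100)  # Input of shape (B, in_channels, T)
>>> x_mask = torch.ones(8, 1, 100)  # Mask of shape (B, 1, T), no padding
>>> output, logdet = layer(x, x_mask, inverse=False)  # Forward flow returns a tuple
>>> inverse_output = layer(output, x_mask, inverse=True)  # Inverse flow applied to the forward output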