espnet2.gan_tts.vits.posterior_encoder.PosteriorEncoder
class espnet2.gan_tts.vits.posterior_encoder.PosteriorEncoder(in_channels: int = 513, out_channels: int = 192, hidden_channels: int = 192, kernel_size: int = 5, layers: int = 16, stacks: int = 1, base_dilation: int = 1, global_channels: int = -1, dropout_rate: float = 0.0, bias: bool = True, use_weight_norm: bool = True)
Bases: Module
Posterior encoder module in VITS.
This code is based on https://github.com/jaywalnut310/vits.
This module implements the posterior encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
input_conv
1D convolutional layer for input processing.
- Type: Conv1d
encoder
WaveNet network that encodes the input.
- Type: WaveNet
proj
1D convolutional layer that projects the encoder output to the posterior mean and scale statistics.
- Type: Conv1d
Parameters:
- in_channels (int) – Number of input channels.
- out_channels (int) – Number of output channels.
- hidden_channels (int) – Number of hidden channels.
- kernel_size (int) – Kernel size in WaveNet.
- layers (int) – Number of layers of WaveNet.
- stacks (int) – Number of repeat stacking of WaveNet.
- base_dilation (int) – Base dilation factor.
- global_channels (int) – Number of global conditioning channels.
- dropout_rate (float) – Dropout rate.
- bias (bool) – Whether to use bias parameters in conv.
- use_weight_norm (bool) – Whether to apply weight norm.
Examples
>>> encoder = PosteriorEncoder()
>>> x = torch.randn(8, 513, 100) # Example input tensor
>>> x_lengths = torch.tensor([100] * 8) # Lengths of each sequence
>>> z, m, logs, mask = encoder(x, x_lengths)
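The defaults above can be overridden at construction time. Below is a minimal sketch with global conditioning enabled; the channel sizes and the speaker-embedding interpretation of g are illustrative assumptions, not requirements of the API:
>>> encoder = PosteriorEncoder(
...     in_channels=80,        # illustrative: e.g. a mel-spectrogram front end
...     out_channels=96,
...     hidden_channels=96,
...     global_channels=256,   # enable global conditioning
... )
>>> x = torch.randn(8, 80, 100)
>>> x_lengths = torch.tensor([100] * 8)
>>> g = torch.randn(8, 256, 1)  # illustrative: speaker embedding
>>> z, m, logs, mask = encoder(x, x_lengths, g=g)
>>> z.shape
torch.Size([8, 96, 100])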
- Returns:
  - Encoded hidden representation tensor (B, out_channels, T_feats).
  - Projected mean tensor (B, out_channels, T_feats).
  - Projected scale tensor (B, out_channels, T_feats).
  - Mask tensor for the input tensor (B, 1, T_feats).
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
- Raises:
  - ValueError – If the dimensions of the input tensor do not match the expected dimensions based on in_channels and x_lengths.
Initialize the PosteriorEncoder module.
forward(x: Tensor, x_lengths: Tensor, g: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor, Tensor]
Calculate forward propagation.
This method computes the forward pass of the PosteriorEncoder, taking an input feature tensor and producing an encoded latent representation together with the corresponding mean, scale, and mask tensors. It implements the posterior encoding step of the conditional variational autoencoder used for text-to-speech.
- Parameters:
- x (Tensor) – Input tensor of shape (B, in_channels, T_feats), where B is the batch size, in_channels is the number of input channels, and T_feats is the number of feature frames.
- x_lengths (Tensor) – Length tensor of shape (B,) that indicates the valid length of each sequence in the batch.
- g (Optional[Tensor]) – Global conditioning tensor of shape (B, global_channels, 1). This tensor provides additional conditioning information (e.g. a speaker embedding), if available.
- Returns: A tuple containing:
  - Tensor: Encoded hidden representation tensor of shape (B, out_channels, T_feats).
  - Tensor: Projected mean tensor of shape (B, out_channels, T_feats).
  - Tensor: Projected scale tensor of shape (B, out_channels, T_feats).
  - Tensor: Mask tensor for the input tensor of shape (B, 1, T_feats).
- Return type: Tuple[Tensor, Tensor, Tensor, Tensor]
Examples
>>> encoder = PosteriorEncoder()
>>> x = torch.randn(8, 513, 100) # Batch of 8, 513 input channels, 100 frames
>>> x_lengths = torch.tensor([100, 80, 100, 100, 50, 100, 70, 100])
>>> z, m, logs, mask = encoder(x, x_lengths)
>>> print(z.shape) # Should be (8, 192, 100)
>>> print(m.shape) # Should be (8, 192, 100)
>>> print(logs.shape) # Should be (8, 192, 100)
>>> print(mask.shape) # Should be (8, 1, 100)
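The projected statistics parameterize the approximate posterior q(z | x) = N(m, exp(logs)^2), and z is sampled via the reparameterization trick with the padding mask applied. Below is a minimal sketch of redrawing a sample, assuming the sampling convention of the upstream VITS code:
>>> # z = (m + eps * exp(logs)) * mask, with eps ~ N(0, I)
>>> eps = torch.randn_like(m)
>>> z_new = (m + eps * torch.exp(logs)) * mask
>>> bool((z_new[1, :, 80:] == 0).all())  # sequence 1 has length 80, so padding stays zero
True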