espnet2.gan_svs.vits.pitch_predictor.Decoder
class espnet2.gan_svs.vits.pitch_predictor.Decoder(out_channels: int = 192, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, pw_layer_type: str = 'conv1d', pw_conv_kernel_size: int = 3, pos_enc_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, global_channels: int = -1)
Bases: Module
Pitch or Mel decoder module in VISinger 2.
This class implements the decoder used in the VISinger 2 model for pitch or Mel prediction. It consists of a convolutional prenet, a stack of self-attention (optionally Conformer-style) encoder blocks, and a final projection layer, with optional global conditioning.
prenet
The initial convolutional layer to process input.
- Type: torch.nn.Conv1d
decoder
The main attention-based encoder block.
- Type: Encoder
proj
The final projection layer to generate output.
- Type: torch.nn.Conv1d
global_conv
Convolutional layer for global conditioning if global_channels > 0.
- Type: torch.nn.Conv1d, optional
Parameters:
- out_channels (int) – The output dimension of the module.
- attention_dim (int) – The dimension of the attention mechanism.
- attention_heads (int) – The number of attention heads.
- linear_units (int) – The number of units in the linear layer.
- blocks (int) – The number of encoder blocks.
- pw_layer_type (str) – The type of position-wise layer to use.
- pw_conv_kernel_size (int) – The kernel size of the position-wise convolutional layer.
- pos_enc_layer_type (str) – The type of positional encoding layer to use.
- self_attention_layer_type (str) – The type of self-attention layer to use.
- activation_type (str) – The type of activation function to use.
- normalize_before (bool) – Whether to normalize the data before the position-wise layer or after.
- use_macaron_style (bool) – Whether to use the macaron style or not.
- use_conformer_conv (bool) – Whether to use Conformer style conv or not.
- conformer_kernel_size (int) – The kernel size of the conformer convolutional layer.
- dropout_rate (float) – The dropout rate to use.
- positional_dropout_rate (float) – The positional dropout rate to use.
- attention_dropout_rate (float) – The attention dropout rate to use.
- global_channels (int) – The number of channels to use for global conditioning.
Returns: None
####### Examples
>>> decoder = Decoder(out_channels=256, attention_dim=128, global_channels=256)
>>> output, mask = decoder(input_tensor, input_lengths, g=global_tensor)
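The placeholder tensors above are not defined on this page; as a minimal sketch, assuming the shapes documented for forward() below, they could be built as follows:
>>> import torch
>>> input_tensor = torch.randn(4, 2 + 128, 50)      # (B, 2 + attention_dim, T)
>>> input_lengths = torch.tensor([50, 50, 40, 30])   # (B,)
>>> global_tensor = torch.randn(4, 256, 1)           # (B, global_channels, 1)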
NOTE
The Decoder is designed to work in conjunction with the Encoder and other components of the VISinger 2 architecture.
Initialize Decoder in VISinger 2.
forward(x, x_lengths, g=None)
Forward pass of the Decoder.
This method processes the input tensor through the decoder and produces the output tensor and the corresponding output mask, optionally incorporating a global conditioning input.
- Parameters:
- x (Tensor) – Input tensor of shape (B, 2 + attention_dim, T).
- x_lengths (Tensor) – Length tensor of shape (B,).
- g (Tensor , optional) – Global conditioning tensor of shape (B, global_channels, 1).
- Returns: A tuple containing:
  - Output tensor of shape (B, 1, T).
  - Output mask of shape (B, 1, T).
- Return type: Tuple[Tensor, Tensor]
####### Examples
>>> import torch
>>> decoder = Decoder()
>>> x = torch.randn(8, 194, 100)  # (B, 2 + attention_dim, T) with default attention_dim=192
>>> x_lengths = torch.tensor([100] * 8)  # all sequences are of length 100
>>> output, mask = decoder(x, x_lengths)
NOTE
The input tensor ‘x’ must include the attention dimension plus 2 additional channels. The optional global conditioning tensor ‘g’ is only applied when the module is constructed with global_channels > 0.
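As a minimal sketch of the note above (assuming global_channels is set at construction; the tensor values are placeholders), global conditioning can be passed as follows:
>>> import torch
>>> decoder = Decoder(global_channels=256)
>>> x = torch.randn(8, 2 + 192, 100)     # default attention_dim=192
>>> x_lengths = torch.tensor([100] * 8)
>>> g = torch.randn(8, 256, 1)           # global conditioning, e.g. a speaker embedding
>>> output, mask = decoder(x, x_lengths, g=g)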