espnet2.gan_tts.vits.text_encoder.TextEncoder
class espnet2.gan_tts.vits.text_encoder.TextEncoder(vocabs: int, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0)
Bases: Module
Text encoder module in VITS.
This module implements the text encoder described in "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" (VITS). Instead of the relative positional Transformer used in the original paper, it employs a Conformer-based encoder, which incorporates additional convolution layers.
attention_dim
Dimension of the attention mechanism.
- Type: int
emb
Embedding layer for input indices.
- Type: torch.nn.Embedding
encoder
Encoder module implementing the conformer architecture.
- Type: Encoder
proj
Convolution layer projecting encoder outputs to the mean and scale statistics.
- Type: torch.nn.Conv1d
Parameters:
- vocabs (int) – Vocabulary size.
- attention_dim (int) – Attention dimension.
- attention_heads (int) – Number of attention heads.
- linear_units (int) – Number of linear units in positionwise layers.
- blocks (int) – Number of encoder blocks.
- positionwise_layer_type (str) – Type of positionwise layer.
- positionwise_conv_kernel_size (int) – Kernel size for positionwise layers.
- positional_encoding_layer_type (str) – Type of positional encoding layer.
- self_attention_layer_type (str) – Type of self-attention layer.
- activation_type (str) – Type of activation function.
- normalize_before (bool) – If True, applies LayerNorm before attention.
- use_macaron_style (bool) – If True, uses macaron style components.
- use_conformer_conv (bool) – If True, uses conformer convolution layers.
- conformer_kernel_size (int) – Kernel size for conformer convolution.
- dropout_rate (float) – Dropout rate for layers.
- positional_dropout_rate (float) – Dropout rate for positional encoding.
- attention_dropout_rate (float) – Dropout rate for attention layers.
###### Examples
>>> import torch
>>> from espnet2.gan_tts.vits.text_encoder import TextEncoder
>>> encoder = TextEncoder(vocabs=5000)
>>> input_tensor = torch.randint(0, 5000, (32, 100)) # (B, T_text)
>>> input_lengths = torch.randint(1, 101, (32,)) # (B,)
>>> outputs = encoder(input_tensor, input_lengths)
>>> encoded, mean, logs, mask = outputs
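The conformer-specific behaviour is governed by the constructor flags listed above. A minimal sketch enabling them (option values chosen for illustration, reusing the tensors from the previous example):
>>> conformer_encoder = TextEncoder(
...     vocabs=5000,
...     use_macaron_style=True,   # macaron-style feed-forward modules
...     use_conformer_conv=True,  # enable the conformer convolution module
...     conformer_kernel_size=7,  # depthwise convolution kernel size
... )
>>> encoded, mean, logs, mask = conformer_encoder(input_tensor, input_lengths)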
- Raises: ValueError – If the input tensor or lengths are not valid.
forward(x: Tensor, x_lengths: Tensor) → Tuple[Tensor, Tensor, Tensor, Tensor]
Calculate forward propagation through the TextEncoder module.
This method embeds the input token indices, passes the result through the encoder, and returns the encoded hidden representation, the projected mean, the projected scale, and the mask tensor.
- Parameters:
- x (Tensor) – Input index tensor with shape (B, T_text), where B is the batch size and T_text is the length of the input text.
- x_lengths (Tensor) – Length tensor with shape (B,), representing the actual lengths of the sequences in the batch.
- Returns: A tuple containing:
  - Tensor: Encoded hidden representation with shape (B, attention_dim, T_text).
  - Tensor: Projected mean tensor with shape (B, attention_dim, T_text).
  - Tensor: Projected scale tensor with shape (B, attention_dim, T_text).
  - Tensor: Mask tensor for the input with shape (B, 1, T_text).
- Return type: Tuple[Tensor, Tensor, Tensor, Tensor]
###### Examples
>>> import torch
>>> from espnet2.gan_tts.vits.text_encoder import TextEncoder
>>> encoder = TextEncoder(vocabs=1000)
>>> input_tensor = torch.randint(0, 1000, (32, 10))  # Batch of 32, length 10
>>> input_lengths = torch.tensor([10] * 32)  # All sequences are of length 10
>>> output = encoder(input_tensor, input_lengths)
>>> encoded_representation, projected_mean, projected_scale, mask = output
>>> print(encoded_representation.shape)  # torch.Size([32, 192, 10]) with the default attention_dim=192
>>> print(projected_mean.shape)  # torch.Size([32, 192, 10])
>>> print(projected_scale.shape)  # torch.Size([32, 192, 10])
>>> print(mask.shape)  # torch.Size([32, 1, 10])
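In VITS, the projected mean and scale parameterize the prior distribution over the latent variables. A minimal sketch of drawing a masked sample from that prior, assuming projected_scale holds log standard deviations as in the VITS paper:
>>> noise = torch.randn_like(projected_mean)
>>> z = (projected_mean + noise * torch.exp(projected_scale)) * mask
>>> print(z.shape)  # torch.Size([32, 192, 10])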