espnet2.gan_svs.vits.text_encoder.TextEncoder
class espnet2.gan_svs.vits.text_encoder.TextEncoder(vocabs: int, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, use_slur=True)
Bases: Module
TextEncoder class for encoding text input in the VISinger model.
This module implements the text encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Instead of the relative positional Transformer used in that paper, it uses a conformer architecture, which adds convolution layers to the standard attention mechanism.
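A minimal construction sketch: vocabs=5000 is a hypothetical vocabulary size, the other keyword values are the documented defaults, and the conformer convolution option is enabled here purely for illustration:

>>> from espnet2.gan_svs.vits.text_encoder import TextEncoder
>>> encoder = TextEncoder(
...     vocabs=5000,                # hypothetical vocabulary size
...     attention_dim=192,          # default attention dimension
...     attention_heads=2,          # default number of attention heads
...     use_conformer_conv=True,    # enable the conformer convolution module
...     conformer_kernel_size=7,    # default conformer conv kernel size
... )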
attention_dim
The dimension of attention.
- Type: int
encoder
The conformer-based encoder module.
- Type: Encoder
emb_phone_dim
The dimension of phone embeddings.
- Type: int
emb_phone
Embedding layer for phone inputs.
- Type: torch.nn.Embedding
emb_pitch_dim
The dimension of pitch embeddings.
- Type: int
emb_pitch
Embedding layer for pitch inputs.
- Type: torch.nn.Embedding
emb_slur
Embedding layer for slur inputs.
- Type: Optional[torch.nn.Embedding]
emb_dur
Linear layer for duration inputs.
- Type: torch.nn.Linear
pre_net
Preprocessing layer for the main input.
- Type: torch.nn.Linear
pre_dur_net
Preprocessing layer for duration input.
- Type: torch.nn.Linear
proj
Convolutional layer for projection.
- Type: torch.nn.Conv1d
proj_pitch
Convolutional layer for pitch projection.
- Type: torch.nn.Conv1d
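Each of these submodules is exposed as an attribute, so the layer types can be checked directly; a short sketch using the attribute names listed above:

>>> import torch
>>> from espnet2.gan_svs.vits.text_encoder import TextEncoder
>>> encoder = TextEncoder(vocabs=5000)
>>> isinstance(encoder.emb_phone, torch.nn.Embedding)
True
>>> isinstance(encoder.emb_dur, torch.nn.Linear)
True
>>> isinstance(encoder.proj, torch.nn.Conv1d)
True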
Parameters:
- vocabs (int) – Vocabulary size.
- attention_dim (int) – Dimension of the attention mechanism.
- attention_heads (int) – Number of attention heads.
- linear_units (int) – Number of linear units in positionwise layers.
- blocks (int) – Number of encoder blocks.
- positionwise_layer_type (str) – Type of positionwise layer.
- positionwise_conv_kernel_size (int) – Kernel size for positionwise layers.
- positional_encoding_layer_type (str) – Type of positional encoding layer.
- self_attention_layer_type (str) – Type of self-attention layer.
- activation_type (str) – Type of activation function.
- normalize_before (bool) – Whether to apply LayerNorm before attention.
- use_macaron_style (bool) – Whether to use Macaron-style components.
- use_conformer_conv (bool) – Whether to use convolution in conformer.
- conformer_kernel_size (int) – Kernel size for conformer convolution.
- dropout_rate (float) – Dropout rate for layers.
- positional_dropout_rate (float) – Dropout rate for positional encoding.
- attention_dropout_rate (float) – Dropout rate for attention layers.
- use_slur (bool) – Whether to use slur embedding.
#### Examples
>>> import torch
>>> text_encoder = TextEncoder(vocabs=5000)
>>> phone_tensor = torch.randint(0, 5000, (32, 100))
>>> phone_lengths = torch.randint(1, 101, (32,))
>>> midi_tensor = torch.randint(0, 129, (32, 100))
>>> duration_tensor = torch.rand((32, 100))
>>> encoded_output = text_encoder(phone_tensor, phone_lengths, midi_tensor,
... duration_tensor)
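The call returns a 4-tuple (see Returns below); unpacking it and checking shapes:

>>> x, x_mask, x_dur, x_pitch = encoded_output
>>> print(x.shape)       # (B, attention_dim, T_text)
>>> print(x_mask.shape)  # (B, 1, T_text)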
- Returns:
- Encoded hidden representation (B, attention_dim, T_text).
- Mask tensor for padded parts (B, 1, T_text).
- Encoded hidden representation for duration (B, attention_dim, T_text).
- Encoded hidden representation for pitch (B, attention_dim, T_text).
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
- Raises: ValueError – If the input tensors do not match the expected dimensions.
Initialize TextEncoder module.
- Parameters:
- vocabs (int) – Vocabulary size.
- attention_dim (int) – Attention dimension.
- attention_heads (int) – Number of attention heads.
- linear_units (int) – Number of linear units of positionwise layers.
- blocks (int) – Number of encoder blocks.
- positionwise_layer_type (str) – Positionwise layer type.
- positionwise_conv_kernel_size (int) – Positionwise layer’s kernel size.
- positional_encoding_layer_type (str) – Positional encoding layer type.
- self_attention_layer_type (str) – Self-attention layer type.
- activation_type (str) – Activation function type.
- normalize_before (bool) – Whether to apply LayerNorm before attention.
- use_macaron_style (bool) – Whether to use macaron style components.
- use_conformer_conv (bool) – Whether to use conformer conv layers.
- conformer_kernel_size (int) – Conformer’s conv kernel size.
- dropout_rate (float) – Dropout rate.
- positional_dropout_rate (float) – Dropout rate for positional encoding.
- attention_dropout_rate (float) – Dropout rate for attention.
- use_slur (bool) – Whether to use slur embedding.
forward(phone: Tensor, phone_lengths: Tensor, midi_id: Tensor, dur: Tensor, slur: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor, Tensor]
Text encoder module in VISinger.
This is the text encoder module described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Instead of the relative positional Transformer, we use a conformer architecture as the encoder module, which contains additional convolution layers.
- Parameters:
- phone (Tensor) – Input index tensor (B, T_text).
- phone_lengths (Tensor) – Length tensor (B,).
- midi_id (Tensor) – Input midi tensor (B, T_text).
- dur (Tensor) – Input duration tensor (B, T_text).
- slur (Optional[Tensor]) – Input slur tensor (B, T_text). Only used when use_slur is enabled. Defaults to None.
#### Examples
>>> import torch
>>> encoder = TextEncoder(vocabs=1000)
>>> phone_tensor = torch.randint(0, 1000, (2, 10))
>>> phone_lengths = torch.tensor([10, 8])
>>> midi_id = torch.randint(0, 129, (2, 10))
>>> dur = torch.rand((2, 10))
>>> x, x_mask, x_dur, x_pitch = encoder(phone_tensor, phone_lengths, midi_id, dur)
>>> print(x.shape)  # (B, attention_dim, T_text) -> torch.Size([2, 192, 10])
- Returns:
  - Encoded hidden representation (B, attention_dim, T_text).
  - Mask tensor for padded parts (B, 1, T_text).
  - Encoded hidden representation for duration (B, attention_dim, T_text).
  - Encoded hidden representation for pitch (B, attention_dim, T_text).
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
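When use_slur=True (the default), a slur tensor can also be passed to forward(); a sketch reusing the tensors from the example above and assuming slur is a per-phone binary indicator:

>>> slur = torch.randint(0, 2, (2, 10))
>>> x, x_mask, x_dur, x_pitch = encoder(phone_tensor, phone_lengths, midi_id, dur,
...                                     slur=slur)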