espnet2.gan_svs.vits.text_encoder.TextEncoder
class espnet2.gan_svs.vits.text_encoder.TextEncoder(vocabs: int, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 6, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, use_slur=True)
Bases: Module
TextEncoder class for encoding text input in the VISinger model.
This module implements the text encoder described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. Instead of the relative positional Transformer used in that paper, it uses a conformer architecture, which adds convolution layers to the standard attention mechanism.
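A minimal construction sketch: vocabs=5000 is a hypothetical vocabulary size, the other keyword values are the documented defaults, and the conformer convolution option is enabled here purely for illustration:

>>> from espnet2.gan_svs.vits.text_encoder import TextEncoder
>>> encoder = TextEncoder(
...     vocabs=5000,                # hypothetical vocabulary size
...     attention_dim=192,          # default attention dimension
...     attention_heads=2,          # default number of attention heads
...     use_conformer_conv=True,    # enable the conformer convolution module
...     conformer_kernel_size=7,    # default conformer conv kernel size
... )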
attention_dim
The dimension of attention.
- Type: int
encoder
The conformer-based encoder module.
- Type: Encoder
emb_phone_dim
The dimension of phone embeddings.
- Type: int
emb_phone
Embedding layer for phone inputs.
- Type: torch.nn.Embedding
emb_pitch_dim
The dimension of pitch embeddings.
- Type: int
emb_pitch
Embedding layer for pitch inputs.
- Type: torch.nn.Embedding
emb_slur
Embedding layer for slur inputs.
- Type: Optional[torch.nn.Embedding]
emb_dur
Linear layer for duration inputs.
- Type: torch.nn.Linear
pre_net
Preprocessing layer for the main input.
- Type: torch.nn.Linear
pre_dur_net
Preprocessing layer for duration input.
- Type: torch.nn.Linear
proj
Convolutional layer for projection.
- Type: torch.nn.Conv1d
proj_pitch
Convolutional layer for pitch projection.
- Type: torch.nn.Conv1d
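Each of these submodules is exposed as an attribute, so the layer types can be checked directly; a short sketch using the attribute names listed above:

>>> import torch
>>> from espnet2.gan_svs.vits.text_encoder import TextEncoder
>>> encoder = TextEncoder(vocabs=5000)
>>> isinstance(encoder.emb_phone, torch.nn.Embedding)
True
>>> isinstance(encoder.emb_dur, torch.nn.Linear)
True
>>> isinstance(encoder.proj, torch.nn.Conv1d)
True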
Parameters:
- vocabs (int) – Vocabulary size.
- attention_dim (int) – Dimension of the attention mechanism.
- attention_heads (int) – Number of attention heads.
- linear_units (int) – Number of linear units in positionwise layers.
- blocks (int) – Number of encoder blocks.
- positionwise_layer_type (str) – Type of positionwise layer.
- positionwise_conv_kernel_size (int) – Kernel size for positionwise layers.
- positional_encoding_layer_type (str) – Type of positional encoding layer.
- self_attention_layer_type (str) – Type of self-attention layer.
- activation_type (str) – Type of activation function.
- normalize_before (bool) – Whether to apply LayerNorm before attention.
- use_macaron_style (bool) – Whether to use Macaron-style components.
- use_conformer_conv (bool) – Whether to use convolution in conformer.
- conformer_kernel_size (int) – Kernel size for conformer convolution.
- dropout_rate (float) – Dropout rate for layers.
- positional_dropout_rate (float) – Dropout rate for positional encoding.
- attention_dropout_rate (float) – Dropout rate for attention layers.
- use_slur (bool) – Whether to use slur embedding.
#### Examples
>>> import torch
>>> text_encoder = TextEncoder(vocabs=5000)
>>> phone_tensor = torch.randint(0, 5000, (32, 100))
>>> phone_lengths = torch.randint(1, 101, (32,))
>>> midi_tensor = torch.randint(0, 129, (32, 100))
>>> duration_tensor = torch.rand((32, 100))
>>> encoded_output = text_encoder(phone_tensor, phone_lengths, midi_tensor,
... duration_tensor)
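The call returns a 4-tuple (see Returns below); unpacking it and checking shapes:

>>> x, x_mask, x_dur, x_pitch = encoded_output
>>> print(x.shape)       # (B, attention_dim, T_text)
>>> print(x_mask.shape)  # (B, 1, T_text)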
- Returns:
- Encoded hidden representation (B, attention_dim, T_text).
- Mask tensor for padded parts (B, 1, T_text).
- Encoded hidden representation for duration (B, attention_dim, T_text).
- Encoded hidden representation for pitch (B, attention_dim, T_text).
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
- Raises: ValueError – If the input tensors do not match the expected dimensions.
Initialize TextEncoder module.
- Parameters:
- vocabs (int) – Vocabulary size.
- attention_dim (int) – Attention dimension.
- attention_heads (int) – Number of attention heads.
- linear_units (int) – Number of linear units of positionwise layers.
- blocks (int) – Number of encoder blocks.
- positionwise_layer_type (str) – Positionwise layer type.
- positionwise_conv_kernel_size (int) – Positionwise layer’s kernel size.
- positional_encoding_layer_type (str) – Positional encoding layer type.
- self_attention_layer_type (str) – Self-attention layer type.
- activation_type (str) – Activation function type.
- normalize_before (bool) – Whether to apply LayerNorm before attention.
- use_macaron_style (bool) – Whether to use macaron style components.
- use_conformer_conv (bool) – Whether to use conformer conv layers.
- conformer_kernel_size (int) – Conformer’s conv kernel size.
- dropout_rate (float) – Dropout rate.
- positional_dropout_rate (float) – Dropout rate for positional encoding.
- attention_dropout_rate (float) – Dropout rate for attention.
- use_slur (bool) – Whether to use slur embedding.
forward(phone: Tensor, phone_lengths: Tensor, midi_id: Tensor, dur: Tensor, slur: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor, Tensor]
Text encoder module in VISinger.
This is the text encoder module described in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Instead of the relative positional Transformer, we use a conformer architecture as the encoder module, which contains additional convolution layers.
- Parameters:
- phone (Tensor) – Input index tensor (B, T_text).
- phone_lengths (Tensor) – Length tensor (B,).
- midi_id (Tensor) – Input midi tensor (B, T_text).
- dur (Tensor) – Input duration tensor (B, T_text).
- slur (Optional[Tensor]) – Input slur tensor (B, T_text). Only used when use_slur is enabled. Defaults to None.
#### Examples
>>> import torch
>>> encoder = TextEncoder(vocabs=1000)
>>> phone_tensor = torch.randint(0, 1000, (2, 10))
>>> phone_lengths = torch.tensor([10, 8])
>>> midi_id = torch.randint(0, 129, (2, 10))
>>> dur = torch.rand((2, 10))
>>> x, x_mask, x_dur, x_pitch = encoder(phone_tensor, phone_lengths, midi_id, dur)
>>> print(x.shape)  # (B, attention_dim, T_text) -> torch.Size([2, 192, 10])
- Returns:
  - Encoded hidden representation (B, attention_dim, T_text).
  - Mask tensor for padded parts (B, 1, T_text).
  - Encoded hidden representation for duration (B, attention_dim, T_text).
  - Encoded hidden representation for pitch (B, attention_dim, T_text).
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
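When use_slur=True (the default), a slur tensor can also be passed to forward(); a sketch reusing the tensors from the example above and assuming slur is a per-phone binary indicator:

>>> slur = torch.randint(0, 2, (2, 10))
>>> x, x_mask, x_dur, x_pitch = encoder(phone_tensor, phone_lengths, midi_id, dur,
...                                     slur=slur)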