espnet2.tts.transformer.transformer.Transformer
class espnet2.tts.transformer.transformer.Transformer(idim: int, odim: int, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, dprenet_layers: int = 2, dprenet_units: int = 256, elayers: int = 6, eunits: int = 1024, adim: int = 512, aheads: int = 4, dlayers: int = 6, dunits: int = 1024, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, transformer_enc_dec_attn_dropout_rate: float = 0.1, eprenet_dropout_rate: float = 0.5, dprenet_dropout_rate: float = 0.5, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1', use_guided_attn_loss: bool = True, num_heads_applied_guided_attn: int = 2, num_layers_applied_guided_attn: int = 2, modules_applied_guided_attn: Sequence[str] = 'encoder-decoder', guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)
Bases: AbsTTS
Transformer-TTS module.
This is a module of the text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of tokens into a sequence of Mel-filterbanks.
idim
Dimension of the inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
eos
End of sequence token ID.
- Type: int
reduction_factor
Reduction factor for the output sequence.
- Type: int
use_gst
Whether to use global style token.
- Type: bool
use_guided_attn_loss
Whether to use guided attention loss.
- Type: bool
use_scaled_pos_enc
Whether to use trainable scaled positional encoding.
- Type: bool
loss_type
Type of loss function used in training.
- Type: str
padding_idx
Index used for padding in sequences.
- Type: int
Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of character embedding.
- eprenet_conv_layers (int) – Number of encoder prenet convolution layers.
- eprenet_conv_chans (int) – Number of encoder prenet convolution channels.
- eprenet_conv_filts (int) – Filter size of encoder prenet convolution.
- dprenet_layers (int) – Number of decoder prenet layers.
- dprenet_units (int) – Number of decoder prenet hidden units.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- adim (int) – Number of attention transformation dimensions.
- aheads (int) – Number of heads for multi-head attention.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Filter size of postnet.
- positionwise_layer_type (str) – Position-wise operation type.
- positionwise_conv_kernel_size (int) – Kernel size in position-wise conv 1d.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers.
- langs (Optional[int]) – Number of languages.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- use_gst (bool) – Whether to use global style token.
- gst_tokens (int) – Number of GST embeddings.
- gst_heads (int) – Number of heads in GST multihead attention.
- gst_conv_layers (int) – Number of conv layers in GST.
- gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.
- gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
- gst_conv_stride (int) – Stride size of conv layers in GST.
- gst_gru_layers (int) – Number of GRU layers in GST.
- gst_gru_units (int) – Number of GRU units in GST.
- transformer_enc_dropout_rate (float) – Dropout rate in encoder.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float) – Dropout rate in source attention module.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- eprenet_dropout_rate (float) – Dropout rate in encoder prenet.
- dprenet_dropout_rate (float) – Dropout rate in decoder prenet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float) – Positive sample weight in BCE calculation.
- loss_type (str) – How to calculate loss.
- use_guided_attn_loss (bool) – Whether to use guided attention loss.
- num_heads_applied_guided_attn (int) – Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int) – Number of layers to apply guided attention loss.
- modules_applied_guided_attn (Sequence[str]) – List of module names to apply guided attention loss.
- guided_attn_loss_sigma (float) – Sigma in guided attention loss (see the formula sketch below).
- guided_attn_loss_lambda (float) – Lambda in guided attention loss.
Returns: None
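The guided attention parameters control a soft diagonal prior on the encoder-decoder attention matrices. As a reference sketch (assuming the standard formulation of Tachibana et al., 2017, on which guided attention losses of this kind are based), the penalty weight for text position n of N and feature frame t of T is

W_{n,t} = 1 - \exp\left(-\frac{(n/N - t/T)^2}{2\sigma^2}\right), \qquad L_{\mathrm{ga}} = \lambda \cdot \frac{1}{NT} \sum_{n,t} W_{n,t} \, A_{n,t}

where A_{n,t} is an attention weight, guided_attn_loss_sigma is the sigma above (smaller values enforce a tighter diagonal), and guided_attn_loss_lambda is the lambda weighting this term in the total loss.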
######### Examples
>>> # create a Transformer instance
>>> transformer = Transformer(idim=256, odim=80)
>>> # forward pass
>>> loss, stats, weight = transformer.forward(
...     text_tensor, text_lengths, feats_tensor, feats_lengths
... )
>>> # inference
>>> output = transformer.inference(text_tensor)
Initialize Transformer module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of character embedding.
- eprenet_conv_layers (int) – Number of encoder prenet convolution layers.
- eprenet_conv_chans (int) – Number of encoder prenet convolution channels.
- eprenet_conv_filts (int) – Filter size of encoder prenet convolution.
- dprenet_layers (int) – Number of decoder prenet layers.
- dprenet_units (int) – Number of decoder prenet hidden units.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- adim (int) – Number of attention transformation dimensions.
- aheads (int) – Number of heads for multi-head attention.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Filter size of postnet.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- positionwise_layer_type (str) – Position-wise operation type.
- positionwise_conv_kernel_size (int) – Kernel size in position-wise conv 1d.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input (see the usage sketch after this parameter list).
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- use_gst (bool) – Whether to use global style token.
- gst_tokens (int) – Number of GST embeddings.
- gst_heads (int) – Number of heads in GST multihead attention.
- gst_conv_layers (int) – Number of conv layers in GST.
- gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.
- gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
- gst_conv_stride (int) – Stride size of conv layers in GST.
- gst_gru_layers (int) – Number of GRU layers in GST.
- gst_gru_units (int) – Number of GRU units in GST.
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float) – Dropout rate in source attention module.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- eprenet_dropout_rate (float) – Dropout rate in encoder prenet.
- dprenet_dropout_rate (float) – Dropout rate in decoder prenet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float) – Positive sample weight in BCE calculation (only for use_masking=True).
- loss_type (str) – How to calculate loss.
- use_guided_attn_loss (bool) – Whether to use guided attention loss.
- num_heads_applied_guided_attn (int) – Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int) – Number of layers to apply guided attention loss.
- modules_applied_guided_attn (Sequence[str]) – List of module names to apply guided attention loss.
- guided_attn_loss_sigma (float) – Sigma in guided attention loss.
- guided_attn_loss_lambda (float) – Lambda in guided attention loss.
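As a hedged usage sketch for the multi-speaker options above (the sizes and shapes below are illustrative assumptions, not values prescribed by this API):

>>> import torch
>>> from espnet2.tts.transformer.transformer import Transformer
>>> model = Transformer(
...     idim=78,             # token vocabulary size (assumption)
...     odim=80,             # Mel-filterbank dimension (assumption)
...     spks=10,             # > 1 enables the sid embedding layer
...     spk_embed_dim=192,   # > 0 enables external speaker embeddings
...     spk_embed_integration_type="add",
... )
>>> sids = torch.randint(0, 10, (2, 1))  # speaker IDs (B, 1)
>>> spembs = torch.randn(2, 192)         # speaker embeddings (B, spk_embed_dim)

These extra tensors are then passed to forward() and inference() via the sids and spembs arguments documented below.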
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
This method performs the forward pass of the Transformer model. It takes input text, target features, and optional speaker and language embeddings to generate predictions and calculate the loss.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- Returns:
- Loss scalar value.
- Statistics to be monitored.
- Weight value if not joint training, otherwise model outputs.
- Return type: Tuple[Tensor, Dict[str, torch.Tensor], Tensor]
######### Examples
>>> text = torch.tensor([[1, 2, 3], [1, 2, 0]])
>>> text_lengths = torch.tensor([3, 2])
>>> feats = torch.rand(2, 5, 80)
>>> feats_lengths = torch.tensor([5, 5])
>>> loss, stats, weight = model.forward(text, text_lengths, feats, feats_lengths)
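A minimal training-step sketch for these outputs (the optimizer choice is an assumption for illustration; in ESPnet the loop is normally driven by the trainer):

>>> optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
>>> loss, stats, weight = model.forward(text, text_lengths, feats, feats_lengths)
>>> optimizer.zero_grad()
>>> loss.backward()  # loss is a scalar tensor
>>> optimizer.step()
>>> sorted(stats)    # dict of scalar statistics (e.g., L1/BCE terms) for logging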
inference(text: Tensor, feats: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequences of characters.
This method performs inference using the provided input text and optional features. It generates the output features by passing the input through the encoder and decoder, with options for teacher forcing and length constraints.
- Parameters:
- text (LongTensor) – Input sequence of characters (T_text,).
- feats (Optional[Tensor]) – Feature sequence to extract style embedding (T_feats', odim).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- threshold (float) – Threshold in inference to determine stop probabilities.
- minlenratio (float) – Minimum length ratio in inference.
- maxlenratio (float) – Maximum length ratio in inference.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - prob (Tensor): Output sequence of stop probabilities (T_feats,).
  - att_w (Tensor): Source attention weights (#layers, #heads, T_feats, T_text).
- Return type: Dict[str, Tensor]
######### Examples
>>> text = torch.tensor([1, 2, 3, 4]) # Example character input
>>> feats = torch.rand(5, 80) # Example feature input
>>> output = model.inference(text, feats)
>>> print(output['feat_gen'].shape)
torch.Size([T_feats, odim]) # Shape of generated features
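The length-control and teacher-forcing options can be combined as in the following sketch (tensor shapes are illustrative assumptions; teacher forcing requires ground-truth feats):

>>> # free-running inference with a tighter length cap
>>> output = model.inference(text, maxlenratio=5.0, threshold=0.5)
>>> # teacher forcing: ground-truth features drive the decoder
>>> output = model.inference(text, feats=feats, use_teacher_forcing=True)
>>> output['prob'].shape  # stop probabilities (T_feats,)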