espnet2.tts.transformer.transformer.Transformer
class espnet2.tts.transformer.transformer.Transformer(idim: int, odim: int, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, dprenet_layers: int = 2, dprenet_units: int = 256, elayers: int = 6, eunits: int = 1024, adim: int = 512, aheads: int = 4, dlayers: int = 6, dunits: int = 1024, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, transformer_enc_dec_attn_dropout_rate: float = 0.1, eprenet_dropout_rate: float = 0.5, dprenet_dropout_rate: float = 0.5, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1', use_guided_attn_loss: bool = True, num_heads_applied_guided_attn: int = 2, num_layers_applied_guided_attn: int = 2, modules_applied_guided_attn: Sequence[str] = 'encoder-decoder', guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)
Bases: AbsTTS
Transformer-TTS module.
This is a module of the text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of tokens into a sequence of Mel-filterbanks.
idim
Dimension of the inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
eos
End of sequence token ID.
- Type: int
reduction_factor
Reduction factor for the output sequence.
- Type: int
use_gst
Whether to use global style token.
- Type: bool
use_guided_attn_loss
Whether to use guided attention loss.
- Type: bool
use_scaled_pos_enc
Whether to use trainable scaled positional encoding.
- Type: bool
loss_type
Type of loss function used in training.
- Type: str
padding_idx
Index used for padding in sequences.
- Type: int
Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of character embedding.
- eprenet_conv_layers (int) – Number of encoder prenet convolution layers.
- eprenet_conv_chans (int) – Number of encoder prenet convolution channels.
- eprenet_conv_filts (int) – Filter size of encoder prenet convolution.
- dprenet_layers (int) – Number of decoder prenet layers.
- dprenet_units (int) – Number of decoder prenet hidden units.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- adim (int) – Number of attention transformation dimensions.
- aheads (int) – Number of heads for multi-head attention.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Filter size of postnet.
- positionwise_layer_type (str) – Position-wise operation type.
- positionwise_conv_kernel_size (int) – Kernel size in position-wise conv 1d.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers.
- langs (Optional[int]) – Number of languages.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- use_gst (bool) – Whether to use global style token.
- gst_tokens (int) – Number of GST embeddings.
- gst_heads (int) – Number of heads in GST multihead attention.
- gst_conv_layers (int) – Number of conv layers in GST.
- gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.
- gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
- gst_conv_stride (int) – Stride size of conv layers in GST.
- gst_gru_layers (int) – Number of GRU layers in GST.
- gst_gru_units (int) – Number of GRU units in GST.
- transformer_enc_dropout_rate (float) – Dropout rate in encoder.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float) – Dropout rate in source attention module.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- eprenet_dropout_rate (float) – Dropout rate in encoder prenet.
- dprenet_dropout_rate (float) – Dropout rate in decoder prenet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float) – Positive sample weight in BCE calculation.
- loss_type (str) – How to calculate loss.
- use_guided_attn_loss (bool) – Whether to use guided attention loss.
- num_heads_applied_guided_attn (int) – Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int) – Number of layers to apply guided attention loss.
- modules_applied_guided_attn (Sequence[str]) – List of module names to apply guided attention loss.
- guided_attn_loss_sigma (float) – Sigma in guided attention loss (see the formula sketch below).
- guided_attn_loss_lambda (float) – Lambda in guided attention loss.
Returns: None
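The guided attention parameters control a soft diagonal prior on the encoder-decoder attention matrices. As a reference sketch (assuming the standard formulation of Tachibana et al., 2017, on which guided attention losses of this kind are based), the penalty weight for text position n of N and feature frame t of T is

W_{n,t} = 1 - \exp\left(-\frac{(n/N - t/T)^2}{2\sigma^2}\right), \qquad L_{\mathrm{ga}} = \lambda \cdot \frac{1}{NT} \sum_{n,t} W_{n,t} \, A_{n,t}

where A_{n,t} is an attention weight, guided_attn_loss_sigma is the sigma above (smaller values enforce a tighter diagonal), and guided_attn_loss_lambda is the lambda weighting this term in the total loss.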
######### Examples
>>> # create a Transformer instance
>>> transformer = Transformer(idim=256, odim=80)
>>> # forward pass
>>> loss, stats, weight = transformer.forward(
...     text_tensor, text_lengths, feats_tensor, feats_lengths
... )
>>> # inference
>>> output = transformer.inference(text_tensor)
Initialize Transformer module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of character embedding.
- eprenet_conv_layers (int) – Number of encoder prenet convolution layers.
- eprenet_conv_chans (int) – Number of encoder prenet convolution channels.
- eprenet_conv_filts (int) – Filter size of encoder prenet convolution.
- dprenet_layers (int) – Number of decoder prenet layers.
- dprenet_units (int) – Number of decoder prenet hidden units.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- adim (int) – Number of attention transformation dimensions.
- aheads (int) – Number of heads for multi-head attention.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Filter size of postnet.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- positionwise_layer_type (str) – Position-wise operation type.
- positionwise_conv_kernel_size (int) – Kernel size in position-wise conv 1d.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input (see the usage sketch after this parameter list).
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- use_gst (bool) – Whether to use global style token.
- gst_tokens (int) – Number of GST embeddings.
- gst_heads (int) – Number of heads in GST multihead attention.
- gst_conv_layers (int) – Number of conv layers in GST.
- gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.
- gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
- gst_conv_stride (int) – Stride size of conv layers in GST.
- gst_gru_layers (int) – Number of GRU layers in GST.
- gst_gru_units (int) – Number of GRU units in GST.
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float) – Dropout rate in source attention module.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- eprenet_dropout_rate (float) – Dropout rate in encoder prenet.
- dprenet_dropout_rate (float) – Dropout rate in decoder prenet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float) – Positive sample weight in BCE calculation (only for use_masking=True).
- loss_type (str) – How to calculate loss.
- use_guided_attn_loss (bool) – Whether to use guided attention loss.
- num_heads_applied_guided_attn (int) – Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int) – Number of layers to apply guided attention loss.
- modules_applied_guided_attn (Sequence[str]) – List of module names to apply guided attention loss.
- guided_attn_loss_sigma (float) – Sigma in guided attention loss.
- guided_attn_loss_lambda (float) – Lambda in guided attention loss.
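As a hedged usage sketch for the multi-speaker options above (the sizes and shapes below are illustrative assumptions, not values prescribed by this API):

>>> import torch
>>> from espnet2.tts.transformer.transformer import Transformer
>>> model = Transformer(
...     idim=78,             # token vocabulary size (assumption)
...     odim=80,             # Mel-filterbank dimension (assumption)
...     spks=10,             # > 1 enables the sid embedding layer
...     spk_embed_dim=192,   # > 0 enables external speaker embeddings
...     spk_embed_integration_type="add",
... )
>>> sids = torch.randint(0, 10, (2, 1))  # speaker IDs (B, 1)
>>> spembs = torch.randn(2, 192)         # speaker embeddings (B, spk_embed_dim)

These extra tensors are then passed to forward() and inference() via the sids and spembs arguments documented below.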
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
This method performs the forward pass of the Transformer model. It takes input text, target features, and optional speaker and language embeddings to generate predictions and calculate the loss.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- Returns:
- Loss scalar value.
- Statistics to be monitored.
- Weight value if not joint training, otherwise model outputs.
- Return type: Tuple[Tensor, Dict[str, torch.Tensor], Tensor]
######### Examples
>>> text = torch.tensor([[1, 2, 3], [1, 2, 0]])
>>> text_lengths = torch.tensor([3, 2])
>>> feats = torch.rand(2, 5, 80)
>>> feats_lengths = torch.tensor([5, 5])
>>> loss, stats, weight = model.forward(text, text_lengths, feats, feats_lengths)
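A minimal training-step sketch for these outputs (the optimizer choice is an assumption for illustration; in ESPnet the loop is normally driven by the trainer):

>>> optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
>>> loss, stats, weight = model.forward(text, text_lengths, feats, feats_lengths)
>>> optimizer.zero_grad()
>>> loss.backward()  # loss is a scalar tensor
>>> optimizer.step()
>>> sorted(stats)    # dict of scalar statistics (e.g., L1/BCE terms) for logging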
inference(text: Tensor, feats: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequences of characters.
This method performs inference using the provided input text and optional features. It generates the output features by passing the input through the encoder and decoder, with options for teacher forcing and length constraints.
- Parameters:
- text (LongTensor) – Input sequence of characters (T_text,).
- feats (Optional[Tensor]) – Feature sequence to extract style embedding (T_feats', odim).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- threshold (float) – Threshold in inference to determine stop probabilities.
- minlenratio (float) – Minimum length ratio in inference.
- maxlenratio (float) – Maximum length ratio in inference.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - prob (Tensor): Output sequence of stop probabilities (T_feats,).
  - att_w (Tensor): Source attention weights (#layers, #heads, T_feats, T_text).
- Return type: Dict[str, Tensor]
######### Examples
>>> text = torch.tensor([1, 2, 3, 4]) # Example character input
>>> feats = torch.rand(5, 80) # Example feature input
>>> output = model.inference(text, feats)
>>> print(output['feat_gen'].shape)
torch.Size([T_feats, odim]) # Shape of generated features
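The length-control and teacher-forcing options can be combined as in the following sketch (tensor shapes are illustrative assumptions; teacher forcing requires ground-truth feats):

>>> # free-running inference with a tighter length cap
>>> output = model.inference(text, maxlenratio=5.0, threshold=0.5)
>>> # teacher forcing: ground-truth features drive the decoder
>>> output = model.inference(text, feats=feats, use_teacher_forcing=True)
>>> output['prob'].shape  # stop probabilities (T_feats,)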