espnet2.svs.singing_tacotron.singing_tacotron.singing_tacotron
class espnet2.svs.singing_tacotron.singing_tacotron.singing_tacotron(idim: int, odim: int, midi_dim: int = 129, duration_dim: int = 500, embed_dim: int = 512, elayers: int = 1, eunits: int = 512, econv_layers: int = 3, econv_chans: int = 512, econv_filts: int = 5, atype: str = 'GDCA', adim: int = 512, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 2, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 256, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: str | None = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'concat', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, dropout_rate: float = 0.5, zoneout_rate: float = 0.1, use_masking: bool = True, use_weighted_masking: bool = False, bce_pos_weight: float = 5.0, loss_type: str = 'L1', use_guided_attn_loss: bool = True, guided_attn_loss_sigma: float = 0.4, guided_attn_loss_lambda: float = 1.0)
Bases: AbsSVS
Singing Tacotron related modules for ESPnet2.
This module implements Singing Tacotron, a spectrogram prediction network for end-to-end singing voice synthesis, described in `Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis` (https://arxiv.org/pdf/2202.07907v1.pdf).
This module learns accurate alignment information automatically, and can be configured with various hyperparameters related to the encoder, decoder, and training settings.
idim
Dimension of the label inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
eos
End-of-sequence token index.
- Type: int
midi_eos
End-of-sequence token index for MIDI.
- Type: int
duration_eos
End-of-sequence token index for duration.
- Type: int
cumulate_att_w
Whether to cumulate previous attention weight.
- Type: bool
reduction_factor
Reduction factor.
- Type: int
use_gst
Whether to use global style token.
- Type: bool
use_guided_attn_loss
Whether to use guided attention loss.
- Type: bool
loss_type
Type of loss function to use (“L1”, “L2”, or “L1+L2”).
- Type: str
Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int, optional) – Dimension of MIDI inputs. Defaults to 129.
- duration_dim (int, optional) – Dimension of duration inputs. Defaults to 500.
- embed_dim (int, optional) – Dimension of the token embedding. Defaults to 512.
- elayers (int, optional) – Number of encoder BLSTM layers. Defaults to 1.
- eunits (int, optional) – Number of encoder BLSTM units. Defaults to 512.
- econv_layers (int, optional) – Number of encoder convolution layers. Defaults to 3.
- econv_chans (int, optional) – Number of encoder convolution filter channels. Defaults to 512.
- econv_filts (int, optional) – Size of encoder convolution filters. Defaults to 5.
- atype (str, optional) – Type of attention mechanism to use. Defaults to “GDCA”.
- adim (int, optional) – Dimension of MLP in attention. Defaults to 512.
- aconv_chans (int, optional) – Number of attention convolution filter channels. Defaults to 32.
- aconv_filts (int, optional) – Size of attention convolution filters. Defaults to 15.
- cumulate_att_w (bool, optional) – Whether to cumulate previous attention weight. Defaults to True.
- dlayers (int, optional) – Number of decoder LSTM layers. Defaults to 2.
- dunits (int, optional) – Number of decoder LSTM units. Defaults to 1024.
- prenet_layers (int, optional) – Number of prenet layers. Defaults to 2.
- prenet_units (int, optional) – Number of prenet units. Defaults to 256.
- postnet_layers (int, optional) – Number of postnet layers. Defaults to 5.
- postnet_chans (int, optional) – Number of postnet filter channels. Defaults to 512.
- postnet_filts (int, optional) – Size of postnet filters. Defaults to 5.
- output_activation (Optional[str], optional) – Name of activation function for outputs. Defaults to None.
- use_batch_norm (bool, optional) – Whether to use batch normalization. Defaults to True.
- use_concate (bool, optional) – Whether to concatenate encoder outputs with decoder outputs. Defaults to True.
- use_residual (bool, optional) – Whether to use residual connections. Defaults to False.
- reduction_factor (int, optional) – Reduction factor. Defaults to 1.
- spks (Optional[int], optional) – Number of speakers. Defaults to None.
- langs (Optional[int], optional) – Number of languages. Defaults to None.
- spk_embed_dim (Optional[int], optional) – Speaker embedding dimension. Defaults to None.
- spk_embed_integration_type (str, optional) – Method to integrate speaker embedding. Defaults to “concat”.
- use_gst (bool, optional) – Whether to use global style token. Defaults to False.
- gst_tokens (int, optional) – Number of GST embeddings. Defaults to 10.
- gst_heads (int, optional) – Number of heads in GST multihead attention. Defaults to 4.
- gst_conv_layers (int, optional) – Number of conv layers in GST. Defaults to 6.
- gst_conv_chans_list (Sequence[int], optional) – List of channels for conv layers in GST. Defaults to (32, 32, 64, 64, 128, 128).
- gst_conv_kernel_size (int, optional) – Kernel size of conv layers in GST. Defaults to 3.
- gst_conv_stride (int, optional) – Stride size of conv layers in GST. Defaults to 2.
- gst_gru_layers (int, optional) – Number of GRU layers in GST. Defaults to 1.
- gst_gru_units (int, optional) – Number of GRU units in GST. Defaults to 128.
- dropout_rate (float, optional) – Dropout rate. Defaults to 0.5.
- zoneout_rate (float, optional) – Zoneout rate. Defaults to 0.1.
- use_masking (bool, optional) – Whether to mask padded parts in loss calculation. Defaults to True.
- use_weighted_masking (bool, optional) – Whether to apply weighted masking in loss calculation. Defaults to False.
- bce_pos_weight (float, optional) – Weight of positive sample of stop token (only for use_masking=True). Defaults to 5.0.
- loss_type (str, optional) – Loss function type (“L1”, “L2”, or “L1+L2”). Defaults to “L1”.
- use_guided_attn_loss (bool, optional) – Whether to use guided attention loss (see the sketch after this list). Defaults to True.
- guided_attn_loss_sigma (float, optional) – Sigma in guided attention loss. Defaults to 0.4.
- guided_attn_loss_lambda (float, optional) – Lambda in guided attention loss. Defaults to 1.0.
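Conceptually, the guided attention loss penalizes attention probability mass that falls far from the time-aligned diagonal: guided_attn_loss_sigma controls the width of the allowed band and guided_attn_loss_lambda scales the penalty. The following is a minimal, illustrative sketch of this soft-diagonal penalty, not ESPnet's actual implementation:
>>> import torch
>>> def guided_attn_weight(T_in, T_out, sigma=0.4):
...     # Soft-diagonal penalty: W[t, n] = 1 - exp(-(n/T_in - t/T_out)^2 / (2 * sigma^2))
...     n = torch.arange(T_in).float() / T_in
...     t = torch.arange(T_out).float() / T_out
...     return 1.0 - torch.exp(-((n.unsqueeze(0) - t.unsqueeze(1)) ** 2) / (2 * sigma ** 2))
>>> att_w = torch.softmax(torch.randn(100, 50), dim=-1)  # dummy attention (T_out, T_in)
>>> loss = 1.0 * (guided_attn_weight(50, 100) * att_w).mean()  # scale = guided_attn_loss_lambda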
Examples
>>> import torch
>>> model = singing_tacotron(idim=40, odim=80)
>>> text = torch.randint(0, 40, (1, 50))  # random label ids
>>> text_lengths = torch.tensor([50])  # length of text input
>>> feats = torch.randn(1, 200, 80)  # random features (B, T_feats, odim)
>>> feats_lengths = torch.tensor([200])  # length of feature input
>>> output = model.forward(text, text_lengths, feats, feats_lengths)
Initialize Singing Tacotron module.
- Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of the token embedding.
- elayers (int) – Number of encoder BLSTM layers.
- eunits (int) – Number of encoder BLSTM units.
- econv_layers (int) – Number of encoder conv layers.
- econv_filts (int) – Size of encoder conv filters.
- econv_chans (int) – Number of encoder conv filter channels.
- dlayers (int) – Number of decoder LSTM layers.
- dunits (int) – Number of decoder LSTM units.
- prenet_layers (int) – Number of prenet layers.
- prenet_units (int) – Number of prenet units.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Size of postnet filters.
- postnet_chans (int) – Number of postnet filter channels.
- output_activation (str) – Name of activation function for outputs.
- adim (int) – Dimension of MLP in attention.
- aconv_chans (int) – Number of attention conv filter channels.
- aconv_filts (int) – Size of attention conv filters.
- cumulate_att_w (bool) – Whether to cumulate previous attention weight.
- use_batch_norm (bool) – Whether to use batch normalization.
- use_concate (bool) – Whether to concat encoder outputs w/ decoder LSTM outputs.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer (see the sketch after this list).
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- use_gst (bool) – Whether to use global style token.
- gst_tokens (int) – Number of GST embeddings.
- gst_heads (int) – Number of heads in GST multihead attention.
- gst_conv_layers (int) – Number of conv layers in GST.
- gst_conv_chans_list (Sequence[int]) – List of the number of channels of conv layers in GST.
- gst_conv_kernel_size (int) – Kernel size of conv layers in GST.
- gst_conv_stride (int) – Stride size of conv layers in GST.
- gst_gru_layers (int) – Number of GRU layers in GST.
- gst_gru_units (int) – Number of GRU units in GST.
- dropout_rate (float) – Dropout rate.
- zoneout_rate (float) – Zoneout rate.
- use_masking (bool) – Whether to mask padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float) – Weight of positive sample of stop token (only for use_masking=True).
- loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
- use_guided_attn_loss (bool) – Whether to use guided attention loss.
- guided_attn_loss_sigma (float) – Sigma in guided attention loss.
- guided_attn_loss_lambda (float) – Lambda in guided attention loss.
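As an illustration of the conditioning options above, a multi-speaker instantiation might look as follows. This is a minimal sketch: the dimension values are placeholders, and spk_embed_dim=192 merely assumes an x-vector-style speaker embedding extractor:
>>> from espnet2.svs.singing_tacotron.singing_tacotron import singing_tacotron
>>> model = singing_tacotron(
...     idim=40, odim=80,
...     spks=4,                  # enables the sid embedding layer
...     spk_embed_dim=192,       # assumed x-vector dimension
...     spk_embed_integration_type="concat",
...     use_gst=True,            # enable global style tokens
... )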
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, melody_lengths: Dict[str, Tensor] | None = None, duration: Dict[str, Tensor] | None = None, duration_lengths: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, slur: LongTensor | None = None, slur_lengths: Tensor | None = None, ying: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False, flag_IsValid=False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
This method performs forward propagation through the Singing Tacotron model, processing the input sequences and generating the corresponding output features along with the associated statistics.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, T_text).
- text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (Optional[Dict]) – Key is “lab” or “score”; value is (LongTensor): Batch of padded label ids (B, Tmax). See the dict-construction sketch after the examples below.
- label_lengths (Optional[Dict]) – Key is “lab” or “score”; value is (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – Key is “lab” or “score”; value is (LongTensor): Batch of padded melody (B, Tmax).
- melody_lengths (Optional[Dict]) – Key is “lab” or “score”; value is (LongTensor): Batch of the lengths of padded melody (B,).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B,).
- duration (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value is (LongTensor): Batch of padded duration (B, Tmax).
- duration_lengths (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value is (LongTensor): Batch of the lengths of padded duration (B,).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- slur_lengths (LongTensor) – Batch of the lengths of padded slur (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- flag_IsValid (bool) – Flag indicating whether this forward pass is run for validation.
- Returns: Tuple containing:
  - Tensor: Loss scalar value.
  - Dict[str, Tensor]: Statistics to be monitored.
  - Tensor: Weight value if not joint training, else model outputs.
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
Examples
>>> import torch
>>> model = singing_tacotron(idim=100, odim=80)
>>> text = torch.randint(0, 100, (32, 50))  # batch of text inputs
>>> text_lengths = torch.randint(10, 50, (32,))  # lengths of texts
>>> feats = torch.rand(32, 100, 80)  # batch of target features (B, T_feats, odim)
>>> feats_lengths = torch.randint(80, 100, (32,))  # lengths of feats
>>> loss, stats, weight = model.forward(text, text_lengths, feats,
...     feats_lengths)
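The label, melody, and duration arguments are dictionaries keyed as described in the parameter list. A hedged sketch of how they might be assembled for the call above (random placeholder values; which keys are actually required depends on the data preparation):
>>> label = {"lab": torch.randint(0, 100, (32, 50)),
...          "score": torch.randint(0, 100, (32, 50))}
>>> label_lengths = {"lab": text_lengths, "score": text_lengths}
>>> melody = {"lab": torch.randint(0, 129, (32, 50)),
...           "score": torch.randint(0, 129, (32, 50))}
>>> melody_lengths = {"lab": text_lengths, "score": text_lengths}
>>> duration = {"lab": torch.randint(1, 10, (32, 50)),
...             "score_phn": torch.randint(1, 10, (32, 50)),
...             "score_syb": torch.randint(1, 10, (32, 50))}
>>> duration_lengths = {k: text_lengths for k in duration}
>>> loss, stats, weight = model.forward(
...     text, text_lengths, feats, feats_lengths,
...     label=label, label_lengths=label_lengths,
...     melody=melody, melody_lengths=melody_lengths,
...     duration=duration, duration_lengths=duration_lengths)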
inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 30.0, use_att_constraint: bool = False, use_dynamic_filter: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequences of characters.
- Parameters:
- text (LongTensor) – Input sequence of characters (T_text,).
- feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
- pitch (FloatTensor) – Batch of padded f0 (Tmax).
- duration (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- threshold (float) – Threshold in inference.
- minlenratio (float) – Minimum length ratio in inference.
- maxlenratio (float) – Maximum length ratio in inference.
- use_att_constraint (bool) – Whether to apply attention constraint.
- use_dynamic_filter (bool) – Whether to apply dynamic filter.
- backward_window (int) – Backward window in attention constraint or dynamic filter.
- forward_window (int) – Forward window in attention constraint or dynamic filter.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - prob (Tensor): Output sequence of stop probabilities (T_feats,).
  - att_w (Tensor): Attention weights (T_feats, T).
- Return type: Dict[str, Tensor]
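A usage sketch, reusing a model constructed as in the class example above (unbatched random placeholders stand in for real score/label preprocessing outputs; the exact dict keys depend on the data preparation):
>>> import torch
>>> out = model.inference(
...     torch.randint(0, 40, (50,)),                 # text (T_text,)
...     label={"score": torch.randint(0, 40, (50,))},
...     melody={"score": torch.randint(0, 129, (50,))},
...     duration={"score_phn": torch.randint(1, 10, (50,)),
...               "score_syb": torch.randint(1, 10, (50,))},
...     use_dynamic_filter=True)
>>> feat_gen = out["feat_gen"]  # (T_feats, odim)
>>> att_w = out["att_w"]        # (T_feats, T)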