espnet2.svs.xiaoice.XiaoiceSing.XiaoiceSing
class espnet2.svs.xiaoice.XiaoiceSing.XiaoiceSing(idim: int, odim: int, midi_dim: int = 129, duration_dim: int = 500, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, postnet_dropout_rate: float = 0.5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, loss_function: str = 'XiaoiceSing2', loss_type: str = 'L1', lambda_mel: float = 1, lambda_dur: float = 0.1, lambda_pitch: float = 0.01, lambda_vuv: float = 0.01)
Bases: AbsSVS
XiaoiceSing module for Singing Voice Synthesis.
This module implements a high-quality singing voice synthesis system, utilizing an integrated network for spectrum, F0, and duration modeling. It follows the main architecture of FastSpeech while incorporating several singing-specific design features:
- Incorporation of musical score features (e.g., note pitch and length).
- Residual connections in F0 prediction to mitigate off-key issues.
- Accumulation of phoneme durations within a musical note to compute a syllable duration loss, which improves rhythm (see the sketch below).
For more information, refer to the paper: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System: https://arxiv.org/pdf/2006.06261.pdf
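To make the syllable-loss idea concrete, here is a minimal, hypothetical sketch (not the module's actual implementation): predicted phoneme durations are summed within each syllable/note and compared against the target syllable durations with an L1 criterion. The function name and shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def syllable_duration_loss(pred_phn_dur, syllable_ids, target_syb_dur):
    # pred_phn_dur:   (T_phn,) predicted duration per phoneme, in frames
    # syllable_ids:   (T_phn,) index of the syllable/note each phoneme belongs to
    # target_syb_dur: (N_syb,) target duration per syllable, in frames
    n_syb = target_syb_dur.size(0)
    # Accumulate phoneme durations that share the same syllable index.
    pred_syb_dur = torch.zeros(n_syb).index_add_(0, syllable_ids, pred_phn_dur)
    return F.l1_loss(pred_syb_dur, target_syb_dur)

# Toy check: two syllables containing two and three phonemes, respectively.
pred = torch.tensor([3.0, 4.0, 2.0, 2.0, 3.0])
ids = torch.tensor([0, 0, 1, 1, 1])
target = torch.tensor([7.0, 8.0])
print(syllable_duration_loss(pred, ids, target))  # tensor(0.5000)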
idim
Dimension of the label inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
midi_dim
Dimension of the midi inputs.
- Type: int
duration_dim
Dimension of the duration inputs.
- Type: int
eos
End-of-sequence token index.
- Type: int
reduction_factor
Reduction factor for the model outputs.
- Type: int
encoder_type
Type of encoder (“transformer” or “conformer”).
- Type: str
decoder_type
Type of decoder (“transformer” or “conformer”).
- Type: str
use_scaled_pos_enc
Flag indicating whether to use scaled pos encoding.
- Type: bool
loss_function
Selected loss function (“FastSpeech1” or “XiaoiceSing2”).
- Type: str
loss_type
Type of mel loss (“L1”, “L2”, or “L1+L2”).
- Type: str
lambda_mel
Scaling coefficient for Mel loss.
- Type: float
lambda_dur
Scaling coefficient for duration loss.
- Type: float
lambda_pitch
Scaling coefficient for pitch loss.
- Type: float
lambda_vuv
Scaling coefficient for VUV loss.
- Type: float
Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- duration_dim (int) – Dimension of the duration inputs.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Kernel size of postnet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention input and output in decoder.
- duration_predictor_layers (int) – Number of duration predictor layers.
- duration_predictor_chans (int) – Number of duration predictor channels.
- duration_predictor_kernel_size (int) – Kernel size of duration predictor.
- duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
- reduction_factor (int) – Reduction factor.
- encoder_type (str) – Encoder type (“transformer” or “conformer”).
- decoder_type (str) – Decoder type (“transformer” or “conformer”).
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- spks (Optional[int]) – Number of speakers (if > 1, use sid embedding).
- langs (Optional[int]) – Number of languages (if > 1, use lid embedding).
- spk_embed_dim (Optional[int]) – Speaker embedding dimension (if > 0, use spembs).
- spk_embed_integration_type (str) – Method to integrate speaker embedding.
- init_type (str) – Parameter initialization method.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding for encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding for decoder.
- use_masking (bool) – Whether to apply masking for padded parts in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_function (str) – Loss functions (“FastSpeech1” or “XiaoiceSing2”).
- loss_type (str) – Mel loss type (“L1”, “L2”, or “L1+L2”).
- lambda_mel (float) – Loss scaling coefficient for Mel loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_pitch (float) – Loss scaling coefficient for pitch loss.
- lambda_vuv (float) – Loss scaling coefficient for VUV loss.
######### Examples
Initialize the XiaoiceSing module:

xiaoice_sing = XiaoiceSing(idim=40, odim=80)

Perform forward propagation:

loss, stats, output = xiaoice_sing.forward(
    text=text_tensor,
    text_lengths=text_lengths_tensor,
    feats=features_tensor,
    feats_lengths=features_lengths_tensor,
    label=label_dict,
    label_lengths=label_lengths_dict,
    melody=melody_dict,
    melody_lengths=melody_lengths_dict,
    pitch=pitch_tensor,
    pitch_lengths=pitch_lengths_tensor,
    duration=duration_dict,
    duration_lengths=duration_lengths_dict,
    slur=slur_tensor,
    slur_lengths=slur_lengths_tensor,
    spembs=speaker_embeddings_tensor,
    sids=speaker_ids_tensor,
    lids=language_ids_tensor,
    joint_training=False,
)
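For intuition, the documented mel-loss options map to standard criteria: “L1” is mean absolute error, “L2” is mean squared error, and “L1+L2” combines both. Below is a minimal sketch under that reading (not the module's internal code path); the weighted total with the lambda_* coefficients is likewise shown only as an assumption.

import torch.nn.functional as F

def mel_loss(pred, target, loss_type="L1"):
    # "L1" = MAE, "L2" = MSE, "L1+L2" = sum of both (sketch only).
    l1 = F.l1_loss(pred, target)
    l2 = F.mse_loss(pred, target)
    return {"L1": l1, "L2": l2, "L1+L2": l1 + l2}[loss_type]

# Assumed overall weighting (see the lambda_* parameters above):
# loss = lambda_mel * mel + lambda_dur * dur + lambda_pitch * pitch + lambda_vuv * vuv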
Initialize XiaoiceSing module. An example instantiation follows the parameter list below.
- Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- duration_dim (int) – Dimension of the duration inputs.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Kernel size of postnet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- duration_predictor_layers (int) – Number of duration predictor layers.
- duration_predictor_chans (int) – Number of duration predictor channels.
- duration_predictor_kernel_size (int) – Kernel size of duration predictor.
- duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
- reduction_factor (int) – Reduction factor.
- encoder_type (str) – Encoder type (“transformer” or “conformer”).
- decoder_type (str) – Decoder type (“transformer” or “conformer”).
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_function (str) – Loss functions (“FastSpeech1” or “XiaoiceSing2”)
- loss_type (str) – Mel loss type (“L1” (MAE), “L2” (MSE) or “L1+L2”)
- lambda_mel (float) – Loss scaling coefficient for Mel loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_pitch (float) – Loss scaling coefficient for pitch loss.
- lambda_vuv (float) – Loss scaling coefficient for VUV loss.
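As a reference, here is a hypothetical instantiation that uses only parameters from the signature above; the dimensions are placeholders rather than recipe defaults.

from espnet2.svs.xiaoice.XiaoiceSing import XiaoiceSing

model = XiaoiceSing(
    idim=40,                      # label input dimension (placeholder)
    odim=80,                      # output feature dimension (placeholder)
    encoder_type="conformer",     # documented alternative to "transformer"
    decoder_type="conformer",
    spk_embed_dim=192,            # expect external speaker embeddings (spembs)
    spk_embed_integration_type="add",
    loss_function="FastSpeech1",
    loss_type="L1+L2",
)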
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, melody_lengths: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, duration: Dict[str, Tensor] | None = None, duration_lengths: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, slur_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False, flag_IsValid=False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
This method performs forward propagation through the XiaoiceSing model. It processes the input text, features, and various optional parameters to compute the output and loss values.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, T_text).
- text_lengths (LongTensor) – Batch of lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B,).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B,).
- duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
- duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B,).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- slur_lengths (LongTensor) – Batch of the lengths of padded slur (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- flag_IsValid (bool) – Flag indicating if validation is being performed.
- Returns:
- Loss scalar value.
- Statistics to be monitored.
- Weight value if not joint training; otherwise, the model outputs.
- Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
######### Examples
>>> text = torch.tensor([[1, 2, 3], [4, 5, 0]]) # Example text
>>> text_lengths = torch.tensor([3, 2]) # Lengths of each text
>>> feats = torch.rand(2, 10, 80) # Example features
>>> feats_lengths = torch.tensor([10, 10]) # Lengths of features
>>> output = model.forward(text, text_lengths, feats, feats_lengths)
>>> print(output)
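The score-related arguments are dictionaries keyed as documented above (“lab”/“score” for label and melody; “lab”, “score_phn”, “score_syb” for duration). The following is a hypothetical sketch of how such a padded batch might be assembled; values and shapes are illustrative only, and which keys a given recipe requires depends on its configuration.

import torch

B, Tmax = 2, 5  # assumed batch size and padded phoneme length
label = {"lab": torch.randint(1, 40, (B, Tmax)),
         "score": torch.randint(1, 40, (B, Tmax))}
label_lengths = {"lab": torch.tensor([5, 3]),
                 "score": torch.tensor([5, 3])}
melody = {"lab": torch.randint(0, 129, (B, Tmax)),    # MIDI note ids (midi_dim=129)
          "score": torch.randint(0, 129, (B, Tmax))}
duration = {"lab": torch.randint(1, 20, (B, Tmax)),
            "score_phn": torch.randint(1, 20, (B, Tmax)),
            "score_syb": torch.randint(1, 20, (B, Tmax))}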
inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, use_teacher_forcing: Tensor = False, joint_training: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequences of characters.
This method processes the input text and generates the corresponding features using the trained model. It can also utilize additional inputs like melodies, pitch, and duration if provided. The output includes the generated features and the duration sequence.
- Parameters:
- text (LongTensor) – Input sequence of characters (T_text,).
- feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).
- durations (Optional[LongTensor]) – Groundtruth of duration (T_text + 1,).
- label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
- melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- use_teacher_forcing (torch.Tensor) – Flag to use teacher forcing.
- joint_training (bool) – Whether to perform joint training with vocoder.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - duration (Tensor): Duration sequence (T_text + 1,).
- Return type: Dict[str, Tensor]
######### Examples
>>> model = XiaoiceSing(idim=128, odim=80)
>>> text = torch.LongTensor([1, 2, 3, 4])
>>> output = model.inference(text)
>>> output['feat_gen'].shape
torch.Size([T_feats, odim])
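A further hypothetical call passing the score-related dictionaries (keys follow the parameter descriptions above); values are illustrative only, and whether every key is needed depends on the recipe.

phn = torch.LongTensor([1, 2, 3, 4])
output = model.inference(
    text=phn,
    label={"lab": phn, "score": phn},
    melody={"lab": torch.LongTensor([60, 60, 62, 64]),
            "score": torch.LongTensor([60, 60, 62, 64])},
    duration={"lab": torch.LongTensor([3, 4, 2, 5]),
              "score_phn": torch.LongTensor([3, 4, 2, 5]),
              "score_syb": torch.LongTensor([7, 7, 7, 7])},
)
feat_gen, dur = output["feat_gen"], output["duration"]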