espnet2.svs.xiaoice.XiaoiceSing.XiaoiceSing
class espnet2.svs.xiaoice.XiaoiceSing.XiaoiceSing(idim: int, odim: int, midi_dim: int = 129, duration_dim: int = 500, adim: int = 384, aheads: int = 4, elayers: int = 6, eunits: int = 1536, dlayers: int = 6, dunits: int = 1536, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, postnet_dropout_rate: float = 0.5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, encoder_type: str = 'transformer', decoder_type: str = 'transformer', transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False, loss_function: str = 'XiaoiceSing2', loss_type: str = 'L1', lambda_mel: float = 1, lambda_dur: float = 0.1, lambda_pitch: float = 0.01, lambda_vuv: float = 0.01)
Bases: AbsSVS
XiaoiceSing module for Singing Voice Synthesis.
This module implements a high-quality singing voice synthesis system, utilizing an integrated network for spectrum, F0, and duration modeling. It follows the main architecture of FastSpeech while incorporating several singing-specific design features:
- Incorporation of musical score features (e.g., note pitch and length).
- Residual connections in F0 prediction to mitigate off-key issues.
- Accumulation of phoneme durations within a musical note to compute a syllable duration loss, which improves rhythm (see the sketch below).
For more information, refer to the paper: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System: https://arxiv.org/pdf/2006.06261.pdf
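To make the syllable-loss idea concrete, here is a minimal, hypothetical sketch (not the module's actual implementation): predicted phoneme durations are summed within each syllable/note and compared against the target syllable durations with an L1 criterion. The function name and shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def syllable_duration_loss(pred_phn_dur, syllable_ids, target_syb_dur):
    # pred_phn_dur:   (T_phn,) predicted duration per phoneme, in frames
    # syllable_ids:   (T_phn,) index of the syllable/note each phoneme belongs to
    # target_syb_dur: (N_syb,) target duration per syllable, in frames
    n_syb = target_syb_dur.size(0)
    # Accumulate phoneme durations that share the same syllable index.
    pred_syb_dur = torch.zeros(n_syb).index_add_(0, syllable_ids, pred_phn_dur)
    return F.l1_loss(pred_syb_dur, target_syb_dur)

# Toy check: two syllables containing two and three phonemes, respectively.
pred = torch.tensor([3.0, 4.0, 2.0, 2.0, 3.0])
ids = torch.tensor([0, 0, 1, 1, 1])
target = torch.tensor([7.0, 8.0])
print(syllable_duration_loss(pred, ids, target))  # tensor(0.5000)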
idim
Dimension of the label inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
midi_dim
Dimension of the midi inputs.
- Type: int
duration_dim
Dimension of the duration inputs.
- Type: int
eos
End-of-sequence token index.
- Type: int
reduction_factor
Reduction factor for the model outputs.
- Type: int
encoder_type
Type of encoder (“transformer” or “conformer”).
- Type: str
decoder_type
Type of decoder (“transformer” or “conformer”).
- Type: str
use_scaled_pos_enc
Flag indicating whether to use scaled pos encoding.
- Type: bool
loss_function
Selected loss function (“FastSpeech1” or “XiaoiceSing2”).
- Type: str
loss_type
Type of mel loss (“L1”, “L2”, or “L1+L2”).
- Type: str
lambda_mel
Scaling coefficient for Mel loss.
- Type: float
lambda_dur
Scaling coefficient for duration loss.
- Type: float
lambda_pitch
Scaling coefficient for pitch loss.
- Type: float
lambda_vuv
Scaling coefficient for VUV loss.
- Type: float
Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- duration_dim (int) – Dimension of the duration inputs.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Kernel size of postnet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention input and output in decoder.
- duration_predictor_layers (int) – Number of duration predictor layers.
- duration_predictor_chans (int) – Number of duration predictor channels.
- duration_predictor_kernel_size (int) – Kernel size of duration predictor.
- duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
- reduction_factor (int) – Reduction factor.
- encoder_type (str) – Encoder type (“transformer” or “conformer”).
- decoder_type (str) – Decoder type (“transformer” or “conformer”).
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- spks (Optional[int]) – Number of speakers (if > 1, use sid embedding).
- langs (Optional[int]) – Number of languages (if > 1, use lid embedding).
- spk_embed_dim (Optional[int]) – Speaker embedding dimension (if > 0, use spembs).
- spk_embed_integration_type (str) – Method to integrate speaker embedding.
- init_type (str) – Parameter initialization method.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding for encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding for decoder.
- use_masking (bool) – Whether to apply masking for padded parts in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_function (str) – Loss functions (“FastSpeech1” or “XiaoiceSing2”).
- loss_type (str) – Mel loss type (“L1”, “L2”, or “L1+L2”).
- lambda_mel (float) – Loss scaling coefficient for Mel loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_pitch (float) – Loss scaling coefficient for pitch loss.
- lambda_vuv (float) – Loss scaling coefficient for VUV loss.
######### Examples
Initialize the XiaoiceSing module:

xiaoice_sing = XiaoiceSing(idim=40, odim=80)

Perform forward propagation:

loss, stats, output = xiaoice_sing.forward(
    text=text_tensor,
    text_lengths=text_lengths_tensor,
    feats=features_tensor,
    feats_lengths=features_lengths_tensor,
    label=label_dict,
    label_lengths=label_lengths_dict,
    melody=melody_dict,
    melody_lengths=melody_lengths_dict,
    pitch=pitch_tensor,
    pitch_lengths=pitch_lengths_tensor,
    duration=duration_dict,
    duration_lengths=duration_lengths_dict,
    slur=slur_tensor,
    slur_lengths=slur_lengths_tensor,
    spembs=speaker_embeddings_tensor,
    sids=speaker_ids_tensor,
    lids=language_ids_tensor,
    joint_training=False,
)
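For intuition, the documented mel-loss options map to standard criteria: “L1” is mean absolute error, “L2” is mean squared error, and “L1+L2” combines both. Below is a minimal sketch under that reading (not the module's internal code path); the weighted total with the lambda_* coefficients is likewise shown only as an assumption.

import torch.nn.functional as F

def mel_loss(pred, target, loss_type="L1"):
    # "L1" = MAE, "L2" = MSE, "L1+L2" = sum of both (sketch only).
    l1 = F.l1_loss(pred, target)
    l2 = F.mse_loss(pred, target)
    return {"L1": l1, "L2": l2, "L1+L2": l1 + l2}[loss_type]

# Assumed overall weighting (see the lambda_* parameters above):
# loss = lambda_mel * mel + lambda_dur * dur + lambda_pitch * pitch + lambda_vuv * vuv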
Initialize XiaoiceSing module. An example instantiation follows the parameter list below.
- Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- duration_dim (int) – Dimension of the duration inputs.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- dlayers (int) – Number of decoder layers.
- dunits (int) – Number of decoder hidden units.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet channels.
- postnet_filts (int) – Kernel size of postnet.
- postnet_dropout_rate (float) – Dropout rate in postnet.
- use_scaled_pos_enc (bool) – Whether to use trainable scaled pos encoding.
- use_batch_norm (bool) – Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool) – Whether to apply layernorm layer before encoder block.
- decoder_normalize_before (bool) – Whether to apply layernorm layer before decoder block.
- encoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in encoder.
- decoder_concat_after (bool) – Whether to concatenate attention layer’s input and output in decoder.
- duration_predictor_layers (int) – Number of duration predictor layers.
- duration_predictor_chans (int) – Number of duration predictor channels.
- duration_predictor_kernel_size (int) – Kernel size of duration predictor.
- duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
- reduction_factor (int) – Reduction factor.
- encoder_type (str) – Encoder type (“transformer” or “conformer”).
- decoder_type (str) – Decoder type (“transformer” or “conformer”).
- transformer_enc_dropout_rate (float) – Dropout rate in encoder except attention and positional encoding.
- transformer_enc_positional_dropout_rate (float) – Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float) – Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float) – Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float) – Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float) – Dropout rate in decoder self-attention module.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- init_type (str) – How to initialize transformer parameters.
- init_enc_alpha (float) – Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float) – Initial value of alpha in scaled pos encoding of the decoder.
- use_masking (bool) – Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_function (str) – Loss functions (“FastSpeech1” or “XiaoiceSing2”)
- loss_type (str) – Mel loss type (“L1” (MAE), “L2” (MSE) or “L1+L2”)
- lambda_mel (float) – Loss scaling coefficient for Mel loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_pitch (float) – Loss scaling coefficient for pitch loss.
- lambda_vuv (float) – Loss scaling coefficient for VUV loss.
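As a reference, here is a hypothetical instantiation that uses only parameters from the signature above; the dimensions are placeholders rather than recipe defaults.

from espnet2.svs.xiaoice.XiaoiceSing import XiaoiceSing

model = XiaoiceSing(
    idim=40,                      # label input dimension (placeholder)
    odim=80,                      # output feature dimension (placeholder)
    encoder_type="conformer",     # documented alternative to "transformer"
    decoder_type="conformer",
    spk_embed_dim=192,            # expect external speaker embeddings (spembs)
    spk_embed_integration_type="add",
    loss_function="FastSpeech1",
    loss_type="L1+L2",
)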
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, melody_lengths: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, duration: Dict[str, Tensor] | None = None, duration_lengths: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, slur_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False, flag_IsValid=False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
This method performs forward propagation through the XiaoiceSing model. It processes the input text, features, and various optional parameters to compute the output and loss values.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, T_text).
- text_lengths (LongTensor) – Batch of lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B,).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B,).
- duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
- duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B,).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- slur_lengths (LongTensor) – Batch of the lengths of padded slur (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- flag_IsValid (bool) – Flag indicating if validation is being performed.
- Returns:
- Loss scalar value.
- Statistics to be monitored.
- Weight value if not joint training; otherwise, the model outputs.
- Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
######### Examples
>>> text = torch.tensor([[1, 2, 3], [4, 5, 0]]) # Example text
>>> text_lengths = torch.tensor([3, 2]) # Lengths of each text
>>> feats = torch.rand(2, 10, 80) # Example features
>>> feats_lengths = torch.tensor([10, 10]) # Lengths of features
>>> output = model.forward(text, text_lengths, feats, feats_lengths)
>>> print(output)
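The score-related arguments are dictionaries keyed as documented above (“lab”/“score” for label and melody; “lab”, “score_phn”, “score_syb” for duration). The following is a hypothetical sketch of how such a padded batch might be assembled; values and shapes are illustrative only, and which keys a given recipe requires depends on its configuration.

import torch

B, Tmax = 2, 5  # assumed batch size and padded phoneme length
label = {"lab": torch.randint(1, 40, (B, Tmax)),
         "score": torch.randint(1, 40, (B, Tmax))}
label_lengths = {"lab": torch.tensor([5, 3]),
                 "score": torch.tensor([5, 3])}
melody = {"lab": torch.randint(0, 129, (B, Tmax)),    # MIDI note ids (midi_dim=129)
          "score": torch.randint(0, 129, (B, Tmax))}
duration = {"lab": torch.randint(1, 20, (B, Tmax)),
            "score_phn": torch.randint(1, 20, (B, Tmax)),
            "score_syb": torch.randint(1, 20, (B, Tmax))}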
inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, use_teacher_forcing: Tensor = False, joint_training: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequences of characters.
This method processes the input text and generates the corresponding features using the trained model. It can also utilize additional inputs like melodies, pitch, and duration if provided. The output includes the generated features and the duration sequence.
- Parameters:
- text (LongTensor) – Input sequence of characters (T_text,).
- feats (Optional[Tensor]) – Feature sequence to extract style (N, idim).
- durations (Optional[LongTensor]) – Groundtruth of duration (T_text + 1,).
- label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
- melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- use_teacher_forcing (torch.Tensor) – Flag to use teacher forcing.
- joint_training (bool) – Whether to perform joint training with vocoder.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - duration (Tensor): Duration sequence (T_text + 1,).
- Return type: Dict[str, Tensor]
######### Examples
>>> model = XiaoiceSing(idim=128, odim=80)
>>> text = torch.LongTensor([1, 2, 3, 4])
>>> output = model.inference(text)
>>> output['feat_gen'].shape
torch.Size([T_feats, odim])
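A further hypothetical call passing the score-related dictionaries (keys follow the parameter descriptions above); values are illustrative only, and whether every key is needed depends on the recipe.

phn = torch.LongTensor([1, 2, 3, 4])
output = model.inference(
    text=phn,
    label={"lab": phn, "score": phn},
    melody={"lab": torch.LongTensor([60, 60, 62, 64]),
            "score": torch.LongTensor([60, 60, 62, 64])},
    duration={"lab": torch.LongTensor([3, 4, 2, 5]),
              "score_phn": torch.LongTensor([3, 4, 2, 5]),
              "score_syb": torch.LongTensor([7, 7, 7, 7])},
)
feat_gen, dur = output["feat_gen"], output["duration"]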