espnet2.svs.naive_rnn.naive_rnn_dp.NaiveRNNDP
class espnet2.svs.naive_rnn.naive_rnn_dp.NaiveRNNDP(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, duration_dim: int = 500, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, duration_predictor_layers: int = 2, duration_predictor_chans: int = 384, duration_predictor_kernel_size: int = 3, duration_predictor_dropout_rate: float = 0.1, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False)
Bases: AbsSVS
NaiveRNNDP-SVS module.
This class implements a naive RNN with duration prediction for singing voice synthesis (SVS). The features are processed directly over the time domain from music scores to predict the singing voice features.
idim
Dimension of the label inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
midi_dim
Dimension of the MIDI inputs.
- Type: int
duration_dim
Dimension of the duration inputs.
- Type: int
eunits
Number of encoder hidden units.
- Type: int
reduction_factor
Reduction factor for output features.
- Type: int
midi_embed_integration_type
Method for integrating MIDI information.
- Type: str
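The effect of reduction_factor on target lengths can be illustrated with a small sketch. It assumes the convention, common in ESPnet-style acoustic models but not confirmed by this page, that target feature lengths are trimmed down to a multiple of the reduction factor. The helper name trim_to_reduction_factor is hypothetical.

```python
def trim_to_reduction_factor(lengths, reduction_factor):
    """Trim each frame length down to the nearest multiple of reduction_factor.

    Hypothetical helper for illustration; not part of the NaiveRNNDP API.
    """
    if reduction_factor < 1:
        raise ValueError("reduction_factor must be >= 1")
    return [n - n % reduction_factor for n in lengths]


feats_lengths = [100, 97, 50]
print(trim_to_reduction_factor(feats_lengths, 3))  # [99, 96, 48]
print(trim_to_reduction_factor(feats_lengths, 1))  # [100, 97, 50]
```

With the default reduction_factor of 1, the lengths are unchanged.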
Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the MIDI inputs.
- embed_dim (int) – Dimension of the token embedding.
- eprenet_conv_layers (int) – Number of prenet convolution layers.
- eprenet_conv_chans (int) – Number of prenet convolution filter channels.
- eprenet_conv_filts (int) – Filter size of the prenet convolution layers.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- ebidirectional (bool) – If True, use a bidirectional encoder.
- midi_embed_integration_type (str) – How to integrate MIDI information (“add” or “cat”).
- dlayers (int) – Number of decoder LSTM layers.
- dunits (int) – Number of decoder LSTM units.
- dbidirectional (bool) – If True, use a bidirectional decoder.
- postnet_layers (int) – Number of postnet layers.
- postnet_chans (int) – Number of postnet filter channels.
- postnet_filts (int) – Filter size of the postnet layers.
- use_batch_norm (bool) – Whether to use batch normalization.
- duration_predictor_layers (int) – Number of duration predictor layers.
- duration_predictor_chans (int) – Number of duration predictor channels.
- duration_predictor_kernel_size (int) – Kernel size of the duration predictor.
- duration_predictor_dropout_rate (float) – Dropout rate in the duration predictor.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If > 1, assume speaker IDs will be provided.
- langs (Optional[int]) – Number of languages. If > 1, assume language IDs will be provided.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If > 0, assume speaker embeddings will be provided.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- eprenet_dropout_rate (float) – Prenet dropout rate.
- edropout_rate (float) – Encoder dropout rate.
- ddropout_rate (float) – Decoder dropout rate.
- postnet_dropout_rate (float) – Postnet dropout rate.
- init_type (str) – Method for initializing parameters.
- use_masking (bool) – Whether to mask padded parts in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
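The two values of midi_embed_integration_type differ in the shape they produce. The sketch below uses plain Python lists in place of tensors and a hypothetical helper named integrate. It only illustrates the difference between summing and concatenating the label and MIDI embeddings, and does not reproduce any projection layer the model may apply afterwards.

```python
def integrate(label_emb, midi_emb, integration_type="add"):
    """Combine per-token label and MIDI embeddings (lists of float vectors).

    Hypothetical helper for illustration; not part of the NaiveRNNDP API.
    """
    if len(label_emb) != len(midi_emb):
        raise ValueError("both sequences must have the same number of tokens")
    if integration_type == "add":
        # Element-wise sum: the feature size stays at embed_dim.
        return [[a + b for a, b in zip(l, m)] for l, m in zip(label_emb, midi_emb)]
    if integration_type == "cat":
        # Concatenation: the feature size grows to 2 * embed_dim.
        return [l + m for l, m in zip(label_emb, midi_emb)]
    raise ValueError(f"unsupported integration type: {integration_type}")


label_emb = [[1.0, 2.0], [3.0, 4.0]]  # 2 tokens, embed_dim = 2
midi_emb = [[0.5, 0.5], [1.0, 1.0]]

added = integrate(label_emb, midi_emb, "add")
concatenated = integrate(label_emb, midi_emb, "cat")
print(len(added[0]), len(concatenated[0]))  # 2 4
```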
######### Examples
Initialize the model:
>>> import torch
>>> from espnet2.svs.naive_rnn.naive_rnn_dp import NaiveRNNDP
>>> model = NaiveRNNDP(idim=256, odim=80, midi_dim=129)
Forward pass with dummy data:
>>> text = torch.randint(0, 256, (8, 50))  # (batch_size, text_length)
>>> text_lengths = torch.tensor([50] * 8)  # all sequences have length 50
>>> feats = torch.randn(8, 100, 80)  # (batch_size, max_feat_length, odim)
>>> feats_lengths = torch.tensor([100] * 8)  # all sequences have length 100
>>> loss, stats, outputs = model(text, text_lengths, feats, feats_lengths)
Inference:
>>> output = model.inference(text)
NOTE
This implementation is part of the ESPnet framework and is designed for singing voice synthesis tasks.
Initialize NaiveRNNDP module.
- Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the MIDI inputs.
- embed_dim (int) – Dimension of the token embedding.
- eprenet_conv_layers (int) – Number of prenet conv layers.
- eprenet_conv_filts (int) – Filter size of the prenet conv layers.
- eprenet_conv_chans (int) – Number of prenet conv filter channels.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- ebidirectional (bool) – If True, use a bidirectional encoder.
- midi_embed_integration_type (str) – How to integrate MIDI information (“add” or “cat”).
- dlayers (int) – Number of decoder LSTM layers.
- dunits (int) – Number of decoder LSTM units.
- dbidirectional (bool) – If True, use a bidirectional decoder.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Filter size of the postnet layers.
- postnet_chans (int) – Number of postnet filter channels.
- use_batch_norm (bool) – Whether to use batch normalization.
- reduction_factor (int) – Reduction factor.
- duration_predictor_layers (int) – Number of duration predictor layers.
- duration_predictor_chans (int) – Number of duration predictor channels.
- duration_predictor_kernel_size (int) – Kernel size of duration predictor.
- duration_predictor_dropout_rate (float) – Dropout rate in duration predictor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use a sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use a lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- eprenet_dropout_rate (float) – Prenet dropout rate.
- edropout_rate (float) – Encoder dropout rate.
- ddropout_rate (float) – Decoder dropout rate.
- postnet_dropout_rate (float) – Postnet dropout rate.
- init_type (str) – How to initialize the model parameters.
- use_masking (bool) – Whether to mask padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
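What use_masking means for the loss can be sketched as follows: frames beyond each sequence's true length are excluded before averaging. This is a minimal illustration on nested Python lists, assuming an L1 criterion. The helper names make_non_pad_mask and masked_l1 are hypothetical here, and the model's actual loss module may differ.

```python
def make_non_pad_mask(lengths, max_len=None):
    """Return a (B, Lmax) boolean mask that is True on real (non-padded) frames."""
    max_len = max(lengths) if max_len is None else max_len
    return [[t < n for t in range(max_len)] for n in lengths]


def masked_l1(pred, target, lengths):
    """Mean absolute error computed over non-padded frames only."""
    mask = make_non_pad_mask(lengths, max_len=len(pred[0]))
    total, count = 0.0, 0
    for b, row in enumerate(mask):
        for t, keep in enumerate(row):
            if not keep:
                continue  # skip padded frames entirely
            for p, y in zip(pred[b][t], target[b][t]):
                total += abs(p - y)
                count += 1
    return total / count


# Batch of 2 sequences, Lmax = 3, odim = 1; true lengths are 2 and 1.
pred = [[[1.0], [2.0], [9.0]], [[1.0], [0.0], [0.0]]]
target = [[[0.0], [2.0], [0.0]], [[1.0], [5.0], [5.0]]]
loss = masked_l1(pred, target, lengths=[2, 1])
```

Only three frames contribute to the loss in this example; the large errors on the padded frames are ignored.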
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, melody_lengths: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, duration: Dict[str, Tensor] | None = None, duration_lengths: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, slur_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False, flag_IsValid=False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
This method performs the forward pass of the NaiveRNNDP model, taking the input text, feature data, and various optional inputs to compute the model’s outputs and loss values. The forward pass includes encoding of inputs, duration prediction, length regulation, and final output generation.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of the lengths of each input sequence (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- melody_lengths (Optional[Dict]) – key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B,).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- pitch_lengths (LongTensor) – Batch of the lengths of padded f0 (B, ).
- duration (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
- duration_lengths (Optional[Dict]) – key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of the lengths of padded duration (B,).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- slur_lengths (LongTensor) – Batch of the lengths of padded slur (B, ).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- flag_IsValid (bool) – Flag indicating whether it’s validation stage.
- Returns:
- Tensor: Loss scalar value.
- Dict: Statistics to be monitored.
- Tensor: Weight value if not joint training else model outputs.
- Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
GS Fix: arguments of the forward function vs. the batch from espnet_model.py:
- label == durations | phone sequence
- melody -> pitch sequence
######### Examples
>>> text = torch.randint(0, 100, (32, 50))  # batch size 32, max length 50
>>> text_lengths = torch.randint(1, 51, (32,))
>>> feats = torch.randn(32, 100, 80)  # batch size 32, max length 100, 80 features
>>> feats_lengths = torch.randint(1, 101, (32,))
>>> outputs = model.forward(text, text_lengths, feats, feats_lengths)
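The length-regulation step named in the forward description can be sketched in plain Python. It assumes the usual duration-based scheme, where each token-level hidden vector is repeated for as many frames as its duration. The function length_regulate is a hypothetical stand-in, not the module the model actually uses.

```python
def length_regulate(hidden, durations):
    """Expand token-level vectors to frame level by repeating each one.

    hidden:    list of T token-level vectors
    durations: list of T non-negative integer frame counts
    Returns a list with sum(durations) frame-level vectors.
    """
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")
    frames = []
    for vec, dur in zip(hidden, durations):
        # A duration of 0 drops the token from the frame-level sequence.
        frames.extend(list(vec) for _ in range(dur))
    return frames


hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]
frames = length_regulate(hidden, durations)
print(len(frames))  # 5
```

The output length equals the sum of the durations, which is how a score of Tmax tokens is stretched to the Lmax frames of the target features.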
inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, joint_training: bool = False, use_teacher_forcing: Tensor = False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation for inference.
This method processes the input data to generate singing voice features based on the provided text and other parameters. It uses the trained model to produce outputs in an inference mode.
- Parameters:
- text (LongTensor) – Batch of padded character ids (Tmax).
- feats (Tensor) – Batch of padded target features (Lmax, odim).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
- pitch (FloatTensor) – Batch of padded f0 (Tmax).
- duration (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (Tmax).
- slur (LongTensor) – Batch of padded slur (Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (1).
- lids (Optional[Tensor]) – Batch of language IDs (1).
- joint_training (bool) – Whether to perform joint training with vocoder.
- use_teacher_forcing (torch.Tensor) – Flag indicating if teacher forcing is used during inference.
- Returns: Output dict including the following items:
- feat_gen (Tensor): Output sequence of features (T_feats, odim).
- Return type: Dict[str, Tensor]
######### Examples
>>> text = torch.tensor([[1, 2, 3], [1, 2, 0]])
>>> feats = torch.rand(5, 256)
>>> output = model.inference(text, feats)
>>> print(output['feat_gen'].shape)
torch.Size([5, odim])
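At inference time, durations may come from the duration predictor rather than from the score. The sketch below shows one common way such predictions are turned into integer frame counts. It assumes, without confirmation from this page, that the predictor works in the log domain with an offset of 1, as FastSpeech-style duration predictors typically do. The helper log_durations_to_frames is hypothetical.

```python
import math


def log_durations_to_frames(log_durations, offset=1.0):
    """Convert log-domain duration predictions to non-negative integer frame counts.

    Assumption: predictions approximate log(duration + offset).
    """
    return [max(0, round(math.exp(x) - offset)) for x in log_durations]


predicted = [math.log(4.0), math.log(1.0), -5.0]
print(log_durations_to_frames(predicted))  # [3, 0, 0]
```

Clamping at zero keeps very small predictions from producing negative frame counts.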