espnet2.svs.naive_rnn.naive_rnn.NaiveRNN
class espnet2.svs.naive_rnn.naive_rnn.NaiveRNN(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')
Bases: AbsSVS
NaiveRNN-SVS module.
This is an implementation of a naive RNN for singing voice synthesis. The model processes music score features directly along the time axis and predicts the singing voice features.
idim
Dimension of the label inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
midi_dim
Dimension of the midi inputs.
- Type: int
embed_dim
Dimension of the token embedding.
- Type: int
eprenet_conv_layers
Number of prenet conv layers.
- Type: int
eprenet_conv_chans
Number of prenet conv filter channels.
- Type: int
eprenet_conv_filts
Prenet conv filter size.
- Type: int
elayers
Number of encoder layers.
- Type: int
eunits
Number of encoder hidden units.
- Type: int
ebidirectional
Whether the encoder is bidirectional.
- Type: bool
midi_embed_integration_type
How to integrate midi information (“add” or “cat”).
- Type: str
dlayers
Number of decoder lstm layers.
- Type: int
dunits
Number of decoder lstm units.
- Type: int
dbidirectional
Whether the decoder is bidirectional.
- Type: bool
postnet_layers
Number of postnet layers.
- Type: int
postnet_chans
Number of postnet filter channels.
- Type: int
postnet_filts
Postnet filter size.
- Type: int
use_batch_norm
Whether to use batch normalization.
- Type: bool
reduction_factor
Reduction factor.
- Type: int
spks
Number of speakers.
- Type: Optional[int]
langs
Number of languages.
- Type: Optional[int]
spk_embed_dim
Speaker embedding dimension.
- Type: Optional[int]
spk_embed_integration_type
How to integrate speaker embedding.
- Type: str
eprenet_dropout_rate
Prenet dropout rate.
- Type: float
edropout_rate
Encoder dropout rate.
- Type: float
ddropout_rate
Decoder dropout rate.
- Type: float
postnet_dropout_rate
Postnet dropout rate.
- Type: float
init_type
How to initialize model parameters.
- Type: str
use_masking
Whether to mask padded part in loss calculation.
- Type: bool
use_weighted_masking
Whether to apply weighted masking in loss calculation.
- Type: bool
loss_type
Loss function type (“L1”, “L2”, or “L1+L2”).
- Type: str
Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- embed_dim (int) – Dimension of the token embedding.
- eprenet_conv_layers (int) – Number of prenet conv layers.
- eprenet_conv_filts (int) – Prenet conv filter size.
- eprenet_conv_chans (int) – Number of prenet conv filter channels.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- ebidirectional (bool) – Whether the encoder is bidirectional.
- midi_embed_integration_type (str) – How to integrate midi information (“add” or “cat”).
- dlayers (int) – Number of decoder lstm layers.
- dunits (int) – Number of decoder lstm units.
- dbidirectional (bool) – Whether the decoder is bidirectional.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Postnet filter size.
- postnet_chans (int) – Number of postnet filter channels.
- use_batch_norm (bool) – Whether to use batch normalization.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers.
- langs (Optional[int]) – Number of languages.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- eprenet_dropout_rate (float) – Prenet dropout rate.
- edropout_rate (float) – Encoder dropout rate.
- ddropout_rate (float) – Decoder dropout rate.
- postnet_dropout_rate (float) – Postnet dropout rate.
- init_type (str) – How to initialize model parameters.
- use_masking (bool) – Whether to mask padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
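The two `midi_embed_integration_type` options can be pictured on plain Python lists. The sketch below is illustrative only and is not ESPnet's code: `integrate` is a hypothetical helper, and it assumes that “add” sums the label and midi embeddings element-wise per frame while “cat” concatenates them along the feature dimension.

```python
from typing import List

Frame = List[float]  # one embedding vector for a single time step


def integrate(label_emb: List[Frame], midi_emb: List[Frame], mode: str = "add") -> List[Frame]:
    """Combine per-frame label and midi embeddings (hypothetical helper)."""
    if len(label_emb) != len(midi_emb):
        raise ValueError("label and midi sequences must have the same number of frames")
    if mode == "add":
        # Element-wise sum: both embeddings must share the same dimension.
        for a, b in zip(label_emb, midi_emb):
            if len(a) != len(b):
                raise ValueError("'add' requires equal embedding dimensions")
        return [[x + y for x, y in zip(a, b)] for a, b in zip(label_emb, midi_emb)]
    if mode == "cat":
        # Concatenation: the output dimension is the sum of both dimensions.
        return [a + b for a, b in zip(label_emb, midi_emb)]
    raise ValueError(f"unsupported integration type: {mode}")


label_emb = [[1.0, 2.0], [3.0, 4.0]]  # 2 frames, dimension 2
midi_emb = [[0.5, 0.5], [1.0, 1.0]]   # 2 frames, dimension 2

added = integrate(label_emb, midi_emb, "add")
concatenated = integrate(label_emb, midi_emb, "cat")
```

In this sketch “add” keeps the per-frame dimension unchanged, whereas “cat” doubles it, so whatever layer consumes the result would need a correspondingly wider input.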
####### Examples
Example usage:
>>> import torch
>>> from espnet2.svs.naive_rnn.naive_rnn import NaiveRNN
>>> naive_rnn = NaiveRNN(idim=80, odim=80)
>>> text = torch.randint(0, 10, (32, 50))
>>> text_lengths = torch.randint(1, 50, (32,))
>>> feats = torch.randn(32, 100, 80)
>>> feats_lengths = torch.randint(1, 100, (32,))
>>> output = naive_rnn(text, text_lengths, feats, feats_lengths)
Initialize NaiveRNN module.
- Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- embed_dim (int) – Dimension of the token embedding.
- eprenet_conv_layers (int) – Number of prenet conv layers.
- eprenet_conv_filts (int) – Prenet conv filter size.
- eprenet_conv_chans (int) – Number of prenet conv filter channels.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- ebidirectional (bool) – Whether the encoder is bidirectional.
- midi_embed_integration_type (str) – How to integrate midi information (“add” or “cat”).
- dlayers (int) – Number of decoder lstm layers.
- dunits (int) – Number of decoder lstm units.
- dbidirectional (bool) – Whether the decoder is bidirectional.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Postnet filter size.
- postnet_chans (int) – Number of postnet filter channels.
- use_batch_norm (bool) – Whether to use batch normalization.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use the sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use the lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- eprenet_dropout_rate (float) – Prenet dropout rate.
- edropout_rate (float) – Encoder dropout rate.
- ddropout_rate (float) – Decoder dropout rate.
- postnet_dropout_rate (float) – Postnet dropout rate.
- init_type (str) – How to initialize model parameters.
- use_masking (bool) – Whether to mask padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
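To illustrate what `use_masking` and `loss_type` control, here is a minimal pure-Python sketch of a loss that skips padded frames. It is an illustration under stated assumptions rather than the module's actual loss: `masked_loss` is a hypothetical name, and averaging over all valid elements is assumed.

```python
from typing import List

Sequence = List[List[float]]  # (frames, odim) for one utterance


def masked_loss(preds: List[Sequence], targets: List[Sequence],
                lengths: List[int], loss_type: str = "L1") -> float:
    """Average L1 and/or L2 error over non-padded frames only (hypothetical helper)."""
    if loss_type not in ("L1", "L2", "L1+L2"):
        raise ValueError(f"unknown loss_type: {loss_type}")
    l1_sum = l2_sum = 0.0
    count = 0
    for pred, target, length in zip(preds, targets, lengths):
        # Frames at index >= length are padding and are ignored.
        for p_frame, t_frame in zip(pred[:length], target[:length]):
            for p, t in zip(p_frame, t_frame):
                l1_sum += abs(p - t)
                l2_sum += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("no valid frames to average over")
    l1, l2 = l1_sum / count, l2_sum / count
    return {"L1": l1, "L2": l2, "L1+L2": l1 + l2}[loss_type]


# Batch of 2 utterances padded to 3 frames, odim = 1; true lengths are 3 and 1.
targets = [[[1.0], [2.0], [3.0]], [[1.0], [0.0], [0.0]]]
preds = [[[1.0], [2.0], [5.0]], [[2.0], [9.0], [9.0]]]
lengths = [3, 1]

l1 = masked_loss(preds, targets, lengths, "L1")
l2 = masked_loss(preds, targets, lengths, "L2")
both = masked_loss(preds, targets, lengths, "L1+L2")
```

The large errors on the padded frames of the second utterance (predictions of 9.0 against padding of 0.0) do not affect any of the three values, which is the point of masking.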
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, melody_lengths: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, duration: Dict[str, Tensor] | None = None, duration_lengths: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, slur_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, flag_IsValid=False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
This method computes the forward pass of the NaiveRNN model, taking the input text and various associated features to produce the output predictions. The forward method handles the integration of different input types, applies masking if necessary, and calculates the loss based on the predicted and target features.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of the lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- melody_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B,).
- pitch (Optional[FloatTensor]) – Batch of padded f0 (B, Tmax).
- pitch_lengths (Optional[LongTensor]) – Batch of the lengths of padded f0 (B,).
- duration (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded duration (B, Tmax).
- duration_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded duration (B,).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- slur_lengths (LongTensor) – Batch of the lengths of padded slur (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- flag_IsValid (bool) – Flag indicating if the operation is for validation.
- Returns: A tuple containing:
- Tensor: Loss scalar value.
- Dict: Statistics to be monitored.
- Tensor: Weight value if not joint training; otherwise the model outputs.
- Return type: Tuple[Tensor, Dict[str, torch.Tensor], torch.Tensor]
GS Fix: Arguments of the forward function versus the batch from espnet_model.py: label == durations | phone sequence; melody -> pitch sequence.
####### Examples
>>> model = NaiveRNN(...)
>>> loss, stats, weight = model.forward(text, text_lengths, feats, feats_lengths)
NOTE
Ensure that the input tensors are properly padded and have matching lengths for successful execution.
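The padding requirement in the note above can be sketched without any framework. The helpers below, `pad_batch` and `check_batch`, are hypothetical and assume a padding id of 0; they turn variable-length id sequences into a rectangular (B, Tmax) batch plus the (B,) lengths that accompany each padded input.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Right-pad sequences to the longest one and return (padded, lengths)."""
    if not seqs:
        raise ValueError("empty batch")
    lengths = [len(s) for s in seqs]
    t_max = max(lengths)
    padded = [s + [pad_id] * (t_max - len(s)) for s in seqs]
    return padded, lengths


def check_batch(padded: List[List[int]], lengths: List[int]) -> None:
    """Raise ValueError if the batch is not rectangular or a length is out of range."""
    if len(padded) != len(lengths):
        raise ValueError("batch size mismatch between padded data and lengths")
    t_max = len(padded[0])
    for row, length in zip(padded, lengths):
        if len(row) != t_max:
            raise ValueError("rows are not padded to a common length")
        if not 0 < length <= t_max:
            raise ValueError("length must be in the range 1..Tmax")


# Three label-id sequences of different lengths (values are made up).
label_ids = [[5, 3, 3, 7], [2, 2], [9, 1, 4]]
padded, lengths = pad_batch(label_ids)
check_batch(padded, lengths)  # passes silently for a well-formed batch
```

With a tensor library, the same padded lists and lengths would typically be converted to integer tensors before being passed to the model.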
inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, use_teacher_forcing: Tensor = False) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate forward propagation.
- Parameters:
- text (LongTensor) – Batch of padded character ids (Tmax).
- feats (Tensor) – Batch of padded target features (Lmax, odim).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
- pitch (FloatTensor) – Batch of padded f0 (Tmax).
- slur (LongTensor) – Batch of padded slur (Tmax).
- duration (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded duration (Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (1).
- lids (Optional[Tensor]) – Batch of language IDs (1).
- Returns: Output dict including the following items:
- feat_gen (Tensor): Output sequence of features (T_feats, odim).
- Return type: Dict[str, Tensor]