espnet2.svs.naive_rnn.naive_rnn.NaiveRNN

class espnet2.svs.naive_rnn.naive_rnn.NaiveRNN(idim: int, odim: int, midi_dim: int = 129, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, elayers: int = 3, eunits: int = 1024, ebidirectional: bool = True, midi_embed_integration_type: str = 'add', dlayers: int = 3, dunits: int = 1024, dbidirectional: bool = True, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, use_batch_norm: bool = True, reduction_factor: int = 1, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', eprenet_dropout_rate: float = 0.5, edropout_rate: float = 0.1, ddropout_rate: float = 0.1, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', use_masking: bool = False, use_weighted_masking: bool = False, loss_type: str = 'L1')

Bases: AbsSVS

NaiveRNN-SVS module.

This is an implementation of a naive RNN for singing voice synthesis (SVS). The model processes the music score features directly along the time axis and predicts the singing voice features.

idim

Dimension of the label inputs.

Type: int

odim

Dimension of the outputs.

Type: int

midi_dim

Dimension of the midi inputs.

Type: int

embed_dim

Dimension of the token embedding.

Type: int

eprenet_conv_layers

Number of prenet conv layers.

Type: int

eprenet_conv_chans

Number of prenet conv filter channels.

Type: int

eprenet_conv_filts

Prenet conv filter size.

Type: int

elayers

Number of encoder layers.

Type: int

eunits

Number of encoder hidden units.

Type: int

ebidirectional

Whether the encoder is bidirectional.

Type: bool

midi_embed_integration_type

How to integrate midi information (“add” or “cat”).

Type: str

dlayers

Number of decoder lstm layers.

Type: int

dunits

Number of decoder lstm units.

Type: int

dbidirectional

Whether the decoder is bidirectional.

Type: bool

postnet_layers

Number of postnet layers.

Type: int

postnet_chans

Number of postnet filter channels.

Type: int

postnet_filts

Postnet filter size.

Type: int

use_batch_norm

Whether to use batch normalization.

Type: bool

reduction_factor

Reduction factor.

Type: int

spks

Number of speakers.

Type: Optional[int]

langs

Number of languages.

Type: Optional[int]

spk_embed_dim

Speaker embedding dimension.

Type: Optional[int]

spk_embed_integration_type

How to integrate speaker embedding.

Type: str

eprenet_dropout_rate

Prenet dropout rate.

Type: float

edropout_rate

Encoder dropout rate.

Type: float

ddropout_rate

Decoder dropout rate.

Type: float

postnet_dropout_rate

Postnet dropout rate.

Type: float

init_type

How to initialize the model parameters.

Type: str

use_masking

Whether to mask padded part in loss calculation.

Type: bool

use_weighted_masking

Whether to apply weighted masking in loss calculation.

Type: bool

loss_type

Loss function type (“L1”, “L2”, or “L1+L2”).

Type: str
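
The interaction between use_masking and loss_type can be illustrated with a small, framework-free sketch. This is not the ESPnet implementation: `masked_loss` is a hypothetical helper, each frame is reduced to a single scalar instead of an odim-sized vector, and plain Python lists stand in for tensors. It only shows the idea that padded frames are excluded from the average when masking is on.

```python
from typing import List


def masked_loss(
    pred: List[List[float]],
    target: List[List[float]],
    lengths: List[int],
    loss_type: str = "L1",
    use_masking: bool = True,
) -> float:
    """Average L1 / L2 / L1+L2 error over a padded batch (illustration only).

    pred and target are (B, Lmax) nested lists; lengths holds the number of
    valid (non-padded) frames of each sequence.
    """
    if loss_type not in ("L1", "L2", "L1+L2"):
        raise ValueError(f"unknown loss_type: {loss_type}")
    l1_sum = l2_sum = 0.0
    count = 0
    for p_seq, t_seq, n in zip(pred, target, lengths):
        # With masking, only the first n frames of a sequence contribute.
        valid = n if use_masking else len(p_seq)
        for p, t in zip(p_seq[:valid], t_seq[:valid]):
            l1_sum += abs(p - t)
            l2_sum += (p - t) ** 2
            count += 1
    l1, l2 = l1_sum / count, l2_sum / count
    return {"L1": l1, "L2": l2, "L1+L2": l1 + l2}[loss_type]


# Two sequences padded to Lmax = 3; only 2 and 1 frames are real.
pred = [[1.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
target = [[1.0, 4.0, 9.0], [2.0, 9.0, 9.0]]
lengths = [2, 1]
print(masked_loss(pred, target, lengths, "L1", use_masking=True))
print(masked_loss(pred, target, lengths, "L1", use_masking=False))
```

Without masking, the padded positions dominate the error in this toy batch, which is the situation use_masking is meant to avoid.
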

####### Examples

Example usage:

>>> import torch
>>> from espnet2.svs.naive_rnn.naive_rnn import NaiveRNN
>>> naive_rnn = NaiveRNN(idim=80, odim=80)
>>> text = torch.randint(0, 10, (32, 50))
>>> text_lengths = torch.randint(1, 50, (32,))
>>> feats = torch.randn(32, 100, 80)
>>> feats_lengths = torch.randint(1, 100, (32,))
>>> output = naive_rnn(text, text_lengths, feats, feats_lengths)

Initialize NaiveRNN module.

Parameters:
- idim (int) – Dimension of the label inputs.
- odim (int) – Dimension of the outputs.
- midi_dim (int) – Dimension of the midi inputs.
- embed_dim (int) – Dimension of the token embedding.
- eprenet_conv_layers (int) – Number of prenet conv layers.
- eprenet_conv_filts (int) – Prenet conv filter size.
- eprenet_conv_chans (int) – Number of prenet conv filter channels.
- elayers (int) – Number of encoder layers.
- eunits (int) – Number of encoder hidden units.
- ebidirectional (bool) – Whether the encoder is bidirectional.
- midi_embed_integration_type (str) – How to integrate midi information (“add” or “cat”).
- dlayers (int) – Number of decoder lstm layers.
- dunits (int) – Number of decoder lstm units.
- dbidirectional (bool) – Whether the decoder is bidirectional.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Postnet filter size.
- postnet_chans (int) – Number of postnet filter channels.
- use_batch_norm (bool) – Whether to use batch normalization.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use the sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use the lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- eprenet_dropout_rate (float) – Prenet dropout rate.
- edropout_rate (float) – Encoder dropout rate.
- ddropout_rate (float) – Decoder dropout rate.
- postnet_dropout_rate (float) – Postnet dropout rate.
- init_type (str) – How to initialize the model parameters.
- use_masking (bool) – Whether to mask padded part in loss calculation.
- use_weighted_masking (bool) – Whether to apply weighted masking in loss calculation.
- loss_type (str) – Loss function type (“L1”, “L2”, or “L1+L2”).
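
The three extra-embedding options decide which optional inputs the model expects. The rule stated above can be sketched as a small helper; `required_extra_inputs` is a hypothetical function written for this illustration and is not part of ESPnet.

```python
from typing import List, Optional


def required_extra_inputs(
    spks: Optional[int] = None,
    langs: Optional[int] = None,
    spk_embed_dim: Optional[int] = None,
) -> List[str]:
    """Return the names of the optional inputs implied by the configuration."""
    needed = []
    if spks is not None and spks > 1:
        needed.append("sids")  # speaker IDs for the sid embedding layer
    if langs is not None and langs > 1:
        needed.append("lids")  # language IDs for the lid embedding layer
    if spk_embed_dim is not None and spk_embed_dim > 0:
        needed.append("spembs")  # precomputed speaker embeddings
    return needed


print(required_extra_inputs())                             # single speaker, single language
print(required_extra_inputs(spks=4, spk_embed_dim=256))    # multi-speaker with embeddings
```
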

forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, melody_lengths: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, duration: Dict[str, Tensor] | None = None, duration_lengths: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, slur_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, flag_IsValid=False) → Tuple[Tensor, Dict[str, Tensor], Tensor]

Calculate forward propagation.

This method computes the forward pass of the NaiveRNN model, taking the input text and various associated features to produce the output predictions. The forward method handles the integration of different input types, applies masking if necessary, and calculates the loss based on the predicted and target features.

Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of the lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- melody_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded melody (B,).
- pitch (Optional[FloatTensor]) – Batch of padded f0 (B, Tmax).
- pitch_lengths (Optional[LongTensor]) – Batch of the lengths of padded f0 (B,).
- duration (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded duration (B, Tmax).
- duration_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded duration (B,).
- slur (Optional[LongTensor]) – Batch of padded slur (B, Tmax).
- slur_lengths (Optional[LongTensor]) – Batch of the lengths of padded slur (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- flag_IsValid (bool) – Flag indicating if the operation is for validation.
Returns: A tuple containing:
- Tensor: Loss scalar value.
- Dict: Statistics to be monitored.
- Tensor: Weight value if not joint training; otherwise the model outputs.
Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]

GS Fix: Arguments of the forward function versus the batch from espnet_model.py:
- label == durations | phone sequence
- melody -> pitch sequence

####### Examples

>>> model = NaiveRNN(...)
>>> loss, stats, weight = model.forward(text, text_lengths, feats, feats_lengths)

NOTE

Ensure that the input tensors are properly padded and have matching lengths for successful execution.
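
What "properly padded" means for the (B, Tmax) inputs and their (B,) length vectors can be sketched without any framework. `pad_batch` is a hypothetical helper using plain lists, and the padding value 0 is an assumption made for this example; in practice the tensors would be built by the data loader.

```python
from typing import List, Tuple


def pad_batch(
    seqs: List[List[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Right-pad variable-length id sequences to the longest one.

    Returns the padded batch of shape (B, Tmax) and the original lengths (B,).
    """
    lengths = [len(s) for s in seqs]
    t_max = max(lengths)
    padded = [s + [pad_value] * (t_max - len(s)) for s in seqs]
    return padded, lengths


padded, lengths = pad_batch([[5, 3, 8], [2], [7, 1]])
print(padded)   # every row now has Tmax entries
print(lengths)  # true length of each row before padding
```

Each `*_lengths` argument must describe the same batch as its padded counterpart, so both are best produced by the same step.
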

Generate the sequence of features for a single, unbatched input (inference).

Parameters:
- text (LongTensor) – Batch of padded character ids (Tmax).
- feats (Tensor) – Batch of padded target features (Lmax, odim).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (Tmax).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (Tmax).
- pitch (FloatTensor) – Batch of padded f0 (Tmax).
- slur (LongTensor) – Batch of padded slur (B, Tmax).
- duration (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded duration (Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (1).
- lids (Optional[Tensor]) – Batch of language IDs (1).
Returns: Output dict including the following items:
- feat_gen (Tensor): Output sequence of features (T_feats, odim).
Return type: Dict[str, Tensor]