espnet2.svs.singing_tacotron.decoder.Decoder


class espnet2.svs.singing_tacotron.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)

Bases: Module

Decoder module of Spectrogram prediction network.

This is the decoder module of the spectrogram prediction network in Singing Tacotron, which is described in "Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis" (https://arxiv.org/pdf/2202.07907v1.pdf).

idim

Dimension of the inputs.

Type: int

odim

Dimension of the outputs.

Type: int

att

Instance of the attention class.

Type: torch.nn.Module

output_activation_fn

Activation function for outputs.

Type: torch.nn.Module or None

cumulate_att_w

Whether to cumulate previous attention weight.

Type: bool

use_concate

Whether to concatenate encoder embedding with decoder LSTM outputs.

Type: bool

reduction_factor

Reduction factor.

Type: int

Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- att (torch.nn.Module) – Instance of attention class.
- dlayers (int, optional) – The number of decoder LSTM layers. Defaults to 2.
- dunits (int, optional) – The number of decoder LSTM units. Defaults to 1024.
- prenet_layers (int, optional) – The number of prenet layers. Defaults to 2.
- prenet_units (int, optional) – The number of prenet units. Defaults to 256.
- postnet_layers (int, optional) – The number of postnet layers. Defaults to 5.
- postnet_chans (int, optional) – The number of postnet filter channels. Defaults to 512.
- postnet_filts (int, optional) – The postnet filter (kernel) size. Defaults to 5.
- output_activation_fn (torch.nn.Module, optional) – Activation function for outputs. Defaults to None.
- cumulate_att_w (bool, optional) – Whether to accumulate previous attention weights. Defaults to True.
- use_batch_norm (bool, optional) – Whether to use batch normalization. Defaults to True.
- use_concate (bool, optional) – Whether to concatenate encoder embedding with decoder LSTM outputs. Defaults to True.
- dropout_rate (float, optional) – Dropout rate. Defaults to 0.5.
- zoneout_rate (float, optional) – Zoneout rate. Defaults to 0.1.
- reduction_factor (int, optional) – Reduction factor, i.e. the number of output frames predicted per decoder step. Defaults to 1.
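
To make the effect of `reduction_factor` concrete, the sketch below assumes the usual Tacotron-style meaning (several output frames emitted per decoder step). The helpers `decoder_steps` and `group_frames` are hypothetical names for illustration only and are not part of ESPnet; how the actual module treats lengths that are not a multiple of the factor is not stated on this page.

```python
import math


def decoder_steps(num_frames: int, reduction_factor: int) -> int:
    """Decoder steps needed to emit `num_frames` frames (illustrative helper)."""
    if reduction_factor < 1:
        raise ValueError("reduction_factor must be >= 1")
    return math.ceil(num_frames / reduction_factor)


def group_frames(frames, reduction_factor):
    """Split frames into per-step groups of up to `reduction_factor` frames."""
    return [
        frames[i:i + reduction_factor]
        for i in range(0, len(frames), reduction_factor)
    ]


frames = list(range(10))          # stand-ins for 10 spectrogram frames
print(decoder_steps(len(frames), 1))
print(decoder_steps(len(frames), 2))
print(group_frames(frames, 4))
```

A larger factor shortens the decoding loop at the cost of predicting several frames from a single decoder state.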

######### Examples

>>> decoder = Decoder(idim=80, odim=80, att=some_attention_instance)
>>> output, before_out, logits, att_ws = decoder(hs, hlens, trans_token, ys)

####### NOTE

The forward computation is performed in a teacher-forcing manner.

Raises: ValueError – If the dimensions of input tensors do not match the expected dimensions.
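
The `cumulate_att_w` option can be pictured with a small pure-Python sketch. It assumes that the attention history carried to the next step is either the running sum of all previous attention weights (option on) or only the most recent weights (option off). The function name `next_prev_att_w` is hypothetical, and plain lists stand in for tensors.

```python
def next_prev_att_w(prev_att_w, att_w, cumulate_att_w=True):
    """Return the attention history carried into the next decoder step."""
    if cumulate_att_w and prev_att_w is not None:
        # Running sum over all steps so far.
        return [p + a for p, a in zip(prev_att_w, att_w)]
    # No history yet, or accumulation disabled: keep only the latest weights.
    return list(att_w)


# Attention weights over 3 encoder positions for 3 decoder steps.
steps = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.25, 0.75]]

cumulated, latest = None, None
for att_w in steps:
    cumulated = next_prev_att_w(cumulated, att_w, cumulate_att_w=True)
    latest = next_prev_att_w(latest, att_w, cumulate_att_w=False)

print(cumulated)
print(latest)
```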

Initialize Singing Tacotron decoder module.


forward(hs, hlens, trans_token, ys)

Singing Tacotron decoder related modules.

This module implements the Decoder class for the Singing Tacotron model, which generates sequences of features from sequences of hidden states.

Returns:
- Tensor – Batch of output tensors after postnet (B, Lmax, odim).
- Tensor – Batch of output tensors before postnet (B, Lmax, odim).
- Tensor – Batch of logits of stop prediction (B, Lmax).
- Tensor – Batch of attention weights (B, Lmax, Tmax).
Return type: tuple of Tensor

####### NOTE

This computation is performed in a teacher-forcing manner.
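
As a rough illustration of teacher forcing, the sketch below builds the per-step decoder inputs from the ground-truth frames: at each step the decoder sees the previous target frame rather than its own prediction. The all-zero initial frame is an assumption (a common Tacotron convention), and `teacher_forcing_inputs` is a hypothetical helper, not an ESPnet function.

```python
def teacher_forcing_inputs(ys, odim):
    """Build per-step decoder inputs from ground-truth frames.

    ys is a list of L target frames, each a list of `odim` floats.
    Step t receives target frame t-1; step 0 receives an all-zero frame.
    """
    go_frame = [0.0] * odim  # assumed initial input
    return [go_frame] + [list(y) for y in ys[:-1]]


ys = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # three target frames, odim=2
inputs = teacher_forcing_inputs(ys, odim=2)
for step, (x, y) in enumerate(zip(inputs, ys)):
    print(step, "input:", x, "target:", y)
```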

######### Examples

>>> decoder = Decoder(idim=80, odim=80, att=attention_instance)
>>> output, before_output, logits, att_weights = decoder.forward(hs, hlens, trans_token, ys)

inference(h, trans_token, threshold=0.5, minlenratio=0.0, maxlenratio=30.0, use_att_constraint=False, use_dynamic_filter=True, backward_window=1, forward_window=3)

Generate the sequence of features given the sequences of characters.

Parameters:
- h (Tensor) – Input sequence of encoder hidden states (T, C).
- trans_token (Tensor) – Global transition token for duration.
- threshold (float, optional) – Threshold on the stop probability used to stop generation. Defaults to 0.5.
- minlenratio (float, optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10. Defaults to 0.0.
- maxlenratio (float, optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100. Defaults to 30.0.
- use_att_constraint (bool, optional) – Whether to apply attention constraint introduced in Deep Voice 3. Defaults to False.
- use_dynamic_filter (bool, optional) – Whether to apply dynamic filter introduced in Singing Tacotron. Defaults to True.
- backward_window (int, optional) – Backward window size in attention constraint. Defaults to 1.
- forward_window (int, optional) – Forward window size in attention constraint. Defaults to 3.
Returns:
- Tensor – Output sequence of features (L, odim).
- Tensor – Output sequence of stop probabilities (L,).
- Tensor – Attention weights (L, T).
Return type: tuple of Tensor

####### NOTE

This computation is performed in an autoregressive manner.
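
The interaction of `threshold`, `minlenratio`, and `maxlenratio` can be sketched as a simple stopping rule over a sequence of stop probabilities. This is a simplified reading of the parameter descriptions above, assuming one frame per step (`reduction_factor=1`). The function `generated_length` is hypothetical and does not reproduce the module's actual loop.

```python
def generated_length(stop_probs, input_length, threshold=0.5,
                     minlenratio=0.0, maxlenratio=30.0):
    """Return how many steps an autoregressive loop would run (sketch)."""
    minlen = int(input_length * minlenratio)
    maxlen = int(input_length * maxlenratio)
    steps = 0
    for steps, prob in enumerate(stop_probs, start=1):
        if steps < minlen:
            continue  # too short to stop, regardless of the stop probability
        if prob >= threshold or steps >= maxlen:
            break     # stop token fired, or the length cap was reached
    return steps


probs = [0.1, 0.2, 0.9, 0.1]
print(generated_length(probs, input_length=10))                   # stops on 0.9
print(generated_length(probs, input_length=10, minlenratio=0.4))  # forced longer
print(generated_length(probs, input_length=10, maxlenratio=0.2))  # capped early
```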

######### Examples

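>>> import torch
>>> # decoder is assumed to be an already constructed Decoder with idim=256
>>> h = torch.randn(50, 256)  # Example hidden states (T=50, C=256)
>>> trans_token = torch.randn(50, 1)  # Example transition token
>>> outs, probs, att_ws = decoder.inference(h, trans_token)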
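
The `use_att_constraint`, `backward_window`, and `forward_window` options of `inference` describe a windowed attention constraint. A minimal pure-Python sketch of that idea follows: energies outside a window around the last attended position are masked before the softmax. The exact window boundaries used here (inclusive start, exclusive end) are an assumption, and `constrained_attention` is a hypothetical helper rather than the module's implementation.

```python
import math


def constrained_attention(energies, last_attended_idx,
                          backward_window=1, forward_window=3):
    """Softmax over energies restricted to a window around the last position."""
    lo = max(last_attended_idx - backward_window, 0)
    hi = min(last_attended_idx + forward_window, len(energies))
    # Positions outside [lo, hi) get -inf, so they receive zero weight.
    masked = [e if lo <= i < hi else -math.inf
              for i, e in enumerate(energies)]
    peak = max(masked)                      # subtract max for numerical stability
    exps = [math.exp(e - peak) for e in masked]
    total = sum(exps)
    return [x / total for x in exps]


weights = constrained_attention([0.0] * 8, last_attended_idx=4)
print(weights)
```

Restricting attention in this way keeps the alignment close to monotonic during generation, which is the motivation given for the constraint in Deep Voice 3.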