espnet2.svs.singing_tacotron.decoder.Decoder
class espnet2.svs.singing_tacotron.decoder.Decoder(idim, odim, att, dlayers=2, dunits=1024, prenet_layers=2, prenet_units=256, postnet_layers=5, postnet_chans=512, postnet_filts=5, output_activation_fn=None, cumulate_att_w=True, use_batch_norm=True, use_concate=True, dropout_rate=0.5, zoneout_rate=0.1, reduction_factor=1)
Bases: Module
Decoder module of Spectrogram prediction network.
This is the decoder of the spectrogram prediction network in Singing Tacotron, which is described in "Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis" (https://arxiv.org/pdf/2202.07907v1.pdf).
idim
Dimension of the inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
att
Instance of the attention class.
- Type: torch.nn.Module
output_activation_fn
Activation function for outputs.
- Type: torch.nn.Module or None
cumulate_att_w
Whether to cumulate previous attention weight.
- Type: bool
use_concate
Whether to concatenate encoder embedding with decoder LSTM outputs.
- Type: bool
reduction_factor
Reduction factor.
- Type: int
Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- att (torch.nn.Module) – Instance of attention class.
- dlayers (int , optional) – The number of decoder LSTM layers. Defaults to 2.
- dunits (int , optional) – The number of decoder LSTM units. Defaults to 1024.
- prenet_layers (int , optional) – The number of prenet layers. Defaults to 2.
- prenet_units (int , optional) – The number of prenet units. Defaults to 256.
- postnet_layers (int , optional) – The number of postnet layers. Defaults to 5.
- postnet_chans (int , optional) – The number of postnet filter channels. Defaults to 512.
- postnet_filts (int , optional) – The postnet filter size. Defaults to 5.
- output_activation_fn (torch.nn.Module , optional) – Activation function for outputs.
- cumulate_att_w (bool , optional) – Whether to cumulate previous attention weight. Defaults to True.
- use_batch_norm (bool , optional) – Whether to use batch normalization. Defaults to True.
- use_concate (bool , optional) – Whether to concatenate encoder embedding with decoder LSTM outputs. Defaults to True.
- dropout_rate (float , optional) – Dropout rate. Defaults to 0.5.
- zoneout_rate (float , optional) – Zoneout rate. Defaults to 0.1.
- reduction_factor (int , optional) – Reduction factor. Defaults to 1.
######### Examples
>>> decoder = Decoder(idim=80, odim=80, att=some_attention_instance)
>>> output, before_out, logits, att_ws = decoder(hs, hlens, trans_token, ys)
####### NOTE The forward computation is performed in a teacher-forcing manner.
- Raises: ValueError – If the dimensions of input tensors do not match the expected dimensions.
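To make the `reduction_factor` parameter more concrete: in Tacotron-style decoders, each decoder step typically emits `reduction_factor` frames at once, so the number of decoder steps is roughly the frame count divided by that factor. The following framework-free sketch shows only that bookkeeping. The helper name `group_frames` is hypothetical and not part of ESPnet, and dropping trailing frames that do not fill a group is an assumption of this sketch; the actual module may handle such lengths differently.

```python
def group_frames(frames, reduction_factor):
    """Group consecutive frames so each decoder step covers `reduction_factor` frames.

    Hypothetical helper for illustration only. Trailing frames that do not
    fill a complete group are dropped (an assumption of this sketch).
    """
    if reduction_factor < 1:
        raise ValueError("reduction_factor must be >= 1")
    usable = len(frames) - len(frames) % reduction_factor
    return [frames[i:i + reduction_factor] for i in range(0, usable, reduction_factor)]


# Seven frames of a toy 2-dimensional feature sequence.
frames = [[float(t), float(t) + 0.5] for t in range(7)]
steps = group_frames(frames, reduction_factor=2)
print(len(steps))  # number of decoder steps for 7 frames with factor 2
```

With `reduction_factor=1` (the default above), each step corresponds to exactly one output frame.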
Initialize Singing Tacotron decoder module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- att (torch.nn.Module) – Instance of attention class.
- dlayers (int , optional) – The number of decoder LSTM layers.
- dunits (int , optional) – The number of decoder LSTM units.
- prenet_layers (int , optional) – The number of prenet layers.
- prenet_units (int , optional) – The number of prenet units.
- postnet_layers (int , optional) – The number of postnet layers.
- postnet_filts (int , optional) – The postnet filter size.
- postnet_chans (int , optional) – The number of postnet filter channels.
- output_activation_fn (torch.nn.Module , optional) – Activation function for outputs.
- cumulate_att_w (bool , optional) – Whether to cumulate previous attention weight.
- use_batch_norm (bool , optional) – Whether to use batch normalization.
- use_concate (bool , optional) – Whether to concatenate encoder embedding with decoder LSTM outputs.
- dropout_rate (float , optional) – Dropout rate.
- zoneout_rate (float , optional) – Zoneout rate.
- reduction_factor (int , optional) – Reduction factor.
forward(hs, hlens, trans_token, ys)
Calculate forward propagation of the Singing Tacotron decoder.
Given sequences of encoder hidden states, this method generates the corresponding sequences of output features.
- Returns:
  - Tensor: Batch of output tensors after postnet (B, Lmax, odim).
  - Tensor: Batch of output tensors before postnet (B, Lmax, odim).
  - Tensor: Batch of logits of stop prediction (B, Lmax).
  - Tensor: Batch of attention weights (B, Lmax, Tmax).
- Return type: tuple of Tensor
####### NOTE This computation is performed in a teacher-forcing manner.
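Teacher forcing means that the input to each decoder step is the ground-truth previous frame from `ys`, not the decoder's own previous prediction. The toy sketch below shows how those step inputs line up with the targets. It assumes an all-zero initial frame, which is common in Tacotron-style decoders but is an assumption here; `teacher_forcing_inputs` is a hypothetical name, and plain lists stand in for tensors.

```python
def teacher_forcing_inputs(ys):
    """Build per-step decoder inputs from ground-truth frames.

    Step t receives frame t-1 of the target; step 0 receives a zero frame
    (an assumption of this sketch).
    """
    odim = len(ys[0])
    go_frame = [0.0] * odim
    return [go_frame] + [list(frame) for frame in ys[:-1]]


ys = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy targets with Lmax=3, odim=2
inputs = teacher_forcing_inputs(ys)
for step, (x, target) in enumerate(zip(inputs, ys)):
    print(step, x, "->", target)
```

Because every input comes from the reference features, all steps can be computed without waiting for earlier predictions, which is why this mode is used for training rather than generation.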
######### Examples
>>> decoder = Decoder(idim=80, odim=80, att=attention_instance)
>>> output, before_output, logits, att_weights = decoder.forward(hs, hlens, trans_token, ys)
inference(h, trans_token, threshold=0.5, minlenratio=0.0, maxlenratio=30.0, use_att_constraint=False, use_dynamic_filter=True, backward_window=1, forward_window=3)
Generate the sequence of features given the sequences of characters.
- Parameters:
- h (Tensor) – Input sequence of encoder hidden states (T, C).
- trans_token (Tensor) – Global transition token for duration.
- threshold (float , optional) – Threshold to stop generation.
- minlenratio (float , optional) – Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10.
- maxlenratio (float , optional) – Maximum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100.
- use_att_constraint (bool) – Whether to apply attention constraint introduced in Deep Voice 3.
- use_dynamic_filter (bool) – Whether to apply dynamic filter introduced in Singing Tacotron.
- backward_window (int) – Backward window size in attention constraint.
- forward_window (int) – Forward window size in attention constraint.
- Returns:
  - Tensor: Output sequence of features (L, odim).
  - Tensor: Output sequence of stop probabilities (L,).
  - Tensor: Attention weights (L, T).
- Return type: tuple of Tensor
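The `backward_window` and `forward_window` parameters can be pictured as follows: an attention constraint of the Deep Voice 3 kind masks out attention scores that lie outside a window around the most recently attended encoder position, then normalizes what remains. The sketch below is a pure-Python illustration under that reading. `constrain_and_normalize` and `last_attended_idx` are hypothetical names, and the exact window boundaries used by the library are an assumption.

```python
import math


def constrain_and_normalize(scores, last_attended_idx, backward_window=1, forward_window=3):
    """Mask scores outside a window around the last attended index, then apply softmax.

    Assumes the window contains at least one position.
    """
    lo = max(last_attended_idx - backward_window, 0)
    hi = min(last_attended_idx + forward_window, len(scores))
    masked = [s if lo <= i < hi else -math.inf for i, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # masked positions contribute 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.0] * 8  # uniform scores over 8 encoder positions
weights = constrain_and_normalize(scores, last_attended_idx=3)
print([round(w, 2) for w in weights])
```

With uniform scores, all attention mass ends up spread evenly over the positions inside the window, which shows how the constraint keeps the alignment from jumping far ahead or far back.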
####### NOTE This computation is performed in an autoregressive manner.
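How `threshold`, `minlenratio`, and `maxlenratio` interact in an autoregressive loop can be sketched as below. This is a simplified illustration, not the library's loop: `generated_length` is a hypothetical helper, the list `stop_probs` stands in for stop probabilities that a real decoder would compute step by step, and the reduction factor is ignored.

```python
def generated_length(stop_probs, input_length, threshold=0.5, minlenratio=0.0, maxlenratio=30.0):
    """Return how many steps an autoregressive loop would run under these stop rules."""
    minlen = int(input_length * minlenratio)
    maxlen = int(input_length * maxlenratio)
    steps = 0
    for prob in stop_probs:
        steps += 1
        if steps >= maxlen:
            break  # hard upper bound on the output length
        if prob >= threshold and steps >= minlen:
            break  # stop prediction fired and the minimum length is satisfied
    return steps


probs = [0.1, 0.7, 0.2, 0.9, 0.3, 0.1]
print(generated_length(probs, input_length=4))                    # stops at the first prob >= 0.5
print(generated_length(probs, input_length=4, minlenratio=1.0))   # early stop ignored until step 4
print(generated_length(probs, input_length=4, threshold=0.95, maxlenratio=0.75))  # capped by maxlen
```

Raising `minlenratio` guards against premature stops, while `maxlenratio` bounds generation when the stop probability never crosses `threshold`.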
######### Examples
>>> h = torch.randn(50, 256) # Example hidden states
>>> trans_token = torch.randn(50, 1) # Example transition token
>>> outs, probs, att_ws = decoder.inference(h, trans_token)