espnet2.svs.singing_tacotron.encoder.Encoder
class espnet2.svs.singing_tacotron.encoder.Encoder(idim, input_layer='embed', embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)
Bases: Module
Singing Tacotron encoder related modules.
This module contains the implementation of the Encoder class, which is part of the Spectrogram prediction network in Singing Tacotron. The encoder converts either a sequence of characters or acoustic features into a sequence of hidden states.
The encoder is designed based on the architecture described in `Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis` (https://arxiv.org/abs/2202.07907).
idim
Dimension of the inputs.
- Type: int
use_residual
Flag to indicate whether to use residual connections.
- Type: bool
Parameters:
- idim (int) – Dimension of the inputs.
- input_layer (str) – Type of input layer, either 'linear' or 'embed'.
- embed_dim (int , optional) – Dimension of character embedding. Defaults to 512.
- elayers (int , optional) – Number of encoder BLSTM layers. Defaults to 1.
- eunits (int , optional) – Number of encoder BLSTM units. Defaults to 512.
- econv_layers (int , optional) – Number of encoder convolutional layers. Defaults to 3.
- econv_chans (int , optional) – Number of encoder convolutional filter channels. Defaults to 512.
- econv_filts (int , optional) – Size of the encoder convolutional filters. Defaults to 5.
- use_batch_norm (bool , optional) – Whether to use batch normalization. Defaults to True.
- use_residual (bool , optional) – Whether to use residual connections. Defaults to False.
- dropout_rate (float , optional) – Dropout rate. Defaults to 0.5.
- padding_idx (int , optional) – Padding index for embeddings. Defaults to 0.
Returns: The output of the encoder after forward propagation.
Return type: Tensor
Raises: ValueError – If an unknown input_layer type is provided.
######### Examples
>>> encoder = Encoder(idim=128, input_layer='embed', embed_dim=256)
>>> xs, ilens = encoder(torch.randint(0, 128, (32, 100)), torch.tensor([100] * 32))
Initialize Singing Tacotron encoder module.
- Parameters:
- idim (int)
- input_layer (str) – Input layer type.
- embed_dim (int , optional)
- elayers (int , optional)
- eunits (int , optional)
- econv_layers (int , optional)
- econv_filts (int , optional)
- econv_chans (int , optional)
- use_batch_norm (bool , optional)
- use_residual (bool , optional)
- dropout_rate (float , optional)
- padding_idx (int , optional)
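The constructor parameters above map onto a conventional Tacotron-style stack: a character embedding, a stack of 1D convolutions with batch normalization and dropout, and a bidirectional LSTM. A minimal sketch in plain PyTorch (this is an illustrative stand-in, not the espnet2 implementation; `SketchEncoder` and its simplifications are hypothetical):

```python
# Minimal sketch of a Singing-Tacotron-style encoder (embed -> conv stack -> BLSTM).
# Not the actual espnet2 code; residual connections and the 'linear' input layer
# are omitted for brevity.
import torch
import torch.nn as nn


class SketchEncoder(nn.Module):
    def __init__(self, idim, embed_dim=512, elayers=1, eunits=512,
                 econv_layers=3, econv_chans=512, econv_filts=5,
                 dropout_rate=0.5, padding_idx=0):
        super().__init__()
        self.embed = nn.Embedding(idim, embed_dim, padding_idx=padding_idx)
        convs, in_chans = [], embed_dim
        for _ in range(econv_layers):
            convs += [
                nn.Conv1d(in_chans, econv_chans, econv_filts,
                          padding=(econv_filts - 1) // 2),
                nn.BatchNorm1d(econv_chans),
                nn.ReLU(),
                nn.Dropout(dropout_rate),
            ]
            in_chans = econv_chans
        self.convs = nn.Sequential(*convs)
        # Bidirectional LSTM: eunits // 2 units per direction -> eunits total.
        self.blstm = nn.LSTM(econv_chans, eunits // 2, elayers,
                             batch_first=True, bidirectional=True)

    def forward(self, xs, ilens):
        # (B, Tmax) ids -> (B, embed_dim, Tmax) for Conv1d
        h = self.embed(xs).transpose(1, 2)
        h = self.convs(h).transpose(1, 2)  # back to (B, Tmax, econv_chans)
        h = nn.utils.rnn.pack_padded_sequence(
            h, ilens, batch_first=True, enforce_sorted=False)
        h, _ = self.blstm(h)
        h, hlens = nn.utils.rnn.pad_packed_sequence(h, batch_first=True)
        return h, hlens


enc = SketchEncoder(idim=128, embed_dim=64, eunits=32, econv_chans=32)
xs = torch.randint(1, 128, (4, 10))
hs, hlens = enc(xs, torch.tensor([10, 8, 6, 5]))
print(hs.shape)  # torch.Size([4, 10, 32])
```

Packing the padded sequence lets the BLSTM skip the padded frames, which is why `ilens` is threaded through `forward`.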
forward(xs, ilens=None)
Calculate forward propagation.
This method converts a padded batch of character ids or acoustic features into a batch of encoder hidden state sequences.
Parameters:
- xs (Tensor) – Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0.
- ilens (LongTensor) – Batch of lengths of each input batch (B,).
Returns:
- Tensor: Batch of the sequences of encoder states (B, Tmax, eunits).
- LongTensor: Batch of lengths of each sequence (B,).
######### Examples
>>> encoder = Encoder(idim=256)
>>> input_tensor = torch.randint(0, 256, (10, 20)) # Example input
>>> lengths = torch.tensor([20] * 10) # All sequences of length 20
>>> states, seq_lengths = encoder(input_tensor, lengths)
inference(x, ilens)
Perform inference on the input sequence.
This method processes the input sequence, which can be either character IDs or acoustic features, and returns the corresponding encoder states.
- Parameters:
- x (Tensor) – The sequence of character IDs (T,) or acoustic features (T, idim * encoder_reduction_factor).
- ilens (LongTensor) – Lengths of the input sequences (B,).
- Returns: The sequences of encoder states (T, eunits).
- Return type: Tensor
######### Examples
>>> encoder = Encoder(idim=256)
>>> character_ids = torch.tensor([1, 2, 3, 0]) # Example IDs
>>> ilens = torch.tensor([3]) # Length of input
>>> states = encoder.inference(character_ids, ilens)
>>> print(states.shape) # Should output: torch.Size([T, eunits])
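A common Tacotron-style convention, which this sketch illustrates, is for `inference` to wrap the single utterance in a batch of one and reuse `forward`. The snippet below is a self-contained stand-in (the `forward` body here is a dummy, not the espnet2 implementation):

```python
# Hedged sketch: inference adds a batch dimension, calls forward, then strips it.
import torch

EUNITS = 8  # hypothetical encoder output dimension


def forward(xs, ilens):
    # Dummy stand-in for Encoder.forward: returns (B, Tmax, EUNITS) states.
    B, Tmax = xs.shape
    return torch.zeros(B, Tmax, EUNITS), ilens


def inference(x, ilens):
    # (T,) -> (1, T): wrap in a batch of one, then drop the batch axis.
    hs, _ = forward(x.unsqueeze(0), ilens)
    return hs[0]


states = inference(torch.tensor([1, 2, 3, 0]), torch.tensor([4]))
print(states.shape)  # torch.Size([4, 8])
```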