espnet2.svs.singing_tacotron.encoder.Duration_Encoder
class espnet2.svs.singing_tacotron.encoder.Duration_Encoder(idim, embed_dim=512, dropout_rate=0.5, padding_idx=0)
Bases: Module
Duration_Encoder module of Spectrogram prediction network.
This module is part of the Singing-Tacotron architecture. It converts a sequence of duration and tempo features into a transition token, which guides the attention transitions used in singing voice synthesis. The architecture follows the principles outlined in the paper `Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis`: https://arxiv.org/abs/2202.07907
idim
Dimension of the inputs.
Type: int
Parameters:
- idim (int) – Dimension of the inputs.
- embed_dim (int, optional) – Dimension of character embedding. Default is 512.
- dropout_rate (float, optional) – Dropout rate. Default is 0.5.
- padding_idx (int, optional) – Padding index for embedding. Default is 0.
######### Examples
>>> import torch
>>> from espnet2.svs.singing_tacotron.encoder import Duration_Encoder
>>> duration_encoder = Duration_Encoder(idim=10)
>>> input_tensor = torch.rand(2, 5, 10)  # batch of 2, Tmax=5, feature_len=10
>>> output = duration_encoder(input_tensor)
>>> print(output.shape)  # expected: torch.Size([2, 5, 1])
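For intuition only, a drastically simplified stand-in with the same interface might look like the sketch below. The layer sizes, dropout placement, and final sigmoid are illustrative assumptions, not the actual ESPnet implementation:
>>> import torch
>>> class ToyDurationEncoder(torch.nn.Module):
...     """Toy stand-in: maps (B, Tmax, idim) features to (B, Tmax, 1) tokens."""
...     def __init__(self, idim, embed_dim=512, dropout_rate=0.5):
...         super().__init__()
...         self.net = torch.nn.Sequential(
...             torch.nn.Linear(idim, embed_dim // 2),
...             torch.nn.Tanh(),
...             torch.nn.Dropout(dropout_rate),
...             torch.nn.Linear(embed_dim // 2, 1),
...             torch.nn.Sigmoid(),  # assumed: squashes tokens into (0, 1)
...         )
...     def forward(self, xs):
...         return self.net(xs)
>>> ToyDurationEncoder(idim=10)(torch.rand(2, 5, 10)).shape
torch.Size([2, 5, 1])
The sigmoid here reflects how the paper uses the transition token as a probability-like value steering attention transitions.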
Initialize Singing-Tacotron encoder module.
- Parameters:
- idim (int)
- embed_dim (int, optional)
- dropout_rate (float, optional)
- padding_idx (int, optional)
forward(xs)
Calculate forward propagation.
This method computes the forward pass of the Duration_Encoder module, transforming the input duration sequence into transition tokens.
Parameters: xs (Tensor) – Batch of the duration sequence with shape (B, Tmax, feature_len).
Returns: Batch of sequences of transition tokens with shape (B, Tmax, 1).
Return type: Tensor
######### Examples
>>> import torch
>>> encoder = Duration_Encoder(idim=10)
>>> duration_sequence = torch.rand(4, 5, 10)  # (B, Tmax, feature_len)
>>> output = encoder(duration_sequence)
>>> print(output.shape)
torch.Size([4, 5, 1])
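The paper treats the transition token as a probability-like value, so each element of the output is expected to fall in [0, 1]. Assuming the final activation is a squashing nonlinearity such as a sigmoid (an implementation detail this page does not state), a quick range check is:
>>> bool(((output >= 0) & (output <= 1)).all())
True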
inference(x)
Inference.
This method performs inference by processing a sequence of character IDs or acoustic features and returning the corresponding encoder states.
- Parameters: x (Tensor) – The sequence of character IDs (T,) or acoustic features (T, idim * encoder_reduction_factor).
- Returns: The sequences of encoder states (T, eunits).
- Return type: Tensor
######### Examples
>>> import torch
>>> encoder = Duration_Encoder(idim=10)
>>> x = torch.rand(4, 10)  # a single, unbatched feature sequence (T, idim)
>>> states = encoder.inference(x)
>>> print(states.shape)  # (T, eunits), per the return description above
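If, as the single-argument signature suggests, inference simply adds a batch dimension and delegates to forward (an assumption; this page does not confirm the internals), the relationship would look roughly like:
>>> batched = encoder(x.unsqueeze(0))  # forward on a batch of one: (1, T, 1)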