espnet2.svs.singing_tacotron.encoder.Duration_Encoder
class espnet2.svs.singing_tacotron.encoder.Duration_Encoder(idim, embed_dim=512, dropout_rate=0.5, padding_idx=0)
Bases: Module
Duration_Encoder module of Spectrogram prediction network.
This module is part of the Singing-Tacotron architecture. It converts a sequence of duration and tempo features into a transition token, which guides the attention transitions used in singing voice synthesis. The architecture follows the principles outlined in the paper `Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis`: https://arxiv.org/abs/2202.07907
idim
Dimension of the inputs.
Type: int
Parameters:
- idim (int) – Dimension of the inputs.
- embed_dim (int, optional) – Dimension of character embedding. Default is 512.
- dropout_rate (float, optional) – Dropout rate. Default is 0.5.
- padding_idx (int, optional) – Padding index for embedding. Default is 0.
######### Examples
>>> import torch
>>> from espnet2.svs.singing_tacotron.encoder import Duration_Encoder
>>> duration_encoder = Duration_Encoder(idim=10)
>>> input_tensor = torch.rand(2, 5, 10)  # batch of 2, Tmax=5, feature_len=10
>>> output = duration_encoder(input_tensor)
>>> print(output.shape)  # expected: torch.Size([2, 5, 1])
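For intuition only, a drastically simplified stand-in with the same interface might look like the sketch below. The layer sizes, dropout placement, and final sigmoid are illustrative assumptions, not the actual ESPnet implementation:
>>> import torch
>>> class ToyDurationEncoder(torch.nn.Module):
...     """Toy stand-in: maps (B, Tmax, idim) features to (B, Tmax, 1) tokens."""
...     def __init__(self, idim, embed_dim=512, dropout_rate=0.5):
...         super().__init__()
...         self.net = torch.nn.Sequential(
...             torch.nn.Linear(idim, embed_dim // 2),
...             torch.nn.Tanh(),
...             torch.nn.Dropout(dropout_rate),
...             torch.nn.Linear(embed_dim // 2, 1),
...             torch.nn.Sigmoid(),  # assumed: squashes tokens into (0, 1)
...         )
...     def forward(self, xs):
...         return self.net(xs)
>>> ToyDurationEncoder(idim=10)(torch.rand(2, 5, 10)).shape
torch.Size([2, 5, 1])
The sigmoid here reflects how the paper uses the transition token as a probability-like value steering attention transitions.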
Initialize Singing-Tacotron encoder module.
- Parameters:
- idim (int)
- embed_dim (int, optional)
- dropout_rate (float, optional)
- padding_idx (int, optional)
forward(xs)
Calculate forward propagation.
This method computes the forward pass of the Duration_Encoder module, transforming the input duration sequence into transition tokens.
Parameters: xs (Tensor) – Batch of the duration sequence with shape (B, Tmax, feature_len).
Returns: Batch of sequences of transition tokens with shape (B, Tmax, 1).
Return type: Tensor
######### Examples
>>> import torch
>>> encoder = Duration_Encoder(idim=10)
>>> duration_sequence = torch.rand(4, 5, 10)  # (B, Tmax, feature_len)
>>> output = encoder(duration_sequence)
>>> print(output.shape)
torch.Size([4, 5, 1])
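The paper treats the transition token as a probability-like value, so each element of the output is expected to fall in [0, 1]. Assuming the final activation is a squashing nonlinearity such as a sigmoid (an implementation detail this page does not state), a quick range check is:
>>> bool(((output >= 0) & (output <= 1)).all())
True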
inference(x)
Inference.
This method performs inference by processing a sequence of character IDs or acoustic features and returning the corresponding encoder states.
- Parameters: x (Tensor) – The sequence of character IDs (T,) or acoustic features (T, idim * encoder_reduction_factor).
- Returns: The sequences of encoder states (T, eunits).
- Return type: Tensor
######### Examples
>>> import torch
>>> encoder = Duration_Encoder(idim=10)
>>> x = torch.rand(4, 10)  # a single, unbatched feature sequence (T, idim)
>>> states = encoder.inference(x)
>>> print(states.shape)  # (T, eunits), per the return description above
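If, as the single-argument signature suggests, inference simply adds a batch dimension and delegates to forward (an assumption; this page does not confirm the internals), the relationship would look roughly like:
>>> batched = encoder(x.unsqueeze(0))  # forward on a batch of one: (1, T, 1)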