espnet2.svs.singing_tacotron.encoder.Encoder
class espnet2.svs.singing_tacotron.encoder.Encoder(idim, input_layer='embed', embed_dim=512, elayers=1, eunits=512, econv_layers=3, econv_chans=512, econv_filts=5, use_batch_norm=True, use_residual=False, dropout_rate=0.5, padding_idx=0)
Bases: Module
Singing Tacotron encoder related modules.
This module contains the implementation of the Encoder class, which is part of the Spectrogram prediction network in Singing Tacotron. The encoder converts either a sequence of characters or acoustic features into a sequence of hidden states.
The encoder is designed based on the architecture described in `Singing-Tacotron: Global Duration Control Attention and Dynamic Filter for End-to-end Singing Voice Synthesis` (https://arxiv.org/abs/2202.07907).
idim
Dimension of the inputs.
- Type: int
use_residual
Flag to indicate whether to use residual connections.
- Type: bool
Parameters:
- idim (int) – Dimension of the inputs.
- input_layer (str) – Type of input layer, either 'linear' or 'embed'.
- embed_dim (int , optional) – Dimension of character embedding. Defaults to 512.
- elayers (int , optional) – Number of encoder BLSTM layers. Defaults to 1.
- eunits (int , optional) – Number of encoder BLSTM units. Defaults to 512.
- econv_layers (int , optional) – Number of encoder convolutional layers. Defaults to 3.
- econv_chans (int , optional) – Number of encoder convolutional filter channels. Defaults to 512.
- econv_filts (int , optional) – Size of the encoder convolutional filters. Defaults to 5.
- use_batch_norm (bool , optional) – Whether to use batch normalization. Defaults to True.
- use_residual (bool , optional) – Whether to use residual connections. Defaults to False.
- dropout_rate (float , optional) – Dropout rate. Defaults to 0.5.
- padding_idx (int , optional) – Padding index for embeddings. Defaults to 0.
Returns: The output of the encoder after forward propagation.
Return type: Tensor
Raises: ValueError – If an unknown input_layer type is provided.
######### Examples
>>> encoder = Encoder(idim=128, input_layer='embed', embed_dim=256)
>>> xs, ilens = encoder(torch.randint(0, 128, (32, 100)), torch.tensor([100] * 32))
Initialize Singing Tacotron encoder module.
- Parameters:
- idim (int)
- input_layer (str) – Input layer type.
- embed_dim (int , optional)
- elayers (int , optional)
- eunits (int , optional)
- econv_layers (int , optional)
- econv_filts (int , optional)
- econv_chans (int , optional)
- use_batch_norm (bool , optional)
- use_residual (bool , optional)
- dropout_rate (float , optional)
- padding_idx (int , optional)
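The constructor parameters above map onto a conventional Tacotron-style stack: a character embedding, a stack of 1D convolutions with batch normalization and dropout, and a bidirectional LSTM. A minimal sketch in plain PyTorch (this is an illustrative stand-in, not the espnet2 implementation; `SketchEncoder` and its simplifications are hypothetical):

```python
# Minimal sketch of a Singing-Tacotron-style encoder (embed -> conv stack -> BLSTM).
# Not the actual espnet2 code; residual connections and the 'linear' input layer
# are omitted for brevity.
import torch
import torch.nn as nn


class SketchEncoder(nn.Module):
    def __init__(self, idim, embed_dim=512, elayers=1, eunits=512,
                 econv_layers=3, econv_chans=512, econv_filts=5,
                 dropout_rate=0.5, padding_idx=0):
        super().__init__()
        self.embed = nn.Embedding(idim, embed_dim, padding_idx=padding_idx)
        convs, in_chans = [], embed_dim
        for _ in range(econv_layers):
            convs += [
                nn.Conv1d(in_chans, econv_chans, econv_filts,
                          padding=(econv_filts - 1) // 2),
                nn.BatchNorm1d(econv_chans),
                nn.ReLU(),
                nn.Dropout(dropout_rate),
            ]
            in_chans = econv_chans
        self.convs = nn.Sequential(*convs)
        # Bidirectional LSTM: eunits // 2 units per direction -> eunits total.
        self.blstm = nn.LSTM(econv_chans, eunits // 2, elayers,
                             batch_first=True, bidirectional=True)

    def forward(self, xs, ilens):
        # (B, Tmax) ids -> (B, embed_dim, Tmax) for Conv1d
        h = self.embed(xs).transpose(1, 2)
        h = self.convs(h).transpose(1, 2)  # back to (B, Tmax, econv_chans)
        h = nn.utils.rnn.pack_padded_sequence(
            h, ilens, batch_first=True, enforce_sorted=False)
        h, _ = self.blstm(h)
        h, hlens = nn.utils.rnn.pad_packed_sequence(h, batch_first=True)
        return h, hlens


enc = SketchEncoder(idim=128, embed_dim=64, eunits=32, econv_chans=32)
xs = torch.randint(1, 128, (4, 10))
hs, hlens = enc(xs, torch.tensor([10, 8, 6, 5]))
print(hs.shape)  # torch.Size([4, 10, 32])
```

Packing the padded sequence lets the BLSTM skip the padded frames, which is why `ilens` is threaded through `forward`.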
forward(xs, ilens=None)
Calculate forward propagation.
This method converts a padded batch of character ids or acoustic features into a batch of encoder hidden state sequences.
Parameters:
- xs (Tensor) – Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0.
- ilens (LongTensor) – Batch of lengths of each input batch (B,).
Returns:
- Tensor: Batch of the sequences of encoder states (B, Tmax, eunits).
- LongTensor: Batch of lengths of each sequence (B,).
######### Examples
>>> encoder = Encoder(idim=256)
>>> input_tensor = torch.randint(0, 256, (10, 20)) # Example input
>>> lengths = torch.tensor([20] * 10) # All sequences of length 20
>>> states, seq_lengths = encoder(input_tensor, lengths)
inference(x, ilens)
Perform inference on the input sequence.
This method processes the input sequence, which can be either character IDs or acoustic features, and returns the corresponding encoder states.
- Parameters:
- x (Tensor) – The sequence of character IDs (T,) or acoustic features (T, idim * encoder_reduction_factor).
- ilens (LongTensor) – Lengths of the input sequences (B,).
- Returns: The sequences of encoder states (T, eunits).
- Return type: Tensor
######### Examples
>>> encoder = Encoder(idim=256)
>>> character_ids = torch.tensor([1, 2, 3, 0]) # Example IDs
>>> ilens = torch.tensor([3]) # Length of input
>>> states = encoder.inference(character_ids, ilens)
>>> print(states.shape) # Should output: torch.Size([T, eunits])
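A common Tacotron-style convention, which this sketch illustrates, is for `inference` to wrap the single utterance in a batch of one and reuse `forward`. The snippet below is a self-contained stand-in (the `forward` body here is a dummy, not the espnet2 implementation):

```python
# Hedged sketch: inference adds a batch dimension, calls forward, then strips it.
import torch

EUNITS = 8  # hypothetical encoder output dimension


def forward(xs, ilens):
    # Dummy stand-in for Encoder.forward: returns (B, Tmax, EUNITS) states.
    B, Tmax = xs.shape
    return torch.zeros(B, Tmax, EUNITS), ilens


def inference(x, ilens):
    # (T,) -> (1, T): wrap in a batch of one, then drop the batch axis.
    hs, _ = forward(x.unsqueeze(0), ilens)
    return hs[0]


states = inference(torch.tensor([1, 2, 3, 0]), torch.tensor([4]))
print(states.shape)  # torch.Size([4, 8])
```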