espnet2.gan_svs.vits.duration_predictor.DurationPredictor

class espnet2.gan_svs.vits.duration_predictor.DurationPredictor(channels, filter_channels, kernel_size, dropout_rate, global_channels=0)

Bases: Module

DurationPredictor is a module that predicts durations for audio signals in the VISinger framework.

This class utilizes convolutional layers and normalization to process input features and predict the duration of each time step in the sequence. It can optionally take global conditioning inputs, which can be used for multi-singer applications.

in_channels

Number of input channels.

Type: int

filter_channels

Number of filter channels for convolutional layers.

Type: int

kernel_size

Size of the convolutional kernel.

Type: int

dropout_rate

Rate at which to drop units during training.

Type: float
Parameters:
- channels (int) – Number of input channels.
- filter_channels (int) – Number of filter channels.
- kernel_size (int) – Size of the convolutional kernel.
- dropout_rate (float) – Dropout rate.
- global_channels (int, optional) – Number of global conditioning channels. Defaults to 0.

Examples

>>> import torch
>>> from espnet2.gan_svs.vits.duration_predictor import DurationPredictor
>>> predictor = DurationPredictor(128, 256, 3, 0.1)
>>> x = torch.randn(32, 128, 100)  # Input of shape (B, channels, T)
>>> x_mask = torch.ones(32, 1, 100)  # All ones: no padding
>>> output = predictor(x, x_mask)
>>> print(output.shape)
torch.Size([32, 2, 100])

NOTE

This module is designed for use within the ESPnet framework, specifically for singing voice synthesis (SVS) tasks. Pass x with shape (B, channels, T) together with an x_mask of shape (B, 1, T) that is zero at padded time steps, so that padding does not influence the predictions.
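As a minimal sketch of the masking mentioned in the note, a mask of shape (B, 1, T) can be built from per-sequence lengths in plain PyTorch. The helper name `lengths_to_mask` is hypothetical and is not part of ESPnet; the shapes and values are chosen only for illustration.

```python
import torch


def lengths_to_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Return a float mask of shape (B, 1, T): 1.0 for valid steps, 0.0 for padding.

    Hypothetical helper for illustration only; not an ESPnet function.
    """
    positions = torch.arange(max_len, device=lengths.device)  # (T,)
    valid = positions.unsqueeze(0) < lengths.unsqueeze(1)  # (B, T), bool
    return valid.unsqueeze(1).to(torch.float32)  # (B, 1, T)


lengths = torch.tensor([5, 3])  # second sequence has 2 padded steps
x_mask = lengths_to_mask(lengths, max_len=5)
x = torch.randn(2, 4, 5)  # (B, channels, T)
x_masked = x * x_mask  # the mask broadcasts over the channel axis
```

Multiplying by the mask zeroes the padded time steps of the shorter sequence while leaving valid steps unchanged.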

forward(x, x_mask, g=None)

Forward pass through the duration predictor module.

This method passes the input tensor through a series of convolutional layers with layer normalization and dropout, optionally adding global conditioning, and returns the predicted duration tensor.
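The kind of processing described above can be sketched with standard PyTorch layers. This is an illustration under stated assumptions, not the ESPnet implementation: the class name `ConvNormBlock`, the ordering of the operations inside it, and the use of a single stage are all assumptions made for the example.

```python
import torch


class ConvNormBlock(torch.nn.Module):
    """Hypothetical Conv1d -> ReLU -> LayerNorm -> Dropout stage (illustration only)."""

    def __init__(self, in_channels, out_channels, kernel_size, dropout_rate):
        super().__init__()
        # Padding of kernel_size // 2 keeps the length T unchanged for odd kernels.
        self.conv = torch.nn.Conv1d(
            in_channels, out_channels, kernel_size, padding=kernel_size // 2
        )
        self.norm = torch.nn.LayerNorm(out_channels)
        self.drop = torch.nn.Dropout(dropout_rate)

    def forward(self, x, x_mask):
        # Zero out padded steps before the convolution.
        x = torch.relu(self.conv(x * x_mask))
        # torch.nn.LayerNorm normalizes the last dim, so move channels there and back.
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        return self.drop(x)


block = ConvNormBlock(128, 256, kernel_size=3, dropout_rate=0.1)
proj = torch.nn.Conv1d(256, 2, kernel_size=1)  # project features to 2 output channels
x = torch.randn(4, 128, 50)  # (B, channels, T)
x_mask = torch.ones(4, 1, 50)  # (B, 1, T)
out = proj(block(x, x_mask) * x_mask) * x_mask  # (B, 2, T)
```

The final 1x1 convolution maps the filter channels to two output channels, which matches the documented output shape (B, 2, T).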

Parameters:
- x (Tensor) – Input tensor of shape (B, in_channels, T), where B is the batch size, in_channels is the number of input channels, and T is the sequence length.
- x_mask (Tensor) – Mask tensor of shape (B, 1, T) used to mask out padding or irrelevant parts of the input during processing.
- g (Tensor, optional) – Global condition tensor of shape (B, global_channels, 1) used for multi-singer scenarios. If provided, it is passed through a convolutional layer and the result is added to the input tensor. Defaults to None.
Returns: Predicted duration tensor of shape (B, 2, T), where the second dimension (of size 2) holds the predicted duration values for each time step.
Return type: Tensor

Examples

>>> import torch
>>> from espnet2.gan_svs.vits.duration_predictor import DurationPredictor
>>> duration_predictor = DurationPredictor(channels=256,
...                                        filter_channels=512,
...                                        kernel_size=3,
...                                        dropout_rate=0.1)
>>> x = torch.randn(10, 256, 100)  # Batch of 10, 256 channels, length 100
>>> x_mask = torch.ones(10, 1, 100)  # No masking
>>> output = duration_predictor(x, x_mask)
>>> print(output.shape)
torch.Size([10, 2, 100])
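The optional global conditioning input `g` of shape (B, global_channels, 1) can be illustrated in isolation. The sketch below assumes a 1x1 convolution as the projection layer and uses the hypothetical name `cond_proj`; it shows only how a per-utterance tensor broadcasts over the time axis, not the module's actual internals.

```python
import torch

B, channels, global_channels, T = 2, 8, 4, 10

# Hypothetical projection layer: maps the conditioning tensor to the input channel count.
cond_proj = torch.nn.Conv1d(global_channels, channels, kernel_size=1)

x = torch.randn(B, channels, T)  # input features
g = torch.randn(B, global_channels, 1)  # one conditioning vector per utterance

g_proj = cond_proj(g)  # (B, channels, 1)
x_cond = x + g_proj  # broadcasts over the time axis -> (B, channels, T)
```

Because `g` has length 1 along the time axis, the same projected vector is added at every time step of the corresponding sequence.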