espnet2.gan_tts.vits.duration_predictor.StochasticDurationPredictor

class espnet2.gan_tts.vits.duration_predictor.StochasticDurationPredictor(channels: int = 192, kernel_size: int = 3, dropout_rate: float = 0.5, flows: int = 4, dds_conv_layers: int = 3, global_channels: int = -1)

Bases: Module

Stochastic duration predictor module.

This module implements the stochastic duration predictor described in "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech" (the VITS paper).

pre

Convolutional layer for preprocessing input.

Type: torch.nn.Conv1d

dds

Dilated depth separable convolution layer.

Type: DilatedDepthSeparableConv

proj

Convolutional layer for projecting features.

Type: torch.nn.Conv1d

log_flow

Flow layer that applies an elementwise log transform.

Type: LogFlow

flows

List of flow modules for processing.

Type: torch.nn.ModuleList

post_pre

Convolutional layer for post-processing input.

Type: torch.nn.Conv1d

post_dds

Post-processing dilated depth separable convolution layer.

Type: DilatedDepthSeparableConv

post_proj

Convolutional layer for projecting post-processed features.

Type: torch.nn.Conv1d

post_flows

List of post-processing flow modules.

Type: torch.nn.ModuleList

global_conv

Convolutional layer for global conditioning if global_channels > 0.

Type: torch.nn.Conv1d, optional

Parameters:
- channels (int) – Number of channels.
- kernel_size (int) – Kernel size.
- dropout_rate (float) – Dropout rate.
- flows (int) – Number of flows.
- dds_conv_layers (int) – Number of conv layers in DDS conv.
- global_channels (int) – Number of global conditioning channels. Global conditioning is enabled only if this value is greater than 0.

Examples:

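>>> import torch
>>> from espnet2.gan_tts.vits.duration_predictor import StochasticDurationPredictor
>>> predictor = StochasticDurationPredictor()
>>> x = torch.randn(2, 192, 50)  # Input features (B, channels, T_text)
>>> x_mask = torch.ones(2, 1, 50)  # Mask (B, 1, T_text)
>>> duration = torch.randint(1, 10, (2, 1, 50)).float()  # Positive durations (B, 1, T_text)
>>> nll = predictor(x, x_mask, w=duration)  # NLL with shape (B,)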

Raises: AssertionError – If inverse is False and w is None in the forward method.
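
The role of global_conv can be illustrated in plain PyTorch. This is a minimal sketch, assuming the conditioning vector is projected by a 1x1 convolution and added to the input with broadcasting over time. The sizes and the names `proj` and `conditioned` are hypothetical, and the snippet does not call ESPnet.

```python
import torch

# Hypothetical sizes chosen for illustration only.
batch, channels, global_channels, t_text = 2, 192, 256, 50

x = torch.randn(batch, channels, t_text)    # text-side hidden states
g = torch.randn(batch, global_channels, 1)  # e.g. a speaker embedding

# A 1x1 convolution maps the conditioning vector to the predictor's channel size.
# (Assumption: this mirrors what a global conditioning layer typically does.)
proj = torch.nn.Conv1d(global_channels, channels, kernel_size=1)

# The projected vector has a single time step, so it broadcasts over T_text.
conditioned = x + proj(g)
print(conditioned.shape)  # torch.Size([2, 192, 50])
```

Because the conditioning tensor has one time step, the same offset is added at every text position.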

forward(x: Tensor, x_mask: Tensor, w: Tensor | None = None, g: Tensor | None = None, inverse: bool = False, noise_scale: float = 1.0) → Tensor

Calculate forward propagation.

This method runs the Stochastic Duration Predictor in one of two directions. With inverse=False it returns the negative log-likelihood (NLL) of the given durations w. With inverse=True it ignores w and returns predicted log-durations, using noise_scale to scale the noise in the latent space.

Parameters:
- x (Tensor) – Input tensor with shape (B, channels, T_text).
- x_mask (Tensor) – Mask tensor with shape (B, 1, T_text).
- w (Optional[Tensor]) – Duration tensor with shape (B, 1, T_text). Required when inverse is False.
- g (Optional[Tensor]) – Global conditioning tensor with shape (B, global_channels, 1).
- inverse (bool) – Whether to perform the inverse operation on the flow. Defaults to False.
- noise_scale (float) – Scale for the noise added to the latent space. Defaults to 1.0.

Returns: If inverse is False, returns a negative log-likelihood (NLL) tensor with shape (B,). If inverse is True, returns a log-duration tensor with shape (B, 1, T_text).

Return type: Tensor
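
The x_mask argument marks which text positions are valid in a padded batch. A minimal sketch of building such a mask from sequence lengths follows. The helper `make_text_mask` is hypothetical and not part of ESPnet; it only shows the (B, 1, T_text) layout described above.

```python
import torch


def make_text_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Return a float mask of shape (B, 1, max_len) with 1.0 on valid positions."""
    positions = torch.arange(max_len).unsqueeze(0)  # (1, max_len)
    mask = positions < lengths.unsqueeze(1)         # (B, max_len), bool
    return mask.unsqueeze(1).float()                # (B, 1, max_len)


lengths = torch.tensor([3, 5])  # made-up text lengths for a batch of two
x_mask = make_text_mask(lengths, max_len=5)
print(x_mask.shape)  # torch.Size([2, 1, 5])
print(x_mask[0, 0])  # tensor([1., 1., 1., 0., 0.])
```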

Examples:

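>>> import torch
>>> from espnet2.gan_tts.vits.duration_predictor import StochasticDurationPredictor
>>> model = StochasticDurationPredictor()
>>> x = torch.randn(5, 192, 10)  # Input features (B, channels, T_text)
>>> x_mask = torch.ones(5, 1, 10)  # Mask (B, 1, T_text)
>>> w = torch.randint(1, 10, (5, 1, 10)).float()  # Positive durations (B, 1, T_text)
>>> nll = model(x, x_mask, w=w)  # inverse=False: NLL with shape (5,)
>>> logw = model(x, x_mask, inverse=True)  # inverse=True: log-durations with shape (5, 1, 10)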
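
With inverse=True the output is in the log domain, so it has to be converted before it can be used as frame counts. The sketch below shows one way to do that. Rounding up with ceil and the `length_scale` factor are assumptions based on common VITS-style usage, not something stated on this page, and `log_durations_to_frames` is a hypothetical helper. The `logw` tensor is a hand-made stand-in for the predictor output.

```python
import torch


def log_durations_to_frames(
    logw: torch.Tensor, x_mask: torch.Tensor, length_scale: float = 1.0
) -> torch.Tensor:
    """Convert masked log-durations (B, 1, T_text) to integer frame counts."""
    # Undo the log, zero out padded positions, then optionally stretch or shrink.
    w = torch.exp(logw) * x_mask * length_scale
    # Assumption: round up so every valid token gets at least one frame.
    return torch.ceil(w).long()


# Stand-in for `model(x, x_mask, inverse=True)`; the values are made up.
logw = torch.log(torch.tensor([[[1.5, 2.4, 0.3, 1.0]]]))
x_mask = torch.tensor([[[1.0, 1.0, 1.0, 0.0]]])  # last position is padding

frames = log_durations_to_frames(logw, x_mask)
print(frames.tolist())  # [[[2, 3, 1, 0]]]
```

A `length_scale` above 1.0 lengthens every token in this sketch, and a value below 1.0 shortens it.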