espnet2.tts.fastspeech2.variance_predictor.VariancePredictor
class espnet2.tts.fastspeech2.variance_predictor.VariancePredictor(idim: int, n_layers: int = 2, n_chans: int = 384, kernel_size: int = 3, bias: bool = True, dropout_rate: float = 0.5)
Bases: Module
Variance predictor module.
This module implements the variance predictor described in FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
conv
List of convolutional layers for variance prediction.
- Type: torch.nn.ModuleList
linear
Linear layer for output prediction.
- Type: torch.nn.Linear
Parameters:
- idim (int) – Input dimension.
- n_layers (int) – Number of convolutional layers.
- n_chans (int) – Number of channels of convolutional layers.
- kernel_size (int) – Kernel size of convolutional layers.
- bias (bool) – Whether to use bias in convolutional layers.
- dropout_rate (float) – Dropout rate.
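The forward pass is documented as returning one value for each of the Tmax input positions, which only works if every convolutional layer preserves the sequence length. The sketch below works through the standard 1-D convolution output-length formula in plain Python. It assumes each layer uses stride 1 and a padding of `(kernel_size - 1) // 2`; that configuration is an assumption and is not stated on this page. `conv1d_out_len` and `stacked_out_len` are hypothetical helpers written for this illustration.

```python
def conv1d_out_len(t_in: int, kernel_size: int, padding: int,
                   stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1-D convolution (standard formula)."""
    return (t_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def stacked_out_len(t_in: int, n_layers: int = 2, kernel_size: int = 3) -> int:
    """Sequence length after n_layers stacked convolutions."""
    # Assumption: every layer uses stride 1 and padding = (kernel_size - 1) // 2.
    padding = (kernel_size - 1) // 2
    t = t_in
    for _ in range(n_layers):
        t = conv1d_out_len(t, kernel_size, padding)
    return t


# Odd kernel sizes keep the length unchanged under this padding.
print(stacked_out_len(100, n_layers=2, kernel_size=3))
# Even kernel sizes lose one position per layer under this padding.
print(stacked_out_len(100, n_layers=2, kernel_size=4))
```

Under these assumptions an odd `kernel_size` such as the default of 3 keeps Tmax intact, while an even value would shorten the sequence and break the (B, Tmax, 1) output contract.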
####### Examples
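>>> import torch
>>> from espnet2.tts.fastspeech2.variance_predictor import VariancePredictor
>>> vp = VariancePredictor(idim=256)
>>> input_tensor = torch.rand(8, 100, 256) # (B, Tmax, idim)
>>> masks = torch.zeros(8, 100, 1, dtype=torch.bool) # No padding; trailing dim broadcasts against the output
>>> output = vp(input_tensor, masks)
>>> print(output.shape) # torch.Size([8, 100, 1])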
- Raises: TypeError – If any of the arguments are of the wrong type.
NOTE
This module is designed for use in text-to-speech synthesis models, where it predicts one scalar per input position for a variance feature such as pitch or energy.
Initialize variance predictor module.
- Parameters:
- idim (int) – Input dimension.
- n_layers (int) – Number of convolutional layers.
- n_chans (int) – Number of channels of convolutional layers.
- kernel_size (int) – Kernel size of convolutional layers.
- bias (bool) – Whether to use bias in convolutional layers.
- dropout_rate (float) – Dropout rate.
forward(xs: Tensor, x_masks: Tensor | None = None) → Tensor
Calculate forward propagation.
This method passes the input sequences through the convolutional layers and the final linear layer, and returns one predicted value per time step of each sequence. If masks are provided, they mark the padded positions so that predictions there can be masked out.
- Parameters:
- xs (Tensor) – Batch of input sequences with shape (B, Tmax, idim).
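- x_masks (BoolTensor, optional) – Batch of masks that are True at padded positions. Documented with shape (B, Tmax); because the mask is applied to the (B, Tmax, 1) output, it may need a trailing singleton dimension, i.e. (B, Tmax, 1), to broadcast. Default is None.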
- Returns: Batch of predicted sequences with shape (B, Tmax, 1).
- Return type: Tensor
####### Examples
>>> vp = VariancePredictor(idim=80)
>>> input_tensor = torch.rand(32, 100, 80) # (B, Tmax, idim)
>>> mask_tensor = torch.zeros(32, 100, 1, dtype=torch.bool) # No padding; trailing dim broadcasts against the output
>>> output = vp.forward(input_tensor, mask_tensor)
>>> print(output.shape) # Should print: torch.Size([32, 100, 1])
NOTE
Ensure that xs has shape (B, Tmax, idim), with idim equal to the value passed to the constructor, and that the mask, if provided, is True at padded positions and can be broadcast against the (B, Tmax, 1) output.
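To make the mask convention concrete, the sketch below builds a padding mask from per-sequence lengths and zeroes the padded positions of a (B, Tmax, 1) prediction. It uses nested Python lists so that it runs without PyTorch. `make_pad_mask_list` and `zero_padded` are hypothetical helpers written for this illustration, and filling padded positions with zero is an assumption about how the mask is applied.

```python
from typing import List


def make_pad_mask_list(lengths: List[int]) -> List[List[bool]]:
    """Return a (B, Tmax) mask that is True where a position is padding."""
    t_max = max(lengths)
    return [[t >= n for t in range(t_max)] for n in lengths]


def zero_padded(preds: List[List[List[float]]],
                mask: List[List[bool]]) -> List[List[List[float]]]:
    """Zero every (B, Tmax, 1) prediction whose mask entry is True."""
    return [
        [[0.0] if is_pad else list(frame) for frame, is_pad in zip(seq, seq_mask)]
        for seq, seq_mask in zip(preds, mask)
    ]


lengths = [3, 1]  # true lengths of two sequences, padded to Tmax = 3
mask = make_pad_mask_list(lengths)
preds = [[[0.5], [0.2], [0.9]], [[0.7], [0.4], [0.1]]]  # (B=2, Tmax=3, 1)
masked = zero_padded(preds, mask)
print(mask)
print(masked)
```

With PyTorch tensors the same idea is usually expressed with a boolean mask and `Tensor.masked_fill`, where the mask carries a trailing singleton dimension so that it lines up with the (B, Tmax, 1) predictions.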