espnet2.gan_svs.vits.phoneme_predictor.PhonemePredictor
class espnet2.gan_svs.vits.phoneme_predictor.PhonemePredictor(vocabs: int, hidden_channels: int = 192, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 2, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0)
Bases: Module
Phoneme Predictor module in VISinger.
This module predicts phonemes from input features. It processes the input with an encoder and maps the encoder output to phoneme log probabilities through a linear layer.
phoneme_predictor
The encoder used for phoneme prediction.
- Type: Encoder
linear1
The linear layer that maps the encoder output to the vocabulary size.
- Type: Linear
Parameters:
- vocabs (int) – The size of the vocabulary.
- hidden_channels (int) – The number of hidden channels.
- attention_dim (int) – The dimension of the attention layers.
- attention_heads (int) – The number of attention heads.
- linear_units (int) – The number of linear units.
- blocks (int) – The number of encoder blocks.
- positionwise_layer_type (str) – The type of position-wise layer.
- positionwise_conv_kernel_size (int) – The kernel size of the position-wise convolution.
- positional_encoding_layer_type (str) – The type of positional encoding layer.
- self_attention_layer_type (str) – The type of self-attention layer.
- activation_type (str) – The type of activation function.
- normalize_before (bool) – Whether to apply normalization before the position-wise layer.
- use_macaron_style (bool) – Whether to use macaron-style feed-forward layers.
- use_conformer_conv (bool) – Whether to use the Conformer convolution module.
- conformer_kernel_size (int) – The kernel size of the Conformer convolution module.
- dropout_rate (float) – The dropout rate.
- positional_dropout_rate (float) – The dropout rate for positional encoding.
- attention_dropout_rate (float) – The dropout rate for attention.
Examples
>>> # Create a PhonemePredictor instance
>>> predictor = PhonemePredictor(vocabs=50)
>>> # Forward pass with random input and mask
>>> input_tensor = torch.randn(32, 192, 100)  # (Batch, Dim, Length)
>>> input_mask = torch.ones(32, 100)  # (Batch, Length)
>>> output = predictor(input_tensor, input_mask)
>>> print(output.shape)  # torch.Size([100, 32, 50])
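A further hedged sketch exercising the Conformer-related options listed above; the chosen values are illustrative, not recommended settings, and the tensor shapes simply follow the documentation above:

>>> import torch
>>> from espnet2.gan_svs.vits.phoneme_predictor import PhonemePredictor
>>> predictor = PhonemePredictor(
...     vocabs=50,
...     hidden_channels=192,
...     attention_dim=192,
...     blocks=4,  # deeper encoder than the default of 2
...     use_macaron_style=True,  # macaron-style feed-forward layers
...     use_conformer_conv=True,  # enable the Conformer convolution module
...     conformer_kernel_size=7,
... )
>>> x = torch.randn(4, 192, 50)  # (B, dim, length)
>>> x_mask = torch.ones(4, 50)  # (B, length), all frames valid
>>> log_probs = predictor(x, x_mask)
>>> print(log_probs.shape)  # expected: torch.Size([50, 4, 50])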
- Raises: ValueError – If any of the arguments are invalid (e.g., negative values).
NOTE
This module requires PyTorch and is designed to be used within the ESPnet framework.
Initialize PhonemePredictor module.
- Parameters: See the class-level parameter list above.
forward(x, x_mask)
Perform forward propagation through the Phoneme Predictor.
This method takes an input tensor and its corresponding mask, processes them through the phoneme predictor encoder, and returns log probabilities over phonemes.
- Parameters:
- x (Tensor) – The input tensor of shape (B, dim, length), where B is the batch size, dim is the number of features, and length is the sequence length.
- x_mask (Tensor) – The mask tensor for the input tensor of shape (B, length), used to ignore padding in the input during processing.
- Returns: The predicted phoneme tensor of shape (length, B, vocab_size), where vocab_size is the number of phoneme classes. The tensor contains log probabilities for each phoneme at each position in the input sequence.
- Return type: Tensor
Examples
>>> model = PhonemePredictor(vocabs=50)
>>> input_tensor = torch.rand(32, 192, 100)  # (B, dim, length); dim must match attention_dim
>>> input_mask = torch.ones(32, 100)  # (B, length)
>>> output = model(input_tensor, input_mask)
>>> print(output.shape)  # torch.Size([100, 32, 50])
NOTE
Ensure that the input tensor and mask are properly aligned and that the mask has the correct shape to avoid issues during processing.
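As a hedged illustration of the point above, a (B, length) mask can be built from per-utterance lengths in plain PyTorch, without assuming any ESPnet helper:

>>> import torch
>>> lengths = torch.tensor([100, 80, 60])  # valid frames per batch item
>>> max_len = int(lengths.max())
>>> x_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).float()
>>> print(x_mask.shape)  # torch.Size([3, 100]); (B, length)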
- Raises: ValueError – If the dimensions of the input tensor and mask do not match.
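Because the output is (length, B, vocab_size) log probabilities, it matches the (T, N, C) layout expected by torch.nn.CTCLoss, so it is consistent with a CTC-style objective. A minimal sketch; the targets and lengths below are illustrative placeholders, not taken from the source:

>>> import torch
>>> from espnet2.gan_svs.vits.phoneme_predictor import PhonemePredictor
>>> model = PhonemePredictor(vocabs=50)
>>> x = torch.randn(8, 192, 100)  # (B, dim, length)
>>> x_mask = torch.ones(8, 100)  # (B, length)
>>> log_probs = model(x, x_mask)  # (100, 8, 50), i.e., (T, N, C)
>>> ctc = torch.nn.CTCLoss(blank=0)
>>> targets = torch.randint(1, 50, (8, 20))  # random placeholder targets, blank excluded
>>> input_lengths = torch.full((8,), 100, dtype=torch.long)
>>> target_lengths = torch.full((8,), 20, dtype=torch.long)
>>> loss = ctc(log_probs, targets, input_lengths, target_lengths)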