espnet2.gan_svs.vits.phoneme_predictor.PhonemePredictor
class espnet2.gan_svs.vits.phoneme_predictor.PhonemePredictor(vocabs: int, hidden_channels: int = 192, attention_dim: int = 192, attention_heads: int = 2, linear_units: int = 768, blocks: int = 2, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 3, positional_encoding_layer_type: str = 'rel_pos', self_attention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', normalize_before: bool = True, use_macaron_style: bool = False, use_conformer_conv: bool = False, conformer_kernel_size: int = 7, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0)
Bases: Module
Phoneme Predictor module in VISinger.
This module predicts phonemes from input features. It processes the input with an encoder and maps the encoder output to phoneme log probabilities through a linear layer.
phoneme_predictor
The encoder used for phoneme prediction.
- Type: Encoder
linear1
The linear layer that maps the encoder output to the vocabulary size.
- Type: Linear
Parameters:
- vocabs (int) – The size of the vocabulary.
- hidden_channels (int) – The number of hidden channels.
- attention_dim (int) – The dimension of the attention layers.
- attention_heads (int) – The number of attention heads.
- linear_units (int) – The number of linear units.
- blocks (int) – The number of encoder blocks.
- positionwise_layer_type (str) – The type of position-wise layer.
- positionwise_conv_kernel_size (int) – The kernel size of the position-wise convolution.
- positional_encoding_layer_type (str) – The type of positional encoding layer.
- self_attention_layer_type (str) – The type of self-attention layer.
- activation_type (str) – The type of activation function.
- normalize_before (bool) – Whether to apply normalization before the position-wise layer.
- use_macaron_style (bool) – Whether to use macaron-style feed-forward layers.
- use_conformer_conv (bool) – Whether to use the Conformer convolution module.
- conformer_kernel_size (int) – The kernel size of the Conformer convolution module.
- dropout_rate (float) – The dropout rate.
- positional_dropout_rate (float) – The dropout rate for positional encoding.
- attention_dropout_rate (float) – The dropout rate for attention.
Examples
>>> # Create a PhonemePredictor instance
>>> predictor = PhonemePredictor(vocabs=50)
>>> # Forward pass with random input and mask
>>> input_tensor = torch.randn(32, 192, 100)  # (Batch, Dim, Length)
>>> input_mask = torch.ones(32, 100)  # (Batch, Length)
>>> output = predictor(input_tensor, input_mask)
>>> print(output.shape)  # torch.Size([100, 32, 50])
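A further hedged sketch exercising the Conformer-related options listed above; the chosen values are illustrative, not recommended settings, and the tensor shapes simply follow the documentation above:

>>> import torch
>>> from espnet2.gan_svs.vits.phoneme_predictor import PhonemePredictor
>>> predictor = PhonemePredictor(
...     vocabs=50,
...     hidden_channels=192,
...     attention_dim=192,
...     blocks=4,  # deeper encoder than the default of 2
...     use_macaron_style=True,  # macaron-style feed-forward layers
...     use_conformer_conv=True,  # enable the Conformer convolution module
...     conformer_kernel_size=7,
... )
>>> x = torch.randn(4, 192, 50)  # (B, dim, length)
>>> x_mask = torch.ones(4, 50)  # (B, length), all frames valid
>>> log_probs = predictor(x, x_mask)
>>> print(log_probs.shape)  # expected: torch.Size([50, 4, 50])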
- Raises: ValueError – If any of the arguments are invalid (e.g., negative values).
NOTE
This module requires PyTorch and is designed to be used within the ESPnet framework.
Initialize PhonemePredictor module.
- Parameters: See the class-level parameter list above.
forward(x, x_mask)
Perform forward propagation through the Phoneme Predictor.
This method takes an input tensor and its corresponding mask, processes them through the phoneme predictor encoder, and returns log probabilities over phonemes.
- Parameters:
- x (Tensor) – The input tensor of shape (B, dim, length), where B is the batch size, dim is the number of features, and length is the sequence length.
- x_mask (Tensor) – The mask tensor for the input tensor of shape (B, length), used to ignore padding in the input during processing.
- Returns: The predicted phoneme tensor of shape (length, B, vocab_size), where vocab_size is the number of phoneme classes. The tensor contains log probabilities for each phoneme at each position in the input sequence.
- Return type: Tensor
Examples
>>> model = PhonemePredictor(vocabs=50)
>>> input_tensor = torch.rand(32, 192, 100)  # (B, dim, length); dim must match attention_dim
>>> input_mask = torch.ones(32, 100)  # (B, length)
>>> output = model(input_tensor, input_mask)
>>> print(output.shape)  # torch.Size([100, 32, 50])
NOTE
Ensure that the input tensor and mask are properly aligned and that the mask has the correct shape to avoid issues during processing.
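As a hedged illustration of the point above, a (B, length) mask can be built from per-utterance lengths in plain PyTorch, without assuming any ESPnet helper:

>>> import torch
>>> lengths = torch.tensor([100, 80, 60])  # valid frames per batch item
>>> max_len = int(lengths.max())
>>> x_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).float()
>>> print(x_mask.shape)  # torch.Size([3, 100]); (B, length)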
- Raises: ValueError – If the dimensions of the input tensor and mask do not match.
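Because the output is (length, B, vocab_size) log probabilities, it matches the (T, N, C) layout expected by torch.nn.CTCLoss, so it is consistent with a CTC-style objective. A minimal sketch; the targets and lengths below are illustrative placeholders, not taken from the source:

>>> import torch
>>> from espnet2.gan_svs.vits.phoneme_predictor import PhonemePredictor
>>> model = PhonemePredictor(vocabs=50)
>>> x = torch.randn(8, 192, 100)  # (B, dim, length)
>>> x_mask = torch.ones(8, 100)  # (B, length)
>>> log_probs = model(x, x_mask)  # (100, 8, 50), i.e., (T, N, C)
>>> ctc = torch.nn.CTCLoss(blank=0)
>>> targets = torch.randint(1, 50, (8, 20))  # random placeholder targets, blank excluded
>>> input_lengths = torch.full((8,), 100, dtype=torch.long)
>>> target_lengths = torch.full((8,), 20, dtype=torch.long)
>>> loss = ctc(log_probs, targets, input_lengths, target_lengths)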