espnet2.uasr.generator.conv_generator.ConvGenerator
class espnet2.uasr.generator.conv_generator.ConvGenerator(input_dim: int, output_dim: int, cfg: Dict | None = None, conv_kernel: int = 3, conv_dilation: int = 1, conv_stride: int = 9, pad: int = -1, bias: str2bool = False, dropout: float = 0.0, batch_norm: str2bool = True, batch_norm_weight: float = 30.0, residual: str2bool = True)
Bases: AbsGenerator
ConvGenerator is a convolutional generator for Unsupervised Automatic Speech Recognition (UASR). It inherits from AbsGenerator and generates output features from input audio features through a series of convolutional operations.
input_dim
The dimension of the input features.
- Type: int
output_dim
The dimension of the output features.
- Type: int
conv_kernel
The kernel size for the convolutional layer.
- Type: int
conv_dilation
The dilation rate for the convolutional layer.
- Type: int
conv_stride
The stride for the convolutional layer.
- Type: int
pad
The padding value for the convolutional layer.
- Type: int
bias
Whether to include a bias term in the convolutional layer.
- Type: bool
dropout
Dropout layer for regularization.
- Type: torch.nn.Dropout
batch_norm
Whether to use batch normalization.
- Type: bool
batch_norm_weight
The weight for batch normalization.
- Type: float
residual
Whether to use a residual connection.
- Type: bool
Parameters:
- input_dim (int) – The dimension of the input features.
- output_dim (int) – The dimension of the output features.
- cfg (Optional[Dict]) – Configuration dictionary for initializing the generator.
- conv_kernel (int) – Kernel size for the convolutional layer (default: 3).
- conv_dilation (int) – Dilation rate for the convolutional layer (default: 1).
- conv_stride (int) – Stride for the convolutional layer (default: 9).
- pad (int) – Padding value for the convolutional layer (default: -1).
- bias (str2bool) – Whether to include bias in convolution (default: False).
- dropout (float) – Dropout rate (default: 0.0).
- batch_norm (str2bool) – Whether to use batch normalization (default: True).
- batch_norm_weight (float) – Weight for batch normalization (default: 30.0).
- residual (str2bool) – Whether to use a residual connection (default: True).
Returns: Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor], torch.Tensor]: A tuple containing the generated sample, real sample (if text is provided), intermediate features (if residual is used), and generated sample padding mask.
Raises:
- AssertionError – If the text tensor contains no non-zero elements when generating the real sample.
########### Examples
>>> import torch
>>> # Initialize the ConvGenerator
>>> generator = ConvGenerator(input_dim=256, output_dim=128)
>>> # Forward pass through the generator
>>> feats = torch.randn(10, 256, 100)
>>> text = torch.randint(1, 128, (10, 20))  # non-zero targets; batch size matches feats
>>> feats_padding_mask = torch.ones(10, 100, dtype=torch.bool)
>>> generated_sample, real_sample, inter_x, padding_mask = generator(
...     feats=feats, text=text, feats_padding_mask=feats_padding_mask
... )
NOTE
The generated sample and padding mask will be based on the convolutional operations applied to the input features. The real sample is constructed based on the provided text, and it is expected that the text tensor has non-zero elements.
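For intuition, the following is a minimal sketch of how a one-hot "real sample" can be derived from a label sequence; the use of torch.nn.functional.one_hot and the shapes below are illustrative assumptions rather than the exact ESPnet implementation:
>>> import torch
>>> import torch.nn.functional as F
>>> output_dim = 128
>>> text = torch.randint(1, output_dim, (4, 25))  # assumed non-zero label ids
>>> assert (text != 0).any()  # mirrors the documented non-zero requirement
>>> real_sample = F.one_hot(text, num_classes=output_dim).float()
>>> real_sample.shape
torch.Size([4, 25, 128])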
Initialize internal Module state, shared by both nn.Module and ScriptModule.
bn_padded_data(feature: Tensor, padding_mask: Tensor)
Normalize the input features using batch normalization while considering
the padding mask.
This method applies batch normalization to the input feature tensor, only for the elements that are not masked by the padding mask. The elements corresponding to the padding mask are left unchanged.
- Parameters:
- feature (torch.Tensor) – The input feature tensor of shape (B, C, L), where B is the batch size, C is the number of channels, and L is the length of the sequence.
- padding_mask (torch.Tensor) – A boolean tensor of shape (B, L) that indicates which elements in the feature tensor should be considered for normalization. Elements with a value of True are included in the normalization, while those with False are ignored.
- Returns: The normalized feature tensor of the same shape as the input feature tensor, where the non-masked elements have been batch normalized.
- Return type: torch.Tensor
########### Examples
>>> import torch
>>> bn_layer = ConvGenerator(input_dim=64, output_dim=32)
>>> features = torch.randn(10, 64, 100)
>>> padding_mask = torch.ones(10, 100, dtype=torch.bool)
>>> padding_mask[:, 10:] = False
>>> normalized_features = bn_layer.bn_padded_data(features, padding_mask)
>>> print(normalized_features.shape)
torch.Size([10, 64, 100])
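For intuition, the following is a minimal sketch of batch normalization restricted to the non-masked positions; the (B, C, L) layout, the meaning of True in the mask, and the indexing below are assumptions for illustration, not the class's exact implementation:
>>> import torch
>>> bn = torch.nn.BatchNorm1d(64)
>>> feature = torch.randn(10, 64, 100)              # (B, C, L)
>>> mask = torch.zeros(10, 100, dtype=torch.bool)
>>> mask[:, :10] = True                             # assume only the first 10 frames are valid
>>> normed = feature.clone()
>>> valid = feature.transpose(1, 2)[mask]           # gather valid frames, shape (num_valid, C)
>>> normed.transpose(1, 2)[mask] = bn(valid.unsqueeze(-1)).squeeze(-1)
>>> normed.shape
torch.Size([10, 64, 100])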
forward(feats: Tensor, text: Tensor | None, feats_padding_mask: Tensor)
Perform the forward pass of the convolutional generator.
This method processes the input features and generates output samples using convolutional layers. It can optionally incorporate batch normalization and residual connections based on the initialization parameters.
- Parameters:
- feats (torch.Tensor) – Input tensor of shape (batch_size, input_dim, seq_len).
- text (Optional[torch.Tensor]) – Optional tensor of shape (batch_size, seq_len). Used to create a one-hot representation of the target outputs.
- feats_padding_mask (torch.Tensor) – A boolean mask of shape (batch_size, seq_len) indicating which elements are valid (True) or padded (False).
- Returns: Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor], torch.Tensor]: A tuple containing:
- generated_sample (torch.Tensor): Output tensor of shape
(batch_size, output_dim, new_seq_len) with generated samples.
- real_sample (Optional[torch.Tensor]): One-hot encoded tensor of shape (batch_size, seq_len, output_dim) for real samples. Returns None if text is None.
- inter_x (Optional[torch.Tensor]): Intermediate tensor from the residual connection. Returns None if residual is not used.
- generated_sample_padding_mask (torch.Tensor): A mask for the generated samples of shape (batch_size, new_seq_len) indicating valid elements.
- Raises:
- AssertionError – If the input text tensor contains no non-zero elements, or if the size of generated_sample_padding_mask does not match the expected output shape.
########### Examples
>>> generator = ConvGenerator(input_dim=256, output_dim=128)
>>> feats = torch.randn(32, 256, 50) # Example input features
>>> text = torch.randint(0, 128, (32, 50)) # Example text input
>>> feats_padding_mask = torch.ones(32, 50, dtype=torch.bool) # No padding
>>> output = generator.forward(feats, text, feats_padding_mask)
>>> generated_sample, real_sample, inter_x, mask = output
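As a rough guide, the new_seq_len of the generated sample and its padding mask can be anticipated with the standard Conv1d output-length formula; the helper and the padding=1 value below are illustrative assumptions (the class's pad=-1 default presumably selects a padding internally):
>>> def conv_out_len(seq_len, kernel=3, stride=9, dilation=1, padding=1):
...     # standard torch.nn.Conv1d output-length formula
...     return (seq_len + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
>>> conv_out_len(50)
6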
output_size()
Returns the output dimension of the convolutional generator.
This method is useful for retrieving the output size after the convolutional layers have been applied, particularly when the generator is part of a larger model and the output dimensions need to be known for subsequent processing steps.
- Returns: The output dimension of the generator.
- Return type: int
########### Examples
>>> generator = ConvGenerator(input_dim=128, output_dim=256)
>>> output_dim = generator.output_size()  # output_dim will be 256