espnet2.uasr.generator.conv_generator.ConvGenerator
class espnet2.uasr.generator.conv_generator.ConvGenerator(input_dim: int, output_dim: int, cfg: Dict | None = None, conv_kernel: int = 3, conv_dilation: int = 1, conv_stride: int = 9, pad: int = -1, bias: str2bool = False, dropout: float = 0.0, batch_norm: str2bool = True, batch_norm_weight: float = 30.0, residual: str2bool = True)
Bases: AbsGenerator
ConvGenerator is a convolutional generator for Unsupervised Automatic Speech Recognition (UASR). It inherits from AbsGenerator and generates output features from input audio features through a series of convolutional operations.
input_dim
The dimension of the input features.
- Type: int
output_dim
The dimension of the output features.
- Type: int
conv_kernel
The kernel size for the convolutional layer.
- Type: int
conv_dilation
The dilation rate for the convolutional layer.
- Type: int
conv_stride
The stride for the convolutional layer.
- Type: int
pad
The padding value for the convolutional layer.
- Type: int
bias
Whether to include a bias term in the convolutional layer.
- Type: bool
dropout
Dropout layer for regularization.
- Type: torch.nn.Dropout
batch_norm
Whether to use batch normalization.
- Type: bool
batch_norm_weight
The weight for batch normalization.
- Type: float
residual
Whether to use a residual connection.
- Type: bool
Parameters:
- input_dim (int) – The dimension of the input features.
- output_dim (int) – The dimension of the output features.
- cfg (Optional[Dict]) – Configuration dictionary for initializing the generator.
- conv_kernel (int) – Kernel size for the convolutional layer (default: 3).
- conv_dilation (int) – Dilation rate for the convolutional layer (default: 1).
- conv_stride (int) – Stride for the convolutional layer (default: 9).
- pad (int) – Padding value for the convolutional layer (default: -1).
- bias (str2bool) – Whether to include bias in convolution (default: False).
- dropout (float) – Dropout rate (default: 0.0).
- batch_norm (str2bool) – Whether to use batch normalization (default: True).
- batch_norm_weight (float) – Weight for batch normalization (default: 30.0).
- residual (str2bool) – Whether to use a residual connection (default: True).
Returns: Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor], torch.Tensor]: A tuple containing the generated sample, real sample (if text is provided), intermediate features (if residual is used), and generated sample padding mask.
Raises:
- AssertionError – If the text tensor contains no non-zero elements when generating the real sample.
########### Examples
>>> import torch
>>> # Initialize the ConvGenerator
>>> generator = ConvGenerator(input_dim=256, output_dim=128)
>>> # Forward pass through the generator
>>> feats = torch.randn(10, 256, 100)
>>> text = torch.randint(1, 128, (10, 20))  # non-zero targets; batch size matches feats
>>> feats_padding_mask = torch.ones(10, 100, dtype=torch.bool)
>>> generated_sample, real_sample, inter_x, padding_mask = generator(
...     feats=feats, text=text, feats_padding_mask=feats_padding_mask
... )
NOTE
The generated sample and padding mask will be based on the convolutional operations applied to the input features. The real sample is constructed based on the provided text, and it is expected that the text tensor has non-zero elements.
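For intuition, the following is a minimal sketch of how a one-hot "real sample" can be derived from a label sequence; the use of torch.nn.functional.one_hot and the shapes below are illustrative assumptions rather than the exact ESPnet implementation:
>>> import torch
>>> import torch.nn.functional as F
>>> output_dim = 128
>>> text = torch.randint(1, output_dim, (4, 25))  # assumed non-zero label ids
>>> assert (text != 0).any()  # mirrors the documented non-zero requirement
>>> real_sample = F.one_hot(text, num_classes=output_dim).float()
>>> real_sample.shape
torch.Size([4, 25, 128])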
Initialize internal Module state, shared by both nn.Module and ScriptModule.
bn_padded_data(feature: Tensor, padding_mask: Tensor)
Normalize the input features using batch normalization while considering
the padding mask.
This method applies batch normalization to the input feature tensor, only for the elements that are not masked by the padding mask. The elements corresponding to the padding mask are left unchanged.
- Parameters:
- feature (torch.Tensor) – The input feature tensor of shape (B, C, L), where B is the batch size, C is the number of channels, and L is the length of the sequence.
- padding_mask (torch.Tensor) – A boolean tensor of shape (B, L) that indicates which elements in the feature tensor should be considered for normalization. Elements with a value of True are included in the normalization, while those with False are ignored.
- Returns: The normalized feature tensor of the same shape as the input feature tensor, where the non-masked elements have been batch normalized.
- Return type: torch.Tensor
########### Examples
>>> import torch
>>> bn_layer = ConvGenerator(input_dim=64, output_dim=32)
>>> features = torch.randn(10, 64, 100)
>>> padding_mask = torch.ones(10, 100, dtype=torch.bool)
>>> padding_mask[:, 10:] = False
>>> normalized_features = bn_layer.bn_padded_data(features, padding_mask)
>>> print(normalized_features.shape)
torch.Size([10, 64, 100])
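For intuition, the following is a minimal sketch of batch normalization restricted to the non-masked positions; the (B, C, L) layout, the meaning of True in the mask, and the indexing below are assumptions for illustration, not the class's exact implementation:
>>> import torch
>>> bn = torch.nn.BatchNorm1d(64)
>>> feature = torch.randn(10, 64, 100)              # (B, C, L)
>>> mask = torch.zeros(10, 100, dtype=torch.bool)
>>> mask[:, :10] = True                             # assume only the first 10 frames are valid
>>> normed = feature.clone()
>>> valid = feature.transpose(1, 2)[mask]           # gather valid frames, shape (num_valid, C)
>>> normed.transpose(1, 2)[mask] = bn(valid.unsqueeze(-1)).squeeze(-1)
>>> normed.shape
torch.Size([10, 64, 100])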
forward(feats: Tensor, text: Tensor | None, feats_padding_mask: Tensor)
Perform the forward pass of the convolutional generator.
This method processes the input features and generates output samples using convolutional layers. It can optionally incorporate batch normalization and residual connections based on the initialization parameters.
- Parameters:
- feats (torch.Tensor) – Input tensor of shape (batch_size, input_dim, seq_len).
- text (Optional[torch.Tensor]) – Optional tensor of shape (batch_size, seq_len). Used to create a one-hot representation of the target outputs.
- feats_padding_mask (torch.Tensor) – A boolean mask of shape (batch_size, seq_len) indicating which elements are valid (True) or padded (False).
- Returns: Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor], torch.Tensor]: A tuple containing:
- generated_sample (torch.Tensor): Output tensor of shape
(batch_size, output_dim, new_seq_len) with generated samples.
- real_sample (Optional[torch.Tensor]): One-hot encoded tensor of shape (batch_size, seq_len, output_dim) for real samples. Returns None if text is None.
- inter_x (Optional[torch.Tensor]): Intermediate tensor from the residual connection. Returns None if residual is not used.
- generated_sample_padding_mask (torch.Tensor): A mask for the generated samples of shape (batch_size, new_seq_len) indicating valid elements.
- Raises:
- AssertionError – If the input text tensor contains no non-zero elements, or if the size of generated_sample_padding_mask does not match the expected output shape.
########### Examples
>>> generator = ConvGenerator(input_dim=256, output_dim=128)
>>> feats = torch.randn(32, 256, 50) # Example input features
>>> text = torch.randint(0, 128, (32, 50)) # Example text input
>>> feats_padding_mask = torch.ones(32, 50, dtype=torch.bool) # No padding
>>> output = generator.forward(feats, text, feats_padding_mask)
>>> generated_sample, real_sample, inter_x, mask = output
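As a rough guide, the new_seq_len of the generated sample and its padding mask can be anticipated with the standard Conv1d output-length formula; the helper and the padding=1 value below are illustrative assumptions (the class's pad=-1 default presumably selects a padding internally):
>>> def conv_out_len(seq_len, kernel=3, stride=9, dilation=1, padding=1):
...     # standard torch.nn.Conv1d output-length formula
...     return (seq_len + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
>>> conv_out_len(50)
6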
output_size()
Returns the output dimension of the convolutional generator.
This method is useful for retrieving the output size after the convolutional layers have been applied, particularly when the generator is part of a larger model and the output dimensions need to be known for subsequent processing steps.
- Returns: The output dimension of the generator.
- Return type: int
########### Examples
>>> generator = ConvGenerator(input_dim=128, output_dim=256)
>>> output_dim = generator.output_size()  # output_dim will be 256