espnet2.uasr.discriminator.conv_discriminator.ConvDiscriminator
class espnet2.uasr.discriminator.conv_discriminator.ConvDiscriminator(input_dim: int, cfg: Dict | None = None, conv_channels: int = 384, conv_kernel: int = 8, conv_dilation: int = 1, conv_depth: int = 2, linear_emb: str2bool = False, causal: str2bool = True, max_pool: str2bool = False, act_after_linear: str2bool = False, dropout: float = 0.0, spectral_norm: str2bool = False, weight_norm: str2bool = False)
Bases: AbsDiscriminator
Convolutional discriminator for Unsupervised Automatic Speech Recognition (UASR).
This class implements a convolutional neural network (CNN) based discriminator for the UASR task. It stacks 1-D convolutional layers, optionally with causal padding, dropout, and spectral or weight normalization, to map input feature sequences to discriminative outputs.
conv_channels
Number of channels for convolutional layers.
- Type: int
conv_kernel
Size of the convolutional kernel.
- Type: int
conv_dilation
Dilation rate for convolutional layers.
- Type: int
conv_depth
Number of convolutional layers in the network.
- Type: int
linear_emb
Whether to use a linear embedding.
- Type: bool
causal
If True, applies causal convolution.
- Type: bool
max_pool
If True, applies max pooling in the output layer.
- Type: bool
act_after_linear
If True, applies activation after the linear layer.
- Type: bool
dropout
Dropout rate for regularization.
- Type: float
spectral_norm
If True, applies spectral normalization to conv layers.
- Type: bool
weight_norm
If True, applies weight normalization to conv layers.
- Type: bool
Parameters:
- input_dim (int) – Dimension of the input features.
- cfg (Optional[Dict], optional) – Configuration dictionary. Defaults to None.
- conv_channels (int, optional) – Number of channels for convolutional layers. Defaults to 384.
- conv_kernel (int, optional) – Size of the convolutional kernel. Defaults to 8.
- conv_dilation (int, optional) – Dilation rate for convolutional layers. Defaults to 1.
- conv_depth (int, optional) – Number of convolutional layers. Defaults to 2.
- linear_emb (str2bool, optional) – Use a linear embedding. Defaults to False.
- causal (str2bool, optional) – Use causal convolution. Defaults to True.
- max_pool (str2bool, optional) – Use max pooling in the output layer. Defaults to False.
- act_after_linear (str2bool, optional) – Apply activation after the linear layer. Defaults to False.
- dropout (float, optional) – Dropout rate. Defaults to 0.0.
- spectral_norm (str2bool, optional) – Use spectral normalization. Defaults to False.
- weight_norm (str2bool, optional) – Use weight normalization. Defaults to False.
Returns: The output of the discriminator.
Return type: torch.Tensor
####### Examples
>>> discriminator = ConvDiscriminator(input_dim=128)
>>> input_tensor = torch.randn(32, 100, 128) # (Batch, Time, Features)
>>> output = discriminator(input_tensor, None)
>>> print(output.shape)
torch.Size([32, 1]) # Output shape depends on the architecture
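A minimal sketch with non-default options is shown below; the specific values (channels, depth, dropout) are illustrative assumptions rather than settings taken from an ESPnet recipe.
>>> import torch
>>> from espnet2.uasr.discriminator.conv_discriminator import ConvDiscriminator
>>> # Illustrative non-default configuration (assumed values, not a tuned recipe)
>>> discriminator = ConvDiscriminator(
...     input_dim=128,
...     conv_channels=256,
...     conv_depth=3,
...     max_pool=True,
...     dropout=0.1,
... )
>>> feats = torch.randn(4, 50, 128)  # (Batch, Time, Features)
>>> score = discriminator(feats, None)  # padding_mask is required; pass None to skip masking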
NOTE
The input tensor is expected to have shape (Batch, Time, Features). It is passed through the convolutional layers and may undergo padding and pooling depending on the configuration.
- Raises: ValueError – If input_dim is not positive.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x: Tensor, padding_mask: Tensor | None)
Forward pass for the ConvDiscriminator.
This method passes the input tensor x through the convolutional layers of the network. It expects x to have shape (Batch, Time, Channel) and transposes it internally for the convolution operations. If padding_mask is provided, padded positions are masked before the output is computed.
- Parameters:
- x (torch.Tensor) – The input tensor with shape (Batch, Time, Channel).
- padding_mask (Optional[torch.Tensor]) – A tensor of shape (Batch, Time) used to mask out padded elements of the input. When provided, masked positions are handled before pooling (e.g. set to negative infinity when max pooling is used).
- Returns: The output tensor after passing through the network, with shape (Batch, Channel).
- Return type: torch.Tensor
####### Examples
>>> discriminator = ConvDiscriminator(input_dim=128)
>>> input_tensor = torch.randn(32, 100, 128) # (Batch, Time, Channel)
>>> output = discriminator(input_tensor, None)
>>> output.shape
torch.Size([32, 1])
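A minimal sketch of calling forward with an explicit padding_mask follows; it assumes a boolean mask of shape (Batch, Time) in which True marks padded frames, which should be verified against the calling recipe.
>>> import torch
>>> from espnet2.uasr.discriminator.conv_discriminator import ConvDiscriminator
>>> discriminator = ConvDiscriminator(input_dim=128, max_pool=True)
>>> x = torch.randn(8, 100, 128)  # (Batch, Time, Channel)
>>> padding_mask = torch.zeros(8, 100, dtype=torch.bool)
>>> padding_mask[:, 80:] = True  # assumed convention: True marks padded frames
>>> output = discriminator(x, padding_mask)  # masked frames are excluded before pooling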
NOTE
The input tensor x will be transposed to match the expected shape for convolution operations. If a padding_mask is provided, it will be used to handle padding accordingly.
- Raises: ValueError – If the shape of x is not compatible with the expected input dimensions.