espnet2.uasr.discriminator.conv_discriminator.ConvDiscriminator
class espnet2.uasr.discriminator.conv_discriminator.ConvDiscriminator(input_dim: int, cfg: Dict | None = None, conv_channels: int = 384, conv_kernel: int = 8, conv_dilation: int = 1, conv_depth: int = 2, linear_emb: str2bool = False, causal: str2bool = True, max_pool: str2bool = False, act_after_linear: str2bool = False, dropout: float = 0.0, spectral_norm: str2bool = False, weight_norm: str2bool = False)
Bases: AbsDiscriminator
Convolutional discriminator for Unsupervised Automatic Speech Recognition (UASR).
This class implements a convolutional neural network (CNN) based discriminator for the UASR task. It stacks 1-D convolutional layers, optionally with causal padding, dropout, and spectral or weight normalization, to map input feature sequences to discriminative outputs.
conv_channels
Number of channels for convolutional layers.
- Type: int
conv_kernel
Size of the convolutional kernel.
- Type: int
conv_dilation
Dilation rate for convolutional layers.
- Type: int
conv_depth
Number of convolutional layers in the network.
- Type: int
linear_emb
Whether to use a linear embedding.
- Type: bool
causal
If True, applies causal convolution.
- Type: bool
max_pool
If True, applies max pooling in the output layer.
- Type: bool
act_after_linear
If True, applies activation after the linear layer.
- Type: bool
dropout
Dropout rate for regularization.
- Type: float
spectral_norm
If True, applies spectral normalization to conv layers.
- Type: bool
weight_norm
If True, applies weight normalization to conv layers.
- Type: bool
Parameters:
- input_dim (int) – Dimension of the input features.
- cfg (Optional[Dict], optional) – Configuration dictionary. Defaults to None.
- conv_channels (int, optional) – Number of channels for convolutional layers. Defaults to 384.
- conv_kernel (int, optional) – Size of the convolutional kernel. Defaults to 8.
- conv_dilation (int, optional) – Dilation rate for convolutional layers. Defaults to 1.
- conv_depth (int, optional) – Number of convolutional layers. Defaults to 2.
- linear_emb (str2bool, optional) – Use a linear embedding. Defaults to False.
- causal (str2bool, optional) – Use causal convolution. Defaults to True.
- max_pool (str2bool, optional) – Use max pooling in the output layer. Defaults to False.
- act_after_linear (str2bool, optional) – Apply activation after the linear layer. Defaults to False.
- dropout (float, optional) – Dropout rate. Defaults to 0.0.
- spectral_norm (str2bool, optional) – Use spectral normalization. Defaults to False.
- weight_norm (str2bool, optional) – Use weight normalization. Defaults to False.
Returns: The output of the discriminator.
Return type: torch.Tensor
####### Examples
>>> discriminator = ConvDiscriminator(input_dim=128)
>>> input_tensor = torch.randn(32, 100, 128) # (Batch, Time, Features)
>>> output = discriminator(input_tensor, None)
>>> print(output.shape)
torch.Size([32, 1]) # Output shape depends on the architecture
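A minimal sketch with non-default options is shown below; the specific values (channels, depth, dropout) are illustrative assumptions rather than settings taken from an ESPnet recipe.
>>> import torch
>>> from espnet2.uasr.discriminator.conv_discriminator import ConvDiscriminator
>>> # Illustrative non-default configuration (assumed values, not a tuned recipe)
>>> discriminator = ConvDiscriminator(
...     input_dim=128,
...     conv_channels=256,
...     conv_depth=3,
...     max_pool=True,
...     dropout=0.1,
... )
>>> feats = torch.randn(4, 50, 128)  # (Batch, Time, Features)
>>> score = discriminator(feats, None)  # padding_mask is required; pass None to skip masking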
NOTE
The input tensor is expected to have shape (Batch, Time, Features). It is passed through the convolutional layers and may undergo padding and pooling depending on the configuration.
- Raises: ValueError – If input_dim is not positive.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x: Tensor, padding_mask: Tensor | None)
Forward pass for the ConvDiscriminator.
This method passes the input tensor x through the convolutional layers of the network. It expects x to have shape (Batch, Time, Channel) and transposes it internally for the convolution operations. If padding_mask is provided, padded positions are masked before the output is computed.
- Parameters:
- x (torch.Tensor) – The input tensor with shape (Batch, Time, Channel).
- padding_mask (Optional[torch.Tensor]) – A tensor of shape (Batch, Time) used to mask out padded elements of the input. When provided, masked positions are handled before pooling (e.g. set to negative infinity when max pooling is used).
- Returns: The output tensor after passing through the network, with shape (Batch, Channel).
- Return type: torch.Tensor
####### Examples
>>> discriminator = ConvDiscriminator(input_dim=128)
>>> input_tensor = torch.randn(32, 100, 128) # (Batch, Time, Channel)
>>> output = discriminator(input_tensor, None)
>>> output.shape
torch.Size([32, 1])
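A minimal sketch of calling forward with an explicit padding_mask follows; it assumes a boolean mask of shape (Batch, Time) in which True marks padded frames, which should be verified against the calling recipe.
>>> import torch
>>> from espnet2.uasr.discriminator.conv_discriminator import ConvDiscriminator
>>> discriminator = ConvDiscriminator(input_dim=128, max_pool=True)
>>> x = torch.randn(8, 100, 128)  # (Batch, Time, Channel)
>>> padding_mask = torch.zeros(8, 100, dtype=torch.bool)
>>> padding_mask[:, 80:] = True  # assumed convention: True marks padded frames
>>> output = discriminator(x, padding_mask)  # masked frames are excluded before pooling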
NOTE
The input tensor x will be transposed to match the expected shape for convolution operations. If a padding_mask is provided, it will be used to handle padding accordingly.
- Raises: ValueError – If the shape of x is not compatible with the expected input dimensions.