espnet2.asr_transducer.encoder.blocks.conv_input.ConvInput
class espnet2.asr_transducer.encoder.blocks.conv_input.ConvInput(input_size: int, conv_size: int | Tuple, subsampling_factor: int = 4, vgg_like: bool = True, output_size: int | None = None)
Bases: Module
ConvInput block for Transducer encoder.
This module defines a convolutional input block used in the Transducer encoder. It processes input sequences through a series of convolutional and pooling layers, optionally following a VGG-like architecture.
subsampling_factor
The factor by which the input sequence length is reduced.
- Type: int
vgg_like
Indicates if a VGG-like architecture is used.
- Type: bool
min_frame_length
The minimum frame length based on the subsampling factor.
- Type: int
output_size
The size of the output dimension after processing.
- Type: Optional[int]
Parameters:
- input_size (int) – Size of the input feature dimension.
- conv_size (Union[int, Tuple[int]]) – Size of the convolutional layers.
- subsampling_factor (int , optional) – Factor for subsampling (default is 4).
- vgg_like (bool , optional) – Flag to use VGG-like architecture (default is True).
- output_size (Optional[int], optional) – Output dimension size (default is None).
####### Examples
>>> conv_input = ConvInput(input_size=80, conv_size=(64, 128))
>>> input_tensor = torch.randn(32, 100, 80) # (B, T, D_feats)
>>> output, mask = conv_input(input_tensor)
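With the default subsampling_factor of 4, the time dimension shrinks by roughly a factor of four. The exact sub(T) depends on the internal pooling arithmetic, so treat the check below as an approximation rather than a guaranteed shape:
>>> output.shape[0]  # batch dimension is preserved
32
>>> output.shape[1]  # sub(T): roughly 100 // 4 with subsampling_factor=4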
Raises: ValueError – If conv_size does not match the expected format based on whether vgg_like is True or False.
NOTE
The architecture is designed to handle both VGG-like structures and standard convolutional structures based on the input parameters.
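As a hedged illustration of the two paths (the exact conv_size formats are inferred from the parameter descriptions above: a tuple of two layer sizes for the VGG-like path, a single int for the standard path):
>>> import torch
>>> from espnet2.asr_transducer.encoder.blocks.conv_input import ConvInput
>>> # VGG-like front-end: conv_size given as a pair of layer sizes
>>> vgg_input = ConvInput(input_size=80, conv_size=(64, 128), vgg_like=True)
>>> # Standard convolutional front-end: conv_size given as a single int (assumed format)
>>> std_input = ConvInput(input_size=80, conv_size=256, vgg_like=False)
>>> feats = torch.randn(8, 200, 80)  # (B, T, D_feats)
>>> out_vgg, _ = vgg_input(feats)
>>> out_std, _ = std_input(feats)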
Construct a ConvInput object.
forward(x: Tensor, mask: Tensor | None = None) → Tuple[Tensor, Tensor]
Encode input sequences.
This method passes the input features through the convolutional front-end, reducing the time dimension by the configured subsampling factor and projecting to the output dimension.
Parameters:
- x (torch.Tensor) – Input sequences. Shape is (B, T, D_feats).
- mask (Optional[torch.Tensor]) – Mask of the input sequences. Shape is (B, 1, T).
####### Examples
>>> conv_input = ConvInput(input_size=128, conv_size=(64, 128))
>>> x = torch.randn(32, 10, 128) # (Batch size, Time steps, Features)
>>> mask = torch.ones(32, 1, 10) # (Batch size, 1, Time steps)
>>> output, output_mask = conv_input(x, mask)
>>> print(output.shape)  # (B, sub(T), D_out): time reduced by the subsampling factor
Returns:
- x (torch.Tensor): Output sequences after convolution. Shape is (B, sub(T), D_out).
- mask (Optional[torch.Tensor]): Mask of the output sequences. Shape is (B, 1, sub(T)) if mask is provided, otherwise None.
Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
Raises: ValueError – If the input tensor does not match the expected dimensions.
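A hedged end-to-end sketch showing how the returned mask can be used to recover per-utterance lengths after subsampling; the mask construction and length computation are assumptions built only on the shapes documented above, not part of the ConvInput API:
>>> import torch
>>> conv_input = ConvInput(input_size=80, conv_size=(64, 128))
>>> feats = torch.randn(4, 120, 80)            # (B, T, D_feats)
>>> lengths = torch.tensor([120, 96, 64, 32])  # valid frames per utterance
>>> mask = (torch.arange(120)[None, :] < lengths[:, None]).unsqueeze(1)  # (B, 1, T), True for valid frames
>>> output, out_mask = conv_input(feats, mask)
>>> out_lengths = out_mask.squeeze(1).sum(dim=-1)  # subsampled lengths, roughly lengths // 4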