espnet2.asr_transducer.encoder.blocks.conv_input.ConvInput
class espnet2.asr_transducer.encoder.blocks.conv_input.ConvInput(input_size: int, conv_size: int | Tuple, subsampling_factor: int = 4, vgg_like: bool = True, output_size: int | None = None)
Bases: Module
ConvInput block for Transducer encoder.
This module defines a convolutional input block used in the Transducer encoder. It processes input sequences through a series of convolutional and pooling layers, optionally following a VGG-like architecture.
subsampling_factor
The factor by which the input sequence length is reduced.
- Type: int
vgg_like
Indicates if a VGG-like architecture is used.
- Type: bool
min_frame_length
The minimum frame length based on the subsampling factor.
- Type: int
output_size
The size of the output dimension after processing.
- Type: Optional[int]
Parameters:
- input_size (int) – Size of the input feature dimension.
- conv_size (Union[int, Tuple[int]]) – Size of the convolutional layers.
- subsampling_factor (int , optional) – Factor for subsampling (default is 4).
- vgg_like (bool , optional) – Flag to use VGG-like architecture (default is True).
- output_size (Optional[int], optional) – Output dimension size (default is None).
####### Examples
>>> conv_input = ConvInput(input_size=80, conv_size=(64, 128))
>>> input_tensor = torch.randn(32, 100, 80) # (B, T, D_feats)
>>> output, mask = conv_input(input_tensor)
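With the default subsampling_factor of 4, the time dimension shrinks by roughly a factor of four. The exact sub(T) depends on the internal pooling arithmetic, so treat the check below as an approximation rather than a guaranteed shape:
>>> output.shape[0]  # batch dimension is preserved
32
>>> output.shape[1]  # sub(T): roughly 100 // 4 with subsampling_factor=4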
Raises: ValueError – If conv_size does not match the expected format based on whether vgg_like is True or False.
NOTE
The architecture is designed to handle both VGG-like structures and standard convolutional structures based on the input parameters.
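As a hedged illustration of the two paths (the exact conv_size formats are inferred from the parameter descriptions above: a tuple of two layer sizes for the VGG-like path, a single int for the standard path):
>>> import torch
>>> from espnet2.asr_transducer.encoder.blocks.conv_input import ConvInput
>>> # VGG-like front-end: conv_size given as a pair of layer sizes
>>> vgg_input = ConvInput(input_size=80, conv_size=(64, 128), vgg_like=True)
>>> # Standard convolutional front-end: conv_size given as a single int (assumed format)
>>> std_input = ConvInput(input_size=80, conv_size=256, vgg_like=False)
>>> feats = torch.randn(8, 200, 80)  # (B, T, D_feats)
>>> out_vgg, _ = vgg_input(feats)
>>> out_std, _ = std_input(feats)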
Construct a ConvInput object.
forward(x: Tensor, mask: Tensor | None = None) → Tuple[Tensor, Tensor]
Encode input sequences.
This method passes the input features through the convolutional front-end, reducing the time dimension by the configured subsampling factor and projecting to the output dimension.
Parameters:
- x (torch.Tensor) – Input sequences. Shape is (B, T, D_feats).
- mask (Optional[torch.Tensor]) – Mask of the input sequences. Shape is (B, 1, T).
####### Examples
>>> conv_input = ConvInput(input_size=128, conv_size=(64, 128))
>>> x = torch.randn(32, 10, 128) # (Batch size, Time steps, Features)
>>> mask = torch.ones(32, 1, 10) # (Batch size, 1, Time steps)
>>> output, output_mask = conv_input(x, mask)
>>> print(output.shape)  # (B, sub(T), D_out): time reduced by the subsampling factor
Returns:
- x (torch.Tensor): Output sequences after convolution. Shape is (B, sub(T), D_out).
- mask (Optional[torch.Tensor]): Mask of the output sequences. Shape is (B, 1, sub(T)) if mask is provided, otherwise None.
Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
Raises: ValueError – If the input tensor does not match the expected dimensions.
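A hedged end-to-end sketch showing how the returned mask can be used to recover per-utterance lengths after subsampling; the mask construction and length computation are assumptions built only on the shapes documented above, not part of the ConvInput API:
>>> import torch
>>> conv_input = ConvInput(input_size=80, conv_size=(64, 128))
>>> feats = torch.randn(4, 120, 80)            # (B, T, D_feats)
>>> lengths = torch.tensor([120, 96, 64, 32])  # valid frames per utterance
>>> mask = (torch.arange(120)[None, :] < lengths[:, None]).unsqueeze(1)  # (B, 1, T), True for valid frames
>>> output, out_mask = conv_input(feats, mask)
>>> out_lengths = out_mask.squeeze(1).sum(dim=-1)  # subsampled lengths, roughly lengths // 4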