espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d

class espnet2.asr_transducer.encoder.blocks.conv1d.Conv1d(input_size: int, output_size: int, kernel_size: int | Tuple, stride: int | Tuple = 1, dilation: int | Tuple = 1, groups: int | Tuple = 1, bias: bool = True, batch_norm: bool = False, relu: bool = True, causal: bool = False, dropout_rate: float = 0.0)

Bases: Module

Conv1d block for Transducer encoder.

This class defines a 1D convolutional layer that can be used as a building block for the Transducer encoder architecture. It supports various configurations, including causal convolution, batch normalization, and activation functions.

input_size

The input dimension of the Conv1d layer.

Type: int

output_size

The output dimension of the Conv1d layer.

Type: int

kernel_size

Size of the convolving kernel.

Type: Union[int, Tuple]

stride

Stride of the convolution.

Type: Union[int, Tuple]

dilation

Spacing between the kernel points.

Type: Union[int, Tuple]

groups

Number of blocked connections from input channels to output channels.

Type: Union[int, Tuple]

bias

Whether to add a learnable bias to the output.

Type: bool

batch_norm

Whether to use batch normalization after convolution.

Type: bool

relu

Whether to use a ReLU activation after convolution.

Type: bool

causal

Whether to use causal convolution (set to True if streaming).

Type: bool

dropout_rate

Dropout rate for regularization.

Type: float
Parameters:
- input_size – Input dimension.
- output_size – Output dimension.
- kernel_size – Size of the convolving kernel.
- stride – Stride of the convolution.
- dilation – Spacing between the kernel points.
- groups – Number of blocked connections from input channels to output channels.
- bias – Whether to add a learnable bias to the output.
- batch_norm – Whether to use batch normalization after convolution.
- relu – Whether to use a ReLU activation after convolution.
- causal – Whether to use causal convolution (set to True if streaming).
- dropout_rate – Dropout rate.

############### Examples

>>> import torch
>>> from espnet2.asr_transducer.encoder.blocks.conv1d import Conv1d
>>> conv_layer = Conv1d(
...     input_size=128,
...     output_size=64,
...     kernel_size=3,
...     stride=1,
...     batch_norm=True,
...     relu=True,
... )
>>> x = torch.randn(32, 100, 128)  # (B, T, D_in)
>>> pos_enc = torch.randn(32, 198, 128)  # (B, 2 * (T - 1), D_in)
>>> mask = torch.ones(32, 100)  # (B, T)
>>> output, new_mask, pos_enc_out = conv_layer(x, pos_enc, mask)

Raises: ValueError – If the input dimensions do not match the expected shape.

########## NOTE This module uses the PyTorch framework and is designed for efficient processing of sequential data.

Construct a Conv1d object.

chunk_forward(x: Tensor, pos_enc: Tensor, mask: Tensor, left_context: int = 0) → Tuple[Tensor, Tensor]

Encode chunk of input sequence.

This method processes a chunk of input sequences through the Conv1d module, allowing for the incorporation of previous context via caching. It is particularly useful for streaming applications where only a portion of the input is available at a time.

Parameters:
- x – Conv1d input sequences. Shape (B, T, D_in) where B is the batch size, T is the sequence length, and D_in is the input dimension.
- pos_enc – Positional embedding sequences. Shape (B, 2 * (T - 1), D_in).
- mask – Source mask. Shape (B, T).
- left_context – Number of previous frames the attention module can see in current chunk (not used here).
Returns:
- x – Conv1d output sequences. Shape (B, T, D_out), where D_out is the output dimension.
- pos_enc – Positional embedding sequences. Shape (B, 2 * (T - 1), D_out).
Return type: Tuple[Tensor, Tensor]

############### Examples

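>>> # Streaming uses a causal layer whose cache has been initialized first.
>>> conv1d_layer = Conv1d(input_size=64, output_size=128,
...                        kernel_size=3, causal=True)
>>> conv1d_layer.reset_streaming_cache(left_context=0, device=torch.device('cpu'))
>>> # The cache holds a single sequence, so the chunk uses B = 1.
>>> input_tensor = torch.randn(1, 10, 64)  # (B, T, D_in)
>>> pos_embedding = torch.randn(1, 18, 64)  # (B, 2*(T-1), D_in)
>>> mask_tensor = torch.ones(1, 10)          # (B, T)
>>> output, new_pos_enc = conv1d_layer.chunk_forward(
...     input_tensor, pos_embedding, mask_tensor
... )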

########## NOTE The left_context parameter is included for compatibility with streaming applications but is not utilized in this implementation.
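
The caching described above can be illustrated without ESPnet. The following is a minimal sketch built on a plain torch.nn.Conv1d, assuming a causal convolution with dilation 1 whose cache stores the last kernel_size - 1 input frames. The helper name stream_chunks is hypothetical and not part of ESPnet's API; it only shows that chunk-by-chunk processing with a carried cache can reproduce the result of processing the whole left-padded sequence at once.

```python
import torch

torch.manual_seed(0)

kernel_size = 3
conv = torch.nn.Conv1d(4, 8, kernel_size)  # plain PyTorch layer, no padding


def stream_chunks(x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Run `conv` chunk by chunk, carrying the last kernel_size - 1 frames.

    x has shape (1, D_in, T). Hypothetical helper, for illustration only.
    """
    # A zero cache plays the role of causal left padding for the first chunk.
    cache = torch.zeros(x.size(0), x.size(1), kernel_size - 1)
    outputs = []
    for start in range(0, x.size(2), chunk_size):
        chunk = torch.cat([cache, x[:, :, start:start + chunk_size]], dim=2)
        cache = chunk[:, :, -(kernel_size - 1):]  # context for the next chunk
        outputs.append(conv(chunk))
    return torch.cat(outputs, dim=2)


x = torch.randn(1, 4, 10)  # (B, D_in, T)
with torch.no_grad():
    # Reference: the whole sequence at once, left-padded with zeros.
    full = conv(torch.nn.functional.pad(x, (kernel_size - 1, 0)))
    chunked = stream_chunks(x, chunk_size=5)

print(full.shape, torch.allclose(full, chunked, atol=1e-5))
```

Each chunk of 5 frames yields 5 output frames, so the sequence length is preserved across the stream.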

create_new_mask(mask: Tensor) → Tensor

Create new mask for output sequences.

This method generates a new mask based on the input mask, which is adjusted according to the padding and stride properties of the convolutional layer. The output mask will reflect the dimensions of the output sequences after convolution.

Parameters:
- mask – Mask of input sequences. Shape: (B, T), where B is the batch size and T is the length of the input sequence.
Returns:
- mask – Mask of output sequences. Shape: (B, sub(T)), where sub(T) is the length of the output sequence after applying the convolution and stride.
Return type: Tensor

############### Examples

>>> conv_layer = Conv1d(input_size=16, output_size=32, kernel_size=3)
>>> input_mask = torch.tensor([[1, 1, 1, 0, 0],
...                             [1, 1, 1, 1, 0]])
>>> output_mask = conv_layer.create_new_mask(input_mask)
>>> # T = 5 shrinks to sub(T) = 3 for kernel_size=3 and stride=1.
>>> output_mask.shape
torch.Size([2, 3])

########## NOTE The method assumes that the padding has already been set during the initialization of the Conv1d class.
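
As a rough illustration of the adjustment described above, the sketch below trims and subsamples a mask with plain tensor slicing. It assumes the mask loses dilation * (kernel_size - 1) trailing frames and is then subsampled by the stride, which matches the output length of an unpadded convolution. The helper shrink_mask is hypothetical and may differ from ESPnet's actual implementation.

```python
import torch


def shrink_mask(mask: torch.Tensor, kernel_size: int,
                stride: int = 1, dilation: int = 1) -> torch.Tensor:
    """Hypothetical helper: shorten a (B, T) mask to (B, sub(T))."""
    padding = dilation * (kernel_size - 1)  # frames lost by an unpadded conv
    if padding != 0:
        mask = mask[:, :-padding]
    return mask[:, ::stride]  # keep every `stride`-th frame


mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 0]])
new_mask = shrink_mask(mask, kernel_size=3)
print(new_mask.shape)  # torch.Size([2, 3])

# The shortened mask has the same length as an unpadded convolution's output.
conv = torch.nn.Conv1d(16, 32, kernel_size=3)
out = conv(torch.randn(2, 16, 5))  # (B, D_out, sub(T))
print(out.size(2) == new_mask.size(1))  # True
```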

create_new_pos_enc(pos_enc: Tensor) → Tensor

Create new positional embedding vector.

This method generates a new positional embedding based on the input sequences’ positional embeddings. It handles padding and applies the stride to the embeddings to create an output suitable for the convolutional operation.

Parameters:
- pos_enc – Input sequences positional embedding. Shape: (B, 2 * (T - 1), D_in)
Returns:
- pos_enc – Output sequences positional embedding. Shape: (B, 2 * (sub(T) - 1), D_in)
Return type: Tensor

############### Examples

>>> import torch
>>> pos_enc = torch.randn(4, 10, 16)  # (B, 2 * (T - 1), D_in) with T = 6
>>> conv1d_layer = Conv1d(input_size=16, output_size=32, kernel_size=3)
>>> new_pos_enc = conv1d_layer.create_new_pos_enc(pos_enc)
>>> # The output length depends on padding and stride; here sub(T) = 4.
>>> new_pos_enc.shape
torch.Size([4, 6, 16])

########## NOTE The method considers the input’s padding and applies the stride to ensure the output positional embeddings align with the output sequences generated by the convolutional layer.
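
The shape change can be made concrete with a small sketch. It assumes the embedding is stored as a positive half followed by a negative half that overlap at the middle index, and that each half is trimmed and subsampled the same way as the sequence before the halves are rejoined. The helper shrink_pos_enc is hypothetical; it illustrates the shape arithmetic and is not guaranteed to match ESPnet's code.

```python
import torch


def shrink_pos_enc(pos_enc: torch.Tensor, kernel_size: int,
                   stride: int = 1, dilation: int = 1) -> torch.Tensor:
    """Hypothetical helper: shorten a relative positional embedding."""
    padding = dilation * (kernel_size - 1)
    half = pos_enc.size(1) // 2
    positive = pos_enc[:, :half + 1, :]  # first half, including the middle
    negative = pos_enc[:, half:, :]      # second half, starting at the middle
    if padding != 0:
        positive = positive[:, :-padding, :]
        negative = negative[:, :-padding, :]
    positive = positive[:, ::stride, :]
    negative = negative[:, ::stride, :]
    # Drop the duplicated middle entry when rejoining the halves.
    return torch.cat([positive, negative[:, 1:, :]], dim=1)


pos_enc = torch.randn(4, 10, 16)  # (B, 2 * (T - 1), D_in) with T = 6
new_pos_enc = shrink_pos_enc(pos_enc, kernel_size=3)
print(new_pos_enc.shape)  # torch.Size([4, 6, 16])
```

With T = 6 and kernel_size=3, sub(T) = 4, so the output length 2 * (sub(T) - 1) = 6 agrees with the documented shape.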

forward(x: Tensor, pos_enc: Tensor, mask: Tensor | None = None, chunk_mask: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor]

Encode input sequences.

This method applies a 1D convolution to the input tensor x, followed by optional batch normalization, dropout, and ReLU activation. It processes the input sequences in a manner that can accommodate causal convolution if specified.

Parameters:
- x – Conv1d input sequences of shape (B, T, D_in), where B is the batch size, T is the sequence length, and D_in is the input dimension.
- pos_enc – Positional embedding sequences of shape (B, 2 * (T - 1), D_in).
- mask – Optional source mask of shape (B, T) that indicates which elements of x should be attended to.
- chunk_mask – Optional chunk mask of shape (T_2, T_2) for chunk-based processing (not used in this method).
Returns:
- x – Conv1d output sequences of shape (B, sub(T), D_out), where D_out is the output dimension.
- mask – Updated source mask of shape (B, T) or (B, sub(T)), depending on whether padding was applied.
- pos_enc – Updated positional embedding sequences, with shape (B, 2 * (T - 1), D_att) or (B, 2 * (sub(T) - 1), D_out), depending on the output dimension.
Return type: Tuple[Tensor, Tensor, Tensor]

############### Examples

>>> conv1d_layer = Conv1d(input_size=16, output_size=32, kernel_size=3)
>>> x = torch.randn(8, 10, 16)  # Batch of 8, sequence length of 10
>>> pos_enc = torch.randn(8, 18, 16)  # Positional encodings
>>> mask = torch.ones(8, 10)  # Full attention
>>> output, updated_mask, updated_pos_enc = conv1d_layer.forward(x, pos_enc, mask)

########## NOTE The method supports both causal and non-causal convolutions. If causal is set to True, the input x is padded on the left so that each output frame depends only on the current and preceding input frames.

Raises: ValueError – If the input tensor x or positional embeddings pos_enc have incompatible dimensions.
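
To make sub(T) and the causal case concrete, here is a minimal sketch with a plain torch.nn.Conv1d rather than the ESPnet block itself. It assumes dilation 1 and that causal mode left-pads the input with kernel_size - 1 zero frames, so the output keeps length T and no output frame depends on future input.

```python
import torch

torch.manual_seed(0)

kernel_size, T = 3, 10
conv = torch.nn.Conv1d(16, 32, kernel_size)
x = torch.randn(8, T, 16).transpose(1, 2)  # (B, T, D_in) -> (B, D_in, T)

with torch.no_grad():
    # Non-causal: no padding, so the sequence shrinks to sub(T) = T - 2.
    non_causal = conv(x)
    # Causal (assumed): left-pad with kernel_size - 1 zeros, length stays T.
    causal = conv(torch.nn.functional.pad(x, (kernel_size - 1, 0)))

    # Perturb frames 6 and later, then recompute the causal output.
    x_future = x.clone()
    x_future[:, :, 6:] += 1.0
    causal_future = conv(
        torch.nn.functional.pad(x_future, (kernel_size - 1, 0))
    )

print(non_causal.shape)  # torch.Size([8, 32, 8])
print(causal.shape)      # torch.Size([8, 32, 10])
# Outputs before frame 6 are unaffected by the change to later frames.
print(torch.allclose(causal[:, :, :6], causal_future[:, :, :6]))
```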

reset_streaming_cache(left_context: int, device: device) → None

Initialize/Reset Conv1d cache for streaming.

This method initializes or resets the cache used for streaming in the Conv1d module. The cache holds previous frames, allowing for efficient processing of sequential data in a streaming fashion.

Parameters:
- left_context – Number of previous frames the attention module can see in current chunk (not used here).
- device – Device to use for cache tensor, which allows for computation on the specified hardware (e.g., CPU or GPU).

############### Examples

>>> conv1d = Conv1d(input_size=64, output_size=128, kernel_size=3)
>>> conv1d.reset_streaming_cache(left_context=1, device=torch.device('cpu'))
>>> print(conv1d.cache.shape)
torch.Size([1, 64, 2])  # Example shape based on kernel_size=3

########## NOTE The cache is initialized with zeros. In the example above its shape is (1, input_size, kernel_size - 1), so it holds the most recent kernel_size - 1 input frames of a single sequence. It is essential for maintaining context in streaming scenarios.