espnet2.asr_transducer.encoder.modules.convolution.DepthwiseConvolution

class espnet2.asr_transducer.encoder.modules.convolution.DepthwiseConvolution(size: int, kernel_size: int, causal: bool = False)

Bases: Module

Depth-wise Convolution module definition.

This module performs depth-wise convolution, which applies a separate convolutional filter to each input channel. It is commonly used in lightweight neural networks to reduce the number of parameters and computations.
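The parameter saving can be made concrete with a small calculation. The sketch below is illustrative and not taken from ESPnet: `conv1d_param_count` is a hypothetical helper that counts the weights and biases of a 1D convolution from its weight shape `(out_channels, in_channels // groups, kernel_size)`. A depth-wise convolution corresponds to `groups == channels`.

```python
def conv1d_param_count(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Count learnable parameters of a 1D convolution (hypothetical helper).

    The weight tensor has shape (out_channels, in_channels // groups, kernel_size),
    and there is one bias term per output channel when bias is enabled.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = out_channels * (in_channels // groups) * kernel_size
    return weights + (out_channels if bias else 0)


channels, kernel_size = 128, 3

# Standard convolution: every output channel sees every input channel.
standard = conv1d_param_count(channels, channels, kernel_size)

# Depth-wise convolution: one filter per channel (groups == channels).
depthwise = conv1d_param_count(channels, channels, kernel_size, groups=channels)

print(standard, depthwise)  # 49280 512
```

For 128 channels and a kernel of size 3, the depth-wise layer needs roughly 1% of the parameters of the standard layer.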

Parameters:
- size – Initial size to determine the number of channels. The convolution operates on size + size (that is, 2 * size) channels, so D_hidden of the input is expected to equal 2 * size.
- kernel_size – Size of the convolving kernel.
- causal – Whether to use causal convolution (set to True if streaming).

kernel_size

The size of the convolutional kernel.

lorder

The number of past (left-context) frames required by the causal convolution. It is only relevant when causal is True.

conv

The 1D convolution layer that performs the depth-wise convolution.

Examples

>>> import torch
>>> from espnet2.asr_transducer.encoder.modules.convolution import DepthwiseConvolution
>>> depthwise_conv = DepthwiseConvolution(size=64, kernel_size=3)  # 2 * 64 = 128 channels
>>> input_tensor = torch.rand(32, 10, 128)  # (B, T, D_hidden)
>>> output, cache = depthwise_conv(input_tensor)
>>> output.shape
torch.Size([32, 10, 128])

NOTE

The input tensor x should have shape (B, T, D_hidden), where B is the batch size, T is the sequence length, and D_hidden is the number of features (channels).

Raises: ValueError – If kernel_size is not a positive odd integer.

Construct a DepthwiseConvolution object.

forward(x: Tensor, mask: Tensor | None = None, cache: Tensor | None = None) → Tuple[Tensor, Tensor]

Compute the depthwise convolution operation.

This method performs a depthwise convolution on the input tensor x, taking the optional mask and cache into account. The mask marks time steps that should be ignored so that they do not contribute to the output. The cache carries state across sequential calls, which is useful in causal (streaming) settings.
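The role of the cache can be illustrated with a minimal stand-in written directly against PyTorch. This is a sketch under stated assumptions, not the ESPnet implementation: `causal_step` is a hypothetical function, the cache here holds the last `kernel_size - 1` input frames in `(B, C, T)` layout, and the first call pads the past with zeros. It checks that processing a sequence in two chunks with a cache gives the same result as processing it in one call.

```python
import torch

torch.manual_seed(0)

channels, kernel_size = 8, 3
lorder = kernel_size - 1  # past frames a causal kernel needs (assumed definition)

# Depth-wise: groups == channels, so each channel has its own filter.
conv = torch.nn.Conv1d(channels, channels, kernel_size, padding=0, groups=channels)


def causal_step(x, cache=None):
    """Hypothetical causal step. x: (B, T, C) -> output (B, T, C), cache (B, C, lorder)."""
    x = x.transpose(1, 2)  # (B, C, T), the layout Conv1d expects
    if cache is None:
        # No history yet: pad the past with zeros.
        x = torch.nn.functional.pad(x, (lorder, 0))
    else:
        # Prepend the frames remembered from the previous call.
        x = torch.cat([cache, x], dim=2)
    new_cache = x[..., -lorder:]
    return conv(x).transpose(1, 2), new_cache


with torch.no_grad():
    full = torch.randn(1, 6, channels)

    # Reference: whole sequence in one call.
    ref, _ = causal_step(full)

    # Streaming: two chunks of three frames, passing the cache along.
    out1, cache = causal_step(full[:, :3])
    out2, cache = causal_step(full[:, 3:], cache)
    chunked = torch.cat([out1, out2], dim=1)

matches = torch.allclose(ref, chunked, atol=1e-5)
print(matches)
```

Because each output frame only depends on the current frame and the `lorder` frames before it, the chunked result should agree with the single-call result up to floating-point tolerance.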

Parameters:
- x – DepthwiseConvolution input sequences with shape (B, T, D_hidden), where B is the batch size, T is the sequence length, and D_hidden is the number of hidden units.
- mask – Optional source mask with shape (B, T_2) that indicates which elements should be ignored (masked).
- cache – Optional input cache with shape (1, conv_kernel, D_hidden) used to store previous state for causal convolutions.
Returns: A tuple containing:
- x – DepthwiseConvolution output sequences with shape (B, ?, D_hidden), where ? is the sequence length after the convolution operation.
- cache – DepthwiseConvolution output cache with shape (1, conv_kernel, D_hidden), which is updated based on the current input.
Return type: Tuple[torch.Tensor, torch.Tensor]
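The output length marked ? can be estimated with the standard 1D convolution length formula. The sketch below is an assumption about the padding scheme rather than a statement of the actual implementation: it supposes symmetric padding of `(kernel_size - 1) // 2` in the non-causal case and `kernel_size - 1` past frames prepended in the causal case. Both helper functions are hypothetical.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1, dilation=1):
    """Standard output-length formula for a 1D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def depthwise_out_len(length, kernel_size, causal=False):
    """Output length under the padding scheme assumed in this sketch."""
    if causal:
        # Assumed: kernel_size - 1 past frames (zeros or cache) are prepended,
        # and the convolution itself uses no padding.
        return conv1d_out_len(length + kernel_size - 1, kernel_size, padding=0)
    # Assumed: symmetric "same"-style padding for an odd kernel.
    return conv1d_out_len(length, kernel_size, padding=(kernel_size - 1) // 2)


print(depthwise_out_len(10, 3))               # 10
print(depthwise_out_len(10, 5))               # 10
print(depthwise_out_len(10, 3, causal=True))  # 10
```

Under these assumptions, an odd kernel with stride 1 leaves the sequence length unchanged in both modes.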

NOTE

The input tensor x is transposed from (B, T, D_hidden) to (B, D_hidden, T) to match the input layout expected by the convolution layer. If a mask is provided, it is applied to the input tensor so that masked time steps do not contribute to the convolution output.
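The transpose and masking steps can be sketched with plain PyTorch operations. This sketch assumes that the mask is boolean, that True marks a time step to ignore, and that ignored positions are set to zero before the convolution. These are assumptions made for illustration and are not confirmed by the text above.

```python
import torch

x = torch.ones(2, 4, 3)  # (B, T, D_hidden)

# Assumed convention: True marks a time step to ignore (e.g. padding).
mask = torch.tensor(
    [
        [False, False, False, True],
        [False, False, True, True],
    ]
)  # (B, T)

# Conv1d expects channels before time: (B, D_hidden, T).
x_t = x.transpose(1, 2)

# Broadcast the mask over the channel dimension and zero the ignored steps.
x_t = x_t.masked_fill(mask.unsqueeze(1), 0.0)

# Remaining mass per batch item: 3 channels times the number of kept steps.
kept = x_t.sum(dim=(1, 2))
print(kept.tolist())  # [9.0, 6.0]
```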

Examples

>>> model = DepthwiseConvolution(size=64, kernel_size=3)
>>> input_tensor = torch.randn(10, 5, 128)  # (B, T, D_hidden)
>>> mask = torch.zeros(10, 5, dtype=torch.bool)  # No positions masked
>>> output, cache = model(input_tensor, mask=mask)
>>> output.shape  # Sequence length is unchanged for non-causal, kernel_size=3
torch.Size([10, 5, 128])