espnet2.asr.encoder.avhubert_encoder.SamePad
class espnet2.asr.encoder.avhubert_encoder.SamePad(kernel_size, causal=False)
Bases: Module
Applies same padding to the input tensor.
This class provides a way to ensure that the output tensor has the same spatial dimensions as the input tensor after applying a convolution operation with a specified kernel size. It can be configured for causal padding, which is commonly used in sequence-to-sequence models where future information should not be considered.
kernel_size
The size of the convolutional kernel.
- Type: int
remove
The number of elements to remove from the end of the output tensor, determined by the kernel size.
- Type: int
Parameters:
- kernel_size (int) – The size of the kernel for which the padding will be calculated.
- causal (bool) – If True, applies causal padding. Default is False.
####### Examples
>>> import torch
>>> same_pad = SamePad(kernel_size=3)
>>> input_tensor = torch.randn(1, 1, 10) # Example input
>>> output_tensor = same_pad(input_tensor)
>>> output_tensor.shape # kernel_size is odd, so nothing is removed
torch.Size([1, 1, 10])
>>> causal_pad = SamePad(kernel_size=3, causal=True)
>>> output_tensor_causal = causal_pad(input_tensor)
>>> output_tensor_causal.shape # causal mode trims kernel_size - 1 = 2 frames
torch.Size([1, 1, 8])
NOTE
If kernel_size is odd, no elements are removed in the non-causal case; if kernel_size is even, one element is removed from the end so the output length matches the input length after symmetric padding. In causal mode, kernel_size - 1 elements are removed from the end.
- Raises: ValueError – If the input tensor does not have the expected dimensions.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
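The trimming behavior described above can be sketched as follows. This is a minimal illustration mirroring the fairseq-style SamePad logic, not the exact ESPnet source; the class name SamePadSketch and the pairing with nn.Conv1d are for demonstration:

```python
import torch
from torch import nn


class SamePadSketch(nn.Module):
    """Minimal sketch of SamePad, mirroring the fairseq-style implementation."""

    def __init__(self, kernel_size, causal=False):
        super().__init__()
        if causal:
            # A causal conv pads kernel_size - 1 frames on the left,
            # so that many surplus frames are trimmed from the right.
            self.remove = kernel_size - 1
        else:
            # Symmetric padding of kernel_size // 2 adds one extra
            # frame when kernel_size is even; odd kernels need no trim.
            self.remove = 1 if kernel_size % 2 == 0 else 0

    def forward(self, x):
        if self.remove > 0:
            x = x[:, :, : -self.remove]
        return x


# Hypothetical pairing with a convolution: pad by kernel_size - 1,
# then trim the surplus so the output length equals the input length.
conv = nn.Conv1d(4, 4, kernel_size=3, padding=2)
pad = SamePadSketch(kernel_size=3, causal=True)
y = pad(conv(torch.randn(1, 4, 10)))
print(y.shape)  # torch.Size([1, 4, 10])
```

This is why the module removes frames rather than adding them: the convolution itself supplies the padding, and SamePad restores the original temporal length.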
forward(xs_pad, ilens, prev_states=None)
Forward pass for the AVHubert encoder.
This method processes the input tensors, applies necessary transformations, and returns the encoded representations along with their respective lengths.
- Parameters:
- xs_pad (Dict[str, torch.Tensor]) – A dictionary containing input tensors. It can have the following keys:
- ‘video’: input tensor of shape (B, 1, L, H, W)
- ‘audio’: input tensor of shape (B, D, L)
- ilens (torch.Tensor) – A tensor of shape (B,) representing the input lengths for each batch element.
- prev_states (torch.Tensor, optional) – Not used in the current implementation. Defaults to None.
- Returns:
- A tensor of shape (B, T, D) representing the encoded features.
- A tensor of shape (B,) containing the lengths of the output sequences.
- None, as there are no additional states returned.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises: ValueError – If neither ‘video’ nor ‘audio’ is present in xs_pad.
####### Examples
>>> encoder = FairseqAVHubertEncoder()
>>> xs_pad = {
... 'video': torch.randn(2, 1, 50, 64, 64),
... 'audio': torch.randn(2, 104, 50)
... }
>>> ilens = torch.tensor([50, 50])
>>> output, lengths, _ = encoder.forward(xs_pad, ilens)
>>> print(output.shape) # Output shape: (2, T, D)
>>> print(lengths) # Output lengths for each batch item
NOTE
Ensure that the input tensors are properly padded and have the correct dimensions.
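For context on the ilens argument: it encodes the number of valid (non-padded) frames per batch element, from which a boolean padding mask of the kind fairseq-style encoders typically consume can be derived. A minimal sketch, not tied to this encoder's internals:

```python
import torch

# Valid lengths for a batch of two sequences padded to L = 50 frames.
ilens = torch.tensor([50, 42])
L = 50

# True at padded positions: frame index >= valid length.
padding_mask = torch.arange(L).unsqueeze(0) >= ilens.unsqueeze(1)

print(padding_mask.shape)       # torch.Size([2, 50])
print(padding_mask[1, 41].item(), padding_mask[1, 42].item())  # False True
```

The first sequence is fully valid, so its row of the mask is all False; the second sequence has its last eight frames marked as padding.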