espnet2.asr.encoder.avhubert_encoder.BasicBlock

class espnet2.asr.encoder.avhubert_encoder.BasicBlock(inplanes, planes, stride=1, downsample=None, relu_type='relu')

Bases: Module

Basic building block for ResNet architecture.

This class implements a basic block used in the ResNet architecture, which consists of two convolutional layers with batch normalization and ReLU or PReLU activation functions. The block supports optional downsampling.

expansion

Expansion factor for the block.

Type: int

conv1

First convolutional layer.

Type: nn.Conv2d

bn1

Batch normalization after the first convolution.

Type: nn.BatchNorm2d

relu1

Activation function after the first convolution.

Type: nn.Module

conv2

Second convolutional layer.

Type: nn.Conv2d

bn2

Batch normalization after the second convolution.

Type: nn.BatchNorm2d

downsample

Downsampling layer.

Type: nn.Sequential, optional

stride

Stride value for the first convolution.

Type: int
Parameters:
- inplanes (int) – Number of input channels.
- planes (int) – Number of output channels.
- stride (int, optional) – Stride for the first convolution. Default is 1.
- downsample (nn.Sequential, optional) – Downsampling layer. Default is None.
- relu_type (str, optional) – Type of ReLU activation function. Can be “relu” or “prelu”. Default is “relu”.
Raises: Exception – If an unsupported relu_type is provided.

####### Examples

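>>> import torch
>>> from torch import nn
>>> downsample = nn.Sequential(  # match the residual to the main path's shape
...     nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
...     nn.BatchNorm2d(128),
... )
>>> block = BasicBlock(inplanes=64, planes=128, stride=2, downsample=downsample)
>>> x = torch.randn(1, 64, 32, 32)  # Example input
>>> block(x).shape  # Spatial size is halved by stride=2
torch.Size([1, 128, 16, 16])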
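
The output shape in the example above can be checked by hand with the standard convolution size formula. The sketch below assumes both convolutions use a 3x3 kernel with padding 1 and that only the first one applies the block's stride; these values are not stated on this page. The helper names `conv_out_size` and `basic_block_out_shape` are hypothetical and not part of ESPnet.

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial size after one convolution, using the usual floor formula."""
    return (size + 2 * padding - kernel) // stride + 1


def basic_block_out_shape(batch: int, planes: int, height: int, width: int, stride: int = 1):
    """Expected output shape of the block (assumed 3x3 convs, padding 1)."""
    # The first convolution applies the block's stride; the second is assumed to use stride 1.
    h = conv_out_size(conv_out_size(height, stride=stride))
    w = conv_out_size(conv_out_size(width, stride=stride))
    return (batch, planes, h, w)


print(basic_block_out_shape(1, 128, 32, 32, stride=2))  # (1, 128, 16, 16)
```

With stride=1 the spatial size is unchanged under these assumptions, so only the channel count can differ between input and output.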

Initialize internal Module state, shared by both nn.Module and ScriptModule.

expansion = 1
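
In ResNet-style networks, expansion is commonly used when deciding whether a block needs a downsample layer: the residual path must be reshaped whenever the stride or the channel count changes. The sketch below illustrates that common convention; it is an assumption about how such blocks are typically stacked, and `needs_downsample` is a hypothetical helper, not an ESPnet function.

```python
def needs_downsample(inplanes: int, planes: int, stride: int = 1, expansion: int = 1) -> bool:
    """Return True when the identity shortcut would not match the block output."""
    # A stride other than 1 changes the spatial size; a channel mismatch changes the depth.
    return stride != 1 or inplanes != planes * expansion


for inplanes, planes, stride in [(64, 64, 1), (64, 128, 2), (64, 64, 2)]:
    print(inplanes, planes, stride, needs_downsample(inplanes, planes, stride))
```

When the helper returns True, a downsample module that maps the input to the output shape would be passed to the block's constructor.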

forward(x)

Forward pass through the block.

The input passes through conv1, bn1, and relu1, then through conv2 and bn2. The result is added to the residual, which is the input itself or, when downsample is set, downsample(x). An activation is then applied to the sum.

Parameters:
- x (torch.Tensor) – Input feature map of shape (B, inplanes, H, W).
Returns: Output feature map with planes channels. The spatial size is reduced according to stride.
Return type: torch.Tensor

####### Examples

>>> block = BasicBlock(inplanes=64, planes=64)
>>> x = torch.randn(4, 64, 32, 32)
>>> block(x).shape
torch.Size([4, 64, 32, 32])