espnet2.asr.encoder.avhubert_encoder.BasicBlock
class espnet2.asr.encoder.avhubert_encoder.BasicBlock(inplanes, planes, stride=1, downsample=None, relu_type='relu')
Bases: Module
Basic building block for ResNet architecture.
This class implements a basic block used in the ResNet architecture, which consists of two convolutional layers with batch normalization and ReLU or PReLU activation functions. The block supports optional downsampling.
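The structure described above can be sketched in plain PyTorch as follows. This is an illustrative stand-in rather than the ESPnet source: the class name `SketchBasicBlock`, the 3x3 kernels with padding 1, the second activation after the residual addition, and the 1x1 projection used as `downsample` are all assumptions borrowed from the standard ResNet design.

```python
import torch
from torch import nn


class SketchBasicBlock(nn.Module):
    """Illustrative residual block; not the ESPnet implementation."""

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super().__init__()
        # 3x3 convolutions with padding 1 (assumed, as in the standard ResNet).
        self.conv1 = nn.Conv2d(inplanes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu2 = nn.ReLU()
        self.downsample = downsample

    def forward(self, x):
        # Project the shortcut only when a downsample module was supplied.
        residual = x if self.downsample is None else self.downsample(x)
        out = self.relu1(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu2(out + residual)


# Shortcut projection so the residual matches 128 channels at half resolution.
projection = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
block = SketchBasicBlock(64, 128, stride=2, downsample=projection)
out_shape = tuple(block(torch.randn(1, 64, 32, 32)).shape)
print(out_shape)
```

In this sketch, a block that changes the channel count or the stride needs a `downsample` module, because otherwise the shortcut and the convolution output have different shapes and cannot be added.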
expansion
Expansion factor for the block.
- Type: int
conv1
First convolutional layer.
- Type: nn.Conv2d
bn1
Batch normalization after the first convolution.
- Type: nn.BatchNorm2d
relu1
Activation function after the first convolution.
- Type: nn.Module
conv2
Second convolutional layer.
- Type: nn.Conv2d
bn2
Batch normalization after the second convolution.
- Type: nn.BatchNorm2d
downsample
Downsampling layer.
- Type: nn.Sequential, optional
stride
Stride value for the first convolution.
- Type: int
Parameters:
- inplanes (int) – Number of input channels.
- planes (int) – Number of output channels.
- stride (int, optional) – Stride for the first convolution. Default is 1.
- downsample (nn.Sequential, optional) – Downsampling layer. Default is None.
- relu_type (str, optional) – Type of activation function. Can be “relu” or “prelu”. Default is “relu”.
Raises: Exception – If an unsupported relu_type is provided.
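A minimal sketch of how a `relu_type` string could be mapped to an activation module is shown below. The helper `make_activation` is hypothetical and is not part of ESPnet. The per-channel `PReLU` slope and the use of `ValueError` (a subclass of `Exception`) are assumptions of this sketch.

```python
from torch import nn


def make_activation(relu_type, planes):
    """Hypothetical helper: map a relu_type string to an activation module."""
    if relu_type == "relu":
        return nn.ReLU()
    if relu_type == "prelu":
        # One learnable slope per channel is an assumption of this sketch.
        return nn.PReLU(num_parameters=planes)
    # Reject anything else up front instead of failing later in forward().
    raise ValueError(f"unsupported relu_type: {relu_type!r}")


relu = make_activation("relu", planes=128)
prelu = make_activation("prelu", planes=128)
```

Validating the string at construction time means a typo such as `"gelu"` fails when the block is built, not on the first forward pass.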
####### Examples
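>>> import torch
>>> from torch import nn
>>> # With stride=2 and 64 -> 128 channels, the shortcut needs a matching projection.
>>> downsample = nn.Sequential(
...     nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False), nn.BatchNorm2d(128))
>>> block = BasicBlock(64, 128, stride=2, downsample=downsample, relu_type='relu')
>>> x = torch.randn(1, 64, 32, 32)
>>> output = block(x)
>>> output.shape
torch.Size([1, 128, 16, 16])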
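The halved spatial size in the example above follows from the usual convolution output-size formula. The helper below is a hypothetical illustration that assumes a 3x3 kernel with padding 1 and dilation 1, as in the standard ResNet; the block's actual kernel settings are not stated in this page.

```python
def conv_out_size(size, kernel_size=3, stride=1, padding=1):
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Stride 2 halves a 32-pixel side; stride 1 leaves it unchanged.
halved = conv_out_size(32, stride=2)
unchanged = conv_out_size(32, stride=1)
print(halved, unchanged)
```

Under these assumptions, only the first convolution changes the resolution, since the second one runs with stride 1.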
expansion
forward(x)
Forward pass through the basic block.
The input passes through the first convolution, batch normalization, and activation, then through the second convolution and batch normalization. The residual (the input itself, or the output of downsample when one is set) is added to the result before the final activation.
- Parameters:
  - x (torch.Tensor) – Input tensor of shape (B, inplanes, H, W).
- Returns: Output tensor with planes channels. With stride 1, the spatial size matches the input; with a larger stride, it is reduced accordingly.
- Return type: torch.Tensor
####### Examples
>>> block = BasicBlock(inplanes=64, planes=64)
>>> x = torch.randn(1, 64, 32, 32)
>>> block(x).shape
torch.Size([1, 64, 32, 32])