espnet2.asr.encoder.avhubert_encoder.ResNet
class espnet2.asr.encoder.avhubert_encoder.ResNet(block, layers, num_classes=1000, relu_type='relu', gamma_zero=False, avg_pool_downsample=False)
Bases: Module
ResNet trunk used as the visual feature extractor in the AVHubert encoder.
This class implements a ResNet, a deep convolutional neural network that uses residual (skip) connections to ease the training of very deep models: each block learns a residual mapping with reference to its input rather than an unreferenced function. The variant here consists of four residual stages followed by adaptive average pooling.
layer1
The first residual stage.
- Type: nn.Sequential
layer2
The second residual stage.
- Type: nn.Sequential
layer3
The third residual stage.
- Type: nn.Sequential
layer4
The fourth residual stage.
- Type: nn.Sequential
avgpool
Adaptive average pooling layer applied after the last stage.
- Type: nn.AdaptiveAvgPool2d
Parameters:
- block (nn.Module) – The residual block class (e.g., BasicBlock) used to build each stage.
- layers (list) – The number of blocks in each of the four stages.
- num_classes (int, optional) – Number of output classes. Defaults to 1000.
- relu_type (str, optional) – Type of ReLU activation function. Options are ‘relu’ or ‘prelu’. Defaults to ‘relu’.
- gamma_zero (bool, optional) – If True, initializes the weights of each block’s second batch-normalization layer to zero, so every residual branch starts as an identity mapping. Defaults to False.
- avg_pool_downsample (bool, optional) – If True, uses average pooling for downsampling in the shortcut connections. Defaults to False.
####### Examples
>>> model = ResNet(BasicBlock, [2, 2, 2, 2])
>>> x = torch.randn(1, 64, 56, 56)  # 64-channel feature maps from a stem
>>> output = model(x)
>>> print(output.shape)  # torch.Size([1, 512])
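For reference, a sketch of a non-default configuration exercising the relu_type and gamma_zero options described above; BasicBlock is assumed to be the companion block class defined alongside this ResNet:
>>> model = ResNet(
...     BasicBlock, [2, 2, 2, 2],
...     relu_type='prelu',   # PReLU activations instead of ReLU
...     gamma_zero=True,     # residual branches start as identity mappings
... )
>>> model(torch.randn(2, 64, 28, 28)).shape
torch.Size([2, 512])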
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Forward pass through the ResNet trunk.
This method passes the input feature map through the four residual stages, applies adaptive average pooling, and flattens the result into one feature vector per example.
- Parameters: x (torch.Tensor) – Input feature map of shape (B, 64, H, W), as produced by the convolutional stem of the surrounding visual frontend.
- Returns: Pooled and flattened features of shape (B, 512 * block.expansion), i.e. (B, 512) when built from BasicBlock.
- Return type: torch.Tensor
####### Examples
>>> model = ResNet(BasicBlock, [2, 2, 2, 2])
>>> x = torch.randn(8, 64, 22, 22)  # stem output for a batch of 8 frames
>>> feats = model(x)
>>> feats.shape
torch.Size([8, 512])
NOTE
- Unlike the standard torchvision ResNet, this variant has no initial convolution/max-pooling stem; it expects 64-channel feature maps produced by the surrounding frontend.
- Ensure that the input tensor shapes are compatible with the model configuration.
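To make the input contract concrete, here is a minimal sketch pairing the trunk with a hypothetical stem (the stem below is illustrative only, not the actual frontend code in this module):
>>> import torch
>>> import torch.nn as nn
>>> stem = nn.Sequential(  # hypothetical stem producing 64-channel maps
...     nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
...     nn.BatchNorm2d(64),
...     nn.ReLU(inplace=True),
...     nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
... )
>>> trunk = ResNet(BasicBlock, [2, 2, 2, 2])
>>> frames = torch.randn(8, 1, 88, 88)  # e.g. grayscale mouth-region crops
>>> trunk(stem(frames)).shape
torch.Size([8, 512])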