espnet2.asr.encoder.avhubert_encoder.ResNet
class espnet2.asr.encoder.avhubert_encoder.ResNet(block, layers, num_classes=1000, relu_type='relu', gamma_zero=False, avg_pool_downsample=False)
Bases: Module
ResNet trunk used as the visual feature extractor in the AVHubert encoder.
This class implements a ResNet, a deep convolutional neural network that uses residual (skip) connections to ease the training of very deep models: each block learns a residual mapping with reference to its input rather than an unreferenced function. The variant here consists of four residual stages followed by adaptive average pooling.
layer1
The first residual stage.
- Type: nn.Sequential
layer2
The second residual stage.
- Type: nn.Sequential
layer3
The third residual stage.
- Type: nn.Sequential
layer4
The fourth residual stage.
- Type: nn.Sequential
avgpool
Adaptive average pooling layer applied after the last stage.
- Type: nn.AdaptiveAvgPool2d
Parameters:
- block (nn.Module) – The residual block class (e.g., BasicBlock) used to build each stage.
- layers (list) – The number of blocks in each of the four stages.
- num_classes (int, optional) – Number of output classes. Defaults to 1000.
- relu_type (str, optional) – Type of ReLU activation function. Options are ‘relu’ or ‘prelu’. Defaults to ‘relu’.
- gamma_zero (bool, optional) – If True, initializes the weights of each block’s second batch-normalization layer to zero, so every residual branch starts as an identity mapping. Defaults to False.
- avg_pool_downsample (bool, optional) – If True, uses average pooling for downsampling in the shortcut connections. Defaults to False.
####### Examples
>>> model = ResNet(BasicBlock, [2, 2, 2, 2])
>>> x = torch.randn(1, 64, 56, 56)  # 64-channel feature maps from a stem
>>> output = model(x)
>>> print(output.shape)  # torch.Size([1, 512])
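For reference, a sketch of a non-default configuration exercising the relu_type and gamma_zero options described above; BasicBlock is assumed to be the companion block class defined alongside this ResNet:
>>> model = ResNet(
...     BasicBlock, [2, 2, 2, 2],
...     relu_type='prelu',   # PReLU activations instead of ReLU
...     gamma_zero=True,     # residual branches start as identity mappings
... )
>>> model(torch.randn(2, 64, 28, 28)).shape
torch.Size([2, 512])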
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Forward pass through the ResNet trunk.
This method passes the input feature map through the four residual stages, applies adaptive average pooling, and flattens the result into one feature vector per example.
- Parameters: x (torch.Tensor) – Input feature map of shape (B, 64, H, W), as produced by the convolutional stem of the surrounding visual frontend.
- Returns: Pooled and flattened features of shape (B, 512 * block.expansion), i.e. (B, 512) when built from BasicBlock.
- Return type: torch.Tensor
####### Examples
>>> model = ResNet(BasicBlock, [2, 2, 2, 2])
>>> x = torch.randn(8, 64, 22, 22)  # stem output for a batch of 8 frames
>>> feats = model(x)
>>> feats.shape
torch.Size([8, 512])
NOTE
- Unlike the standard torchvision ResNet, this variant has no initial convolution/max-pooling stem; it expects 64-channel feature maps produced by the surrounding frontend.
- Ensure that the input tensor shapes are compatible with the model configuration.
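To make the input contract concrete, here is a minimal sketch pairing the trunk with a hypothetical stem (the stem below is illustrative only, not the actual frontend code in this module):
>>> import torch
>>> import torch.nn as nn
>>> stem = nn.Sequential(  # hypothetical stem producing 64-channel maps
...     nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
...     nn.BatchNorm2d(64),
...     nn.ReLU(inplace=True),
...     nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
... )
>>> trunk = ResNet(BasicBlock, [2, 2, 2, 2])
>>> frames = torch.randn(8, 1, 88, 88)  # e.g. grayscale mouth-region crops
>>> trunk(stem(frames)).shape
torch.Size([8, 512])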