espnet2.asr.encoder.avhubert_encoder.SubModel
class espnet2.asr.encoder.avhubert_encoder.SubModel(resnet=None, input_dim=None, cfg=None)
Bases: Module
SubModel for audio and video feature extraction in AVHubert.
This class implements a submodule of the AVHubert model that can process audio and video features. It uses an optional ResNet for video processing and a linear projection for both modalities.
resnet
ResNet module for video feature extraction.
- Type: nn.Module or None
proj
Linear layer for projecting input features to the encoder embedding dimension.
- Type: nn.Linear
encoder
Optional Transformer encoder that further processes the projected features when enabled in the configuration.
- Type: TransformerEncoder or None
Parameters:
- resnet (nn.Module or None) – A ResNet model for video feature extraction.
- input_dim (int) – The dimension of the input features.
- cfg (AVHubertConfig) – Configuration object containing model parameters.
####### Examples
>>> # Create a SubModel instance
>>> sub_model = SubModel(resnet=my_resnet, input_dim=256, cfg=my_cfg)
>>> # Forward pass through the model
>>> output = sub_model(input_tensor)
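For intuition, the following stand-alone sketch mirrors the projection step described above (it is not the SubModel implementation itself, and INPUT_DIM / EMBED_DIM are assumed placeholder values rather than fields of any real configuration):
>>> import torch
>>> import torch.nn as nn
>>> INPUT_DIM, EMBED_DIM = 104, 768            # assumed placeholder dimensions
>>> proj = nn.Linear(INPUT_DIM, EMBED_DIM)     # plays the role of SubModel.proj
>>> feats = torch.randn(2, 50, INPUT_DIM)      # (batch, frames, feature_dim)
>>> projected = proj(feats)                    # project onto the embedding dimension
>>> projected.shape
torch.Size([2, 50, 768])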
NOTE
The input tensor should have dimensions that match the expected input shape for the ResNet and the linear projection.
- Raises: ValueError – If the input tensor shape does not match the expected dimensions.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Forward AVHubert Encoder.
This method processes the input tensors for both audio and video modalities, applying necessary masking and encoding operations. It returns the encoded output along with the corresponding lengths and an optional mask.
Parameters:
- xs_pad (Dict[str, torch.Tensor]) – A dictionary containing input tensors.
- “video”: input tensor of shape (B, 1, L, H, W) for video.
- “audio”: input tensor of shape (B, D, L) for audio.
- ilens (torch.Tensor) – A tensor containing the input lengths of shape (B,).
- prev_states (torch.Tensor, optional) – Previous states; currently not used. Defaults to None.
Returns:
- Encoded output tensor of shape (B, T, D), where T is the sequence length after encoding.
- Lengths of the output sequences as a tensor of shape (B,).
- An optional mask tensor if applicable, otherwise None.
Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
Raises: ValueError – If neither “video” nor “audio” is present in xs_pad.
####### Examples
>>> encoder = FairseqAVHubertEncoder(...)
>>> xs_pad = {
... "video": torch.randn(2, 1, 50, 64, 64),
... "audio": torch.randn(2, 104, 50)
... }
>>> ilens = torch.tensor([50, 50])
>>> output, olens, mask = encoder.forward(xs_pad, ilens)
>>> print(output.shape) # Output: torch.Size([2, T, D])
>>> print(olens) # Output: tensor of lengths
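The Raises clause above implies that a single modality is sufficient. Assuming that holds for the loaded model, an audio-only call would follow the same pattern (a hedged sketch reusing the shapes and objects from the example above):
>>> xs_pad_audio = {"audio": torch.randn(2, 104, 50)}  # (B, D, L), no video stream
>>> output, olens, mask = encoder.forward(xs_pad_audio, ilens)
>>> print(output.shape)  # Output: torch.Size([2, T, D]), same contract as the bimodal call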