espnet2.asr.encoder.avhubert_encoder.AVHubertModel
class espnet2.asr.encoder.avhubert_encoder.AVHubertModel(cfg: AVHubertConfig, **kwargs)
Bases: Module
AVHubert model for audio-visual representation learning.
This model is based on the AVHubert architecture and is designed for processing both audio and video modalities. It leverages a transformer-based encoder to extract features from input audio and video data, which can then be used for various downstream tasks.
feature_extractor_audio
A sub-model for extracting audio features.
feature_extractor_video
A sub-model for extracting video features.
modality_fuse
Method for fusing audio and video features (‘concat’ or ‘add’).
encoder
Transformer encoder used for feature processing.
layer_norm
Layer normalization applied to the fused features.
post_extract_proj
Optional projection layer after feature extraction.
audio_only
Boolean indicating if only audio should be processed.
- Parameters:
- cfg (AVHubertConfig) – Configuration object containing model parameters.
- **kwargs – Additional keyword arguments for model initialization.
#### Examples

>>> # Create a configuration object
>>> cfg = AVHubertConfig()
>>> # Build the AVHubert model
>>> model = AVHubertModel.build_model(cfg)
>>> # Forward pass with dummy audio and video inputs
>>> audio_input = torch.randn(2, 80, 10)           # (B, F, L)
>>> video_input = torch.randn(2, 1, 10, 224, 224)  # (B, 1, L, H, W)
>>> features = model.extract_finetune({'audio': audio_input,
...                                     'video': video_input})

NOTE: Ensure that the FairSeq library is properly installed to utilize the functionalities of this model.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
classmethod build_model(cfg: AVHubertConfig)
Build a new AVHubert model instance.
This method initializes and returns a new instance of the AVHubert model using the specified configuration parameters.
- Parameters:
- cls – The class of the model to be instantiated.
- cfg (AVHubertConfig) – Configuration object containing model parameters.
- Returns: An instance of the AVHubert model initialized with the given configuration.
- Return type:AVHubertModel
#### Examples
>>> config = AVHubertConfig()
>>> model_instance = AVHubertModel.build_model(config)
>>> print(type(model_instance))
<class 'espnet2.asr.encoder.avhubert_encoder.AVHubertModel'>
extract_finetune(source, padding_mask=None, mask=False, ret_conv=False, output_layer=None)
Forward AVHubert Pretrain Encoder.
This method processes audio and video inputs, applies modality fusion, and passes the features through the encoder. The function can handle both modalities, with the option to fine-tune the model.
- Parameters:
- source (dict) – A dictionary containing the input tensors.
- source[‘video’]: input tensor of shape (B, 1, L, H, W)
- source[‘audio’]: input tensor of shape (B, F, L)
- padding_mask (torch.Tensor , optional) – A tensor of shape (B, L) indicating which elements are padding. Defaults to None.
- mask (bool , optional) – If True, applies masking to the input. Defaults to False.
- ret_conv (bool , optional) – If True, returns convolutional features. Defaults to False.
- output_layer (int , optional) – Specifies which layer’s output to return. Defaults to None, meaning all layers.
- Returns: A tuple containing:
  - encoded tensor of shape (B, T, D)
  - padding mask of shape (B, T)
- Return type: tuple
- Raises:ValueError – If both audio and video sources are None.
#### Examples
>>> source = {
... 'video': torch.randn(4, 1, 100, 224, 224),
... 'audio': torch.randn(4, 80, 100)
... }
>>> padding_mask = torch.zeros(4, 100, dtype=torch.bool)
>>> encoded, mask = model.extract_finetune(source, padding_mask)
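Only one of the two modalities has to be present (a ValueError is raised only when both are None), so a single-modality call is possible. The snippet below is a minimal sketch that assumes the absent modality may be passed as None:

>>> audio_only = {'audio': torch.randn(4, 80, 100), 'video': None}  # hypothetical audio-only input
>>> encoded, mask = model.extract_finetune(audio_only, padding_mask)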
forward_audio(source_audio)
Forward pass for audio input through the AVHubert model.
This method processes the audio input tensor and extracts features using the audio feature extractor. The features are computed without tracking gradients to reduce memory usage during inference.
- Parameters:source_audio (torch.Tensor) – Input tensor containing audio data of shape (B, F, T), where B is the batch size, F is the number of features, and T is the sequence length.
- Returns: Extracted audio features of shape (B, D, T), where D is the encoder embedding dimension.
- Return type: torch.Tensor
#### Examples
>>> model = AVHubertModel(cfg)
>>> audio_input = torch.randn(8, 512, 100) # Batch of 8, 512 features, 100 time steps
>>> audio_features = model.forward_audio(audio_input)
>>> print(audio_features.shape)
torch.Size([8, 768, 100]) # Assuming encoder_embed_dim is 768
NOTE: This method is primarily intended for use during inference and should not be used during training as it does not track gradients.
forward_features(source: Tensor, modality: str) → Tensor
Extract features from the input source tensor using the specified modality.
This method utilizes the appropriate feature extractor (either audio or video) based on the provided modality string. If feature_grad_mult is greater than zero, it applies a gradient scaling factor during backpropagation.
- Parameters:
- source (torch.Tensor) – Input tensor containing audio or video data. The shape of the tensor should be compatible with the feature extractor corresponding to the specified modality.
- modality (str) – A string that specifies the modality type. It should be either “audio” or “video”.
- Returns: The extracted features from the input source tensor.
- Return type: torch.Tensor
#### Examples
>>> model = AVHubertModel(cfg)
>>> audio_input = torch.randn(8, 1, 16000) # Example audio input
>>> video_input = torch.randn(8, 1, 5, 224, 224) # Example video input
>>> audio_features = model.forward_features(audio_input, "audio")
>>> video_features = model.forward_features(video_input, "video")
NOTE: Ensure that the modality parameter matches the type of data in the source tensor to avoid runtime errors.
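The gradient scaling controlled by feature_grad_mult is typically realized with an autograd function that is the identity in the forward pass and multiplies the incoming gradient in the backward pass. The snippet below is a generic sketch of that technique; names and usage are illustrative, not the exact ESPnet/FairSeq code.

>>> import torch
>>> class GradMultiply(torch.autograd.Function):
...     # Identity in the forward pass; scales the gradient in the backward pass.
...     @staticmethod
...     def forward(ctx, x, scale):
...         ctx.scale = scale
...         return x.clone()
...     @staticmethod
...     def backward(ctx, grad_output):
...         # Only `x` receives a gradient; the `scale` argument gets None.
...         return grad_output * ctx.scale, None
>>> raw = torch.randn(8, 768, 100, requires_grad=True)
>>> features = GradMultiply.apply(raw, 0.1)  # same values, gradients scaled by 0.1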
forward_padding_mask(features: Tensor, padding_mask: Tensor) → Tensor
Adjusts the padding mask to match the feature dimensions.
This method takes the input feature tensor and its associated padding mask, ensuring that the mask dimensions align with the features. If the padding mask is longer than the features, the extra elements are removed. The final mask is reshaped to allow for masking across the appropriate dimensions.
- Parameters:
- features (torch.Tensor) – The input feature tensor with shape (B, T, D) where B is the batch size, T is the sequence length, and D is the feature dimension.
- padding_mask (torch.Tensor) – The original padding mask with shape (B, L) where L is the length of the original sequence.
- Returns: A boolean tensor indicating the positions that should be masked, with shape (B, T).
- Return type: torch.Tensor
#### Examples
>>> features = torch.randn(4, 10, 64) # Example features
>>> padding_mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
... [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
... [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
... [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])
>>> mask = model.forward_padding_mask(features, padding_mask)
>>> mask.shape
torch.Size([4, 10]) # Output mask shape should match features
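The trimming and reshaping described above can be sketched as follows; the helper name is hypothetical and the exact reshaping in the ESPnet implementation may differ:

>>> import torch
>>> def align_padding_mask(features, padding_mask):
...     # Drop trailing mask positions beyond a whole multiple of the feature
...     # length, then collapse each stride of original frames so the mask
...     # matches the (B, T) feature layout.
...     extra = padding_mask.size(1) % features.size(1)
...     if extra > 0:
...         padding_mask = padding_mask[:, :-extra]
...     padding_mask = padding_mask.view(padding_mask.size(0), features.size(1), -1)
...     return padding_mask.all(-1)
>>> align_padding_mask(torch.randn(4, 10, 64),
...                    torch.zeros(4, 23, dtype=torch.bool)).shape
torch.Size([4, 10])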
forward_transformer(source, padding_mask=None, output_layer=None)
Forward AVHubert Pretrain Encoder (without frontend).
This method processes the input tensor using the transformer encoder to generate encoded features. The input tensor is expected to be a fused feature tensor, combining both audio and video modalities.
- Parameters:
  - source – A tensor of shape (B, L, D*2), where B is the batch size, L is the sequence length, and D is the embedding dimension.
  - padding_mask – A tensor of shape (B, L) indicating padded elements in the input. Elements to be masked should have a value of True, while valid elements should have a value of False.
  - output_layer – Optional integer specifying which layer’s output to return. If None, the output from the last layer is returned.
- Returns: A tuple containing:
  - the encoded tensor of shape (B, L, D) after processing through the transformer encoder
  - the updated padding mask of shape (B, L)
- Return type: tuple
#### Examples
>>> model = AVHubertModel(cfg)
>>> input_tensor = torch.rand(4, 10, 768) # Example input
>>> padding_mask = torch.zeros(4, 10, dtype=torch.bool)
>>> encoded_output, updated_mask = model.forward_transformer(input_tensor, padding_mask)
NOTE: This function assumes that the input has already undergone necessary preprocessing steps, including modality fusion.
- Raises:ValueError – If the source tensor or padding_mask has an invalid shape.
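Continuing the example above, an intermediate layer’s output can be requested through output_layer (the layer index shown is arbitrary):

>>> hidden, updated_mask = model.forward_transformer(input_tensor, padding_mask, output_layer=6)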
forward_video(source_video)
Forward pass for the video feature extractor.
This method processes the input video tensor and extracts features using the underlying video feature extractor model. The output features are computed without gradient tracking, which is beneficial for inference scenarios.
- Parameters:source_video (torch.Tensor) – Input video tensor of shape (B, 1, L, H, W), where B is the batch size, L is the sequence length, H is the height, and W is the width of the video frames.
- Returns: Extracted video features of shape (B, F, T), where F is the number of feature dimensions and T is the length of the output sequence.
- Return type: torch.Tensor
#### Examples
>>> model = AVHubertModel(cfg)
>>> video_input = torch.randn(8, 1, 10, 224, 224) # Batch of 8 videos
>>> video_features = model.forward_video(video_input)
>>> print(video_features.shape) # Should print: torch.Size([8, F, T])
NOTE: This method is intended for use in inference mode. During training, the video features are typically extracted in a manner that allows for backpropagation.
modality_fusion(features_audio, features_video)
Fuse audio and video features using the specified fusion method.
This method combines audio and video features based on the configured fusion technique, which can be either concatenation or addition. It handles cases where one of the modalities may be absent by providing zero tensors of the appropriate shape.
- Parameters:
- features_audio (torch.Tensor) – The audio features tensor with shape (B, D, L), where B is the batch size, D is the feature dimension, and L is the length of the sequence.
- features_video (torch.Tensor) – The video features tensor with shape (B, D, L), where B is the batch size, D is the feature dimension, and L is the length of the sequence.
- Returns: The fused features tensor, which will have shape determined by the fusion method:
- If concatenation is used, the shape will be (B, 2D, L).
- If addition is used, the shape will be (B, D, L).
- Return type: torch.Tensor
- Raises:ValueError – If an unknown fusion method is specified.
#### Examples
>>> audio_features = torch.randn(32, 256, 10) # 32 samples, 256 features, 10 time steps
>>> video_features = torch.randn(32, 256, 10) # 32 samples, 256 features, 10 time steps
>>> fused_features = model.modality_fusion(audio_features, video_features)
>>> print(fused_features.shape) # If concatenation is used, should output: torch.Size([32, 512, 10])
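The fusion behaviour described above can be sketched as follows; the helper name and the zero-filling for an absent modality are illustrative, not the exact ESPnet implementation:

>>> import torch
>>> def fuse(features_audio, features_video, modality_fuse='concat'):
...     # Substitute zeros when one modality is absent so shapes stay consistent.
...     if features_audio is None and features_video is None:
...         raise ValueError("at least one modality must be provided")
...     if features_audio is None:
...         features_audio = torch.zeros_like(features_video)
...     if features_video is None:
...         features_video = torch.zeros_like(features_audio)
...     if modality_fuse == 'concat':
...         return torch.cat([features_audio, features_video], dim=1)  # (B, 2D, L)
...     elif modality_fuse == 'add':
...         return features_audio + features_video                     # (B, D, L)
...     raise ValueError(f"unknown fusion method: {modality_fuse}")
>>> fuse(torch.randn(32, 256, 10), torch.randn(32, 256, 10)).shape
torch.Size([32, 512, 10])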