espnet2.asr.encoder.avhubert_encoder.FairseqAVHubertEncoder
class espnet2.asr.encoder.avhubert_encoder.FairseqAVHubertEncoder(input_size: int = 1, avhubert_url: str = './', avhubert_dir_path: str = './', freeze_finetune_updates: int = 0, encoder_embed_dim: int = 1024, encoder_layerdrop: float = 0.05, dropout_input: float = 0.1, dropout_features: float = 0.1, dropout: float = 0.1, attention_dropout: float = 0.1, feature_grad_mult: float = 0.1, activation_dropout: float = 0.0, wav_input: bool = False, layer_norm_first: bool = True, audio_feat_dim: int = 104, encoder_layers: int = 24, encoder_ffn_embed_dim: int = 4096, encoder_attention_heads: int = 16, extracted: bool = False, pretrain: bool = True, modality_dropout: float = 0.0, audio_dropout: float = 0.0, noise_augmentation: bool = False, noise_path: str = './data/babble_noise.pt', max_noise_weight: float = 0.5, audio_only: bool = False)
Bases: AbsEncoder
FairSeq AVHubert pretrained encoder module.
This class implements a pretrained encoder for audio-visual (AV) representation learning using the AVHubert model. It extends the AbsEncoder class and integrates both audio and video modalities for feature extraction.
input_size
The dimension of the input features.
- Type: int
avhubert_url
URL for downloading the pretrained AVHubert model.
- Type: str
avhubert_dir_path
Directory path for storing the downloaded model.
- Type: str
extracted
Indicates if the model is in the extracted state.
- Type: bool
modality_dropout
Dropout rate for modality features.
- Type: float
audio_dropout
Dropout rate for audio features.
- Type: float
audio_only
If True, only audio features are processed.
- Type: bool
Parameters:
- input_size (int) – Input dimension for the encoder. Defaults to 1.
- avhubert_url (str) – Download link for the pretrained AVHubert model. Defaults to “./”.
- avhubert_dir_path (str) – Directory path for the downloaded model. Defaults to “./”.
- freeze_finetune_updates (int) – Number of updates during which the pretrained encoder is kept frozen before fine-tuning starts. Defaults to 0.
- encoder_embed_dim (int) – Dimension of the encoder embeddings. Defaults to 1024.
- encoder_layerdrop (float) – Dropout probability for encoder layers. Defaults to 0.05.
- dropout_input (float) – Dropout probability for input features. Defaults to 0.1.
- dropout_features (float) – Dropout probability for feature extraction. Defaults to 0.1.
- dropout (float) – Dropout probability in the encoder. Defaults to 0.1.
- attention_dropout (float) – Dropout probability for attention weights. Defaults to 0.1.
- feature_grad_mult (float) – Gradient multiplier for feature extractor. Defaults to 0.1.
- activation_dropout (float) – Dropout probability after activation. Defaults to 0.0.
- wav_input (bool) – If True, indicates that input is audio waveform. Defaults to False.
- layer_norm_first (bool) – If True, applies layer normalization first. Defaults to True.
- audio_feat_dim (int) – Dimension of audio features. Defaults to 104.
- encoder_layers (int) – Number of encoder layers. Defaults to 24.
- encoder_ffn_embed_dim (int) – Dimension of the FFN embeddings. Defaults to 4096.
- encoder_attention_heads (int) – Number of attention heads in the encoder. Defaults to 16.
- extracted (bool) – Indicates if features are extracted. Defaults to False.
- pretrain (bool) – If True, uses pretrained model weights. Defaults to True.
- modality_dropout (float) – Dropout rate for modality features. Defaults to 0.0.
- audio_dropout (float) – Dropout rate for audio features. Defaults to 0.0.
- noise_augmentation (bool) – If True, applies noise augmentation. Defaults to False.
- noise_path (str) – Path to the noise data for augmentation. Defaults to “./data/babble_noise.pt”.
- max_noise_weight (float) – Maximum weight for noise in augmentation (illustrated in the sketch after this parameter block). Defaults to 0.5.
- audio_only (bool) – If True, only processes audio stream. Defaults to False.
Returns: None
Raises: ValueError – If the input contains neither video nor audio data.
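The noise-related options above control an additive-noise augmentation of the audio stream. The snippet below is a minimal, hedged sketch of how such a scheme could combine a babble-noise tensor loaded from noise_path with the audio features using a random weight bounded by max_noise_weight. It only illustrates the role of these parameters, not the encoder's actual implementation, and it assumes the noise file holds a 2-D (D, L') feature tensor.

import torch

def add_babble_noise(audio, noise_path="./data/babble_noise.pt", max_noise_weight=0.5):
    """Hedged sketch: mix pre-extracted babble noise into (D, L) audio features."""
    noise = torch.load(noise_path)                    # assumed (D, L') feature tensor
    L = audio.shape[-1]
    if noise.shape[-1] < L:
        reps = -(-L // noise.shape[-1])               # ceil division: tile noise to length L
        noise = noise.repeat(1, reps)
    noise = noise[..., :L]
    weight = torch.rand(1).item() * max_noise_weight  # random weight in [0, max_noise_weight)
    return (1.0 - weight) * audio + weight * noise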
############# Examples
>>> encoder = FairseqAVHubertEncoder(input_size=1, avhubert_url="path/to/model")
>>> audio_input = torch.randn(8, 104, 100)  # (batch_size, feature_dim, length)
>>> video_input = torch.randn(8, 1, 100, 224, 224)  # (batch_size, 1, T, H, W)
>>> inputs = {"audio": audio_input, "video": video_input}
>>> lengths = torch.tensor([100, 90, 80, 70, 60, 50, 40, 30])  # example lengths
>>> output, olens, _ = encoder(inputs, lengths)
######## NOTE The AVHubert model follows the architecture described in the paper “Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction” (Shi et al., 2022). Make sure to provide the correct input shapes for the audio and video data.
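Complementing the audio-visual example above, the following hedged sketch shows an audio-only call, which the forward() contract permits as long as at least one of the 'audio' or 'video' keys is present; pretrain=False is used here only so the sketch does not depend on downloading a checkpoint.

>>> import torch
>>> encoder = FairseqAVHubertEncoder(pretrain=False, audio_only=True)
>>> audio_input = torch.randn(2, 104, 50)   # (B, audio_feat_dim, L)
>>> ilens = torch.tensor([50, 40])
>>> feats, olens, _ = encoder({"audio": audio_input}, ilens)
>>> print(feats.shape)   # (B, T, D) with D == encoder.output_size()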
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(xs_pad: Dict[str, Tensor], ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward AVHubert Encoder.
This method processes the input tensors for audio and video modalities through the AVHubert encoder, applying necessary transformations and masking. It returns the encoded features along with the output lengths and an optional tensor for further processing.
- Parameters:
- xs_pad (Dict[str, torch.Tensor]) – A dictionary containing the input tensors:
- ‘video’: input tensor of shape (B, 1, L, H, W)
- ‘audio’: input tensor of shape (B, D, L)
- ilens (torch.Tensor) – A tensor of shape (B,) representing the lengths of the input sequences for each batch.
- prev_states (torch.Tensor , optional) – Not used in the current version.
- Returns:
- Encoded features tensor of shape (B, T, D).
- A tensor of output lengths for each input sequence of shape (B,).
- An optional tensor that can be used for further processing, currently set to None.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises: ValueError – If neither the ‘video’ nor the ‘audio’ key is present in xs_pad.
############# Examples
>>> encoder = FairseqAVHubertEncoder()
>>> audio_input = torch.randn(2, 104, 50) # (B, D, L)
>>> video_input = torch.randn(2, 1, 50, 224, 224) # (B, 1, L, H, W)
>>> ilens = torch.tensor([50, 50])
>>> xs_pad = {'audio': audio_input, 'video': video_input}
>>> encoded_features, output_lengths, _ = encoder(xs_pad, ilens)
>>> print(encoded_features.shape) # Output: (2, T, D)
>>> print(output_lengths.shape) # Output: (2,)
######## NOTE This function supports both training and inference modes. During training, additional augmentations like time masking and modality dropout are applied.
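The note above mentions modality dropout during training. As a rough, hedged illustration of what the modality_dropout and audio_dropout rates control (not the encoder's actual code): with probability modality_dropout one modality is zeroed out for the whole utterance, and audio_dropout decides whether the dropped modality is the audio or the video stream.

import torch

def apply_modality_dropout(audio, video, modality_dropout=0.5, audio_dropout=0.5):
    """Hedged sketch of per-utterance modality dropout (training-time only)."""
    if torch.rand(1).item() < modality_dropout:
        if torch.rand(1).item() < audio_dropout:
            audio = torch.zeros_like(audio)   # drop the audio stream
        else:
            video = torch.zeros_like(video)   # drop the video stream
    return audio, video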
forward_fusion(xs_pad: Dict[str, Tensor]) → Tensor
Fuses audio and video features extracted from the encoder.
This method takes in a dictionary containing audio and video features, processes them through their respective modality encoders, and then fuses the results using the specified fusion method (concatenation or addition).
Parameters: xs_pad (Dict[str, torch.Tensor]) – A dictionary containing:
- ‘audio’ (torch.Tensor): Audio features tensor of shape (B, D, L), where B is the batch size, D is the number of audio features, and L is the sequence length.
- ‘video’ (torch.Tensor): Video features tensor of shape (B, 1, L, H, W), where H is the height and W is the width of the video frames.
Returns: The fused features tensor. The shape of the returned tensor depends on the fusion method used:
- If concatenation, shape will be (B, D * 2, L).
- If addition, shape will be (B, D, L).
Return type: torch.Tensor
############# Examples
>>> audio_input = torch.randn(4, 128, 10) # Batch of 4 audio samples
>>> video_input = torch.randn(4, 1, 10, 224, 224) # Batch of 4 video samples
>>> encoder = FairseqAVHubertEncoder()
>>> fused_features = encoder.forward_fusion({
... 'audio': audio_input,
... 'video': video_input
... })
>>> print(fused_features.shape)
torch.Size([4, 256, 10]) # If concatenation is used
######## NOTE The audio and video features must be preprocessed and extracted before calling this method. If either audio or video features are not provided, the method will handle it gracefully by creating zero tensors for the missing modality.
- Raises: ValueError – If both audio and video features are None.
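For intuition, the sketch below mirrors the fusion behaviour described above (concatenation along the feature dimension or element-wise addition, with zeros substituted for a missing modality). It is a simplified, hedged stand-in operating on already-extracted (B, D, L) features, not the encoder's own forward_fusion.

import torch

def fuse_features(audio=None, video=None, method="concat"):
    """Illustrative fusion of per-modality features of shape (B, D, L)."""
    if audio is None and video is None:
        raise ValueError("Both audio and video features are None")
    # Substitute zeros for a missing modality (shapes assumed to match).
    if audio is None:
        audio = torch.zeros_like(video)
    if video is None:
        video = torch.zeros_like(audio)
    if method == "concat":
        return torch.cat([audio, video], dim=1)   # (B, 2 * D, L)
    return audio + video                          # (B, D, L)

fused = fuse_features(torch.randn(4, 1024, 10), torch.randn(4, 1024, 10))
print(fused.shape)  # torch.Size([4, 2048, 10]) with concatenation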
output_size() → int
Get the output size of the AVHubert encoder.
This method returns the dimensionality of the output from the encoder, which is defined during the initialization of the encoder.
- Returns: The output size of the encoder, which corresponds to the embedding dimension specified during initialization.
- Return type: int
############# Examples
>>> encoder = FairseqAVHubertEncoder(encoder_embed_dim=512)
>>> encoder.output_size()
512
######## NOTE The output size is primarily determined by the encoder_embed_dim parameter passed to the encoder during its construction.
reload_pretrained_parameters()
Reload the pretrained parameters into the encoder.
This method allows the user to restore the original pretrained parameters of the AVHubert encoder. It is particularly useful in scenarios where the model has undergone fine-tuning and the user wants to revert to the initial state of the model.
The pretrained parameters are loaded from the self.pretrained_params attribute, which is a deep copy of the model’s state dictionary at initialization.
- Returns: None
############# Examples
>>> # Create an instance of the encoder
>>> encoder = FairseqAVHubertEncoder()
>>> # Fine-tune the encoder
>>> # ...
>>> # Reload the pretrained parameters
>>> encoder.reload_pretrained_parameters()