espnet2.asr.encoder.hubert_encoder.FairseqHubertEncoder
class espnet2.asr.encoder.hubert_encoder.FairseqHubertEncoder(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)
Bases: AbsEncoder
Fairseq Hubert encoder module for loading pretrained weights and fine-tuning.
This class wraps the Hubert encoder architecture from fairseq, allowing a pretrained model to be loaded and fine-tuned for downstream tasks, in particular automatic speech recognition (ASR).
- Parameters:
- input_size (int) – Input dimension for the model.
- hubert_url (str) – URL to download the Hubert pretrained model.
- hubert_dir_path (str) – Directory to save the downloaded model.
- output_size (int) – Dimension of the output from the encoder.
- normalize_before (bool) – Whether to apply layer normalization before the first block of the encoder.
- freeze_finetune_updates (int) – Number of updates during which all layers except the output layer are kept frozen before the entire model is fine-tuned. This helps prevent overfitting.
- dropout_rate (float) – Dropout rate applied in the encoder.
- activation_dropout (float) – Dropout rate applied after the activation function in the encoder's feed-forward layers.
- attention_dropout (float) – Dropout rate applied in the attention mechanism.
Hubert-specific Args: For more details on the masking and other Hubert-specific arguments, please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py
#### Examples
>>> import torch
>>> encoder = FairseqHubertEncoder(input_size=512, hubert_url="path/to/model")
>>> xs_pad = torch.randn(10, 100, 512) # (B, L, D)
>>> ilens = torch.tensor([100] * 10) # Lengths of input sequences
>>> outputs, olens, _ = encoder(xs_pad, ilens)
#### NOTE
Ensure that the fairseq library is installed and properly configured to use this module.
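Beyond the basic example above, the Hubert-specific masking and freezing options can be set at construction time. The following is a minimal sketch; the checkpoint path and cache directory are hypothetical placeholders.
>>> encoder = FairseqHubertEncoder(
...     input_size=512,
...     hubert_url="path/to/hubert_checkpoint.pt",  # hypothetical checkpoint location
...     hubert_dir_path="./hubert_models",          # hypothetical cache directory
...     output_size=256,
...     freeze_finetune_updates=10000,  # keep pretrained layers frozen for the first updates
...     mask_prob=0.75,
...     mask_channel_prob=0.5,
...     layerdrop=0.1,
... )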
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward pass for the Hubert ASR Encoder.
This method processes the input tensor through the Hubert encoder and returns the encoded (position-embedded) tensor together with the output lengths. The behavior depends on whether the pretrained layers are still frozen or are already being fine-tuned (see freeze_finetune_updates).
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (B, L, D) where B is the batch size, L is the sequence length, and D is the feature dimension.
- ilens (torch.Tensor) – A tensor containing the lengths of the input sequences of shape (B).
- prev_states (torch.Tensor , optional) – Placeholder for previous states. Not used in the current implementation. Default is None.
- Returns: A tuple containing:
- Position embedded tensor of shape (B, T, D).
- A tensor containing the lengths of the outputs of shape (B).
- An optional tensor (currently None).
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
#### NOTE
Until freeze_finetune_updates updates have been performed, the pretrained Hubert parameters are kept frozen and only the output layer is trained; after that, the entire encoder is fine-tuned.
#### Examples
>>> import torch
>>> encoder = FairseqHubertEncoder(input_size=256)
>>> xs_pad = torch.randn(2, 10, 256) # Example input tensor
>>> ilens = torch.tensor([10, 8]) # Lengths of input sequences
>>> outputs = encoder(xs_pad, ilens)
>>> print(outputs[0].shape) # Position embedded tensor shape
>>> print(outputs[1]) # Output lengths
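As a rough follow-up to the example above (assuming encoder, xs_pad, and ilens are defined as shown, and that fairseq is installed), the returned lengths can be turned into a boolean padding mask for downstream modules:
>>> encoded, olens, _ = encoder(xs_pad, ilens)
>>> # True for valid frames, False for padding, based on the returned lengths.
>>> frame_mask = torch.arange(encoded.size(1)).unsqueeze(0) < olens.unsqueeze(1)  # (B, T)
>>> pooled = (encoded * frame_mask.unsqueeze(-1)).sum(dim=1) / olens.unsqueeze(-1)  # (B, D) mean over valid frames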
output_size() → int
Returns the output size of the encoder.
This method returns the dimensionality of the encoder output, which downstream components (e.g., decoders or CTC layers) need in order to size their inputs.
- Returns: The output size of the encoder, typically representing the dimensionality of the final layer.
- Return type: int
#### Examples
>>> encoder = FairseqHubertEncoder(input_size=256, output_size=512)
>>> encoder.output_size()
512
#### NOTE
The output size is set by the output_size argument when the encoder is initialized.
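For illustration, the reported size can be used to dimension a downstream projection; the linear layer and vocabulary size below are purely hypothetical and not part of this class.
>>> import torch
>>> vocab_size = 5000  # hypothetical vocabulary size
>>> proj = torch.nn.Linear(encoder.output_size(), vocab_size)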
reload_pretrained_parameters()
Reloads the pretrained parameters into the Hubert model.
This method loads the parameters stored in self.pretrained_params back into the Hubert pretraining model. This can be useful when you want to reset the model to its initial state after fine-tuning or experimentation.
It performs the following steps:
1. Loads the state dictionary from self.pretrained_params.
2. Logs a message indicating that the pretrained parameters have been reloaded.
#### Examples
>>> encoder = FairseqHubertEncoder(input_size=128)
>>> encoder.reload_pretrained_parameters()
Pretrained Hubert model parameters reloaded!
#### NOTE
The strict argument is set to False to allow loading of parameters that may not match exactly, which can be useful if some layers were added or modified.
- Raises: RuntimeError – If the state dictionary cannot be loaded due to a mismatch in parameters.
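A minimal usage sketch, assuming the encoder was built from a pretrained checkpoint: restoring the stored weights lets each run in a hypothetical hyperparameter sweep start from the same initial state.
>>> for lr in (1e-5, 5e-5):  # hypothetical learning-rate sweep
...     encoder.reload_pretrained_parameters()  # start each run from the pretrained weights
...     # ... fine-tune with this learning rate ...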