espnet2.asr.encoder.hubert_encoder.FairseqHubertEncoder
class espnet2.asr.encoder.hubert_encoder.FairseqHubertEncoder(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)
Bases: AbsEncoder
Fairseq Hubert encoder module for loading pretrained weights and fine-tuning.
This class wraps the Hubert encoder architecture from fairseq, allowing a pretrained model to be loaded and fine-tuned for downstream tasks, in particular automatic speech recognition (ASR).
- Parameters:
- input_size (int) – Input dimension for the model.
- hubert_url (str) – URL to download the Hubert pretrained model.
- hubert_dir_path (str) – Directory to save the downloaded model.
- output_size (int) – Dimension of the output from the encoder.
- normalize_before (bool) – Whether to apply layer normalization before the first block of the encoder.
- freeze_finetune_updates (int) – Number of updates during which all layers except the output layer are kept frozen before the entire model is fine-tuned. This helps prevent overfitting.
- dropout_rate (float) – Dropout rate applied in the encoder.
- activation_dropout (float) – Dropout rate applied after the activation function in the encoder's feed-forward layers.
- attention_dropout (float) – Dropout rate applied in the attention mechanism.
Hubert-specific Args: For more details on the masking and other Hubert-specific arguments, please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py
#### Examples
>>> import torch
>>> encoder = FairseqHubertEncoder(input_size=512, hubert_url="path/to/model")
>>> xs_pad = torch.randn(10, 100, 512) # (B, L, D)
>>> ilens = torch.tensor([100] * 10) # Lengths of input sequences
>>> outputs, olens, _ = encoder(xs_pad, ilens)
#### NOTE
Ensure that the fairseq library is installed and properly configured to use this module.
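Beyond the basic example above, the Hubert-specific masking and freezing options can be set at construction time. The following is a minimal sketch; the checkpoint path and cache directory are hypothetical placeholders.
>>> encoder = FairseqHubertEncoder(
...     input_size=512,
...     hubert_url="path/to/hubert_checkpoint.pt",  # hypothetical checkpoint location
...     hubert_dir_path="./hubert_models",          # hypothetical cache directory
...     output_size=256,
...     freeze_finetune_updates=10000,  # keep pretrained layers frozen for the first updates
...     mask_prob=0.75,
...     mask_channel_prob=0.5,
...     layerdrop=0.1,
... )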
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward pass for the Hubert ASR Encoder.
This method processes the input tensor through the Hubert encoder and returns the encoded (position-embedded) tensor together with the output lengths. The behavior depends on whether the pretrained layers are still frozen or are already being fine-tuned (see freeze_finetune_updates).
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (B, L, D) where B is the batch size, L is the sequence length, and D is the feature dimension.
- ilens (torch.Tensor) – A tensor containing the lengths of the input sequences of shape (B).
- prev_states (torch.Tensor , optional) – Placeholder for previous states. Not used in the current implementation. Default is None.
- Returns: A tuple containing:
- Position embedded tensor of shape (B, T, D).
- A tensor containing the lengths of the outputs of shape (B).
- An optional tensor (currently None).
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
#### NOTE
Until freeze_finetune_updates updates have been performed, the pretrained Hubert parameters are kept frozen and only the output layer is trained; after that, the entire encoder is fine-tuned.
#### Examples
>>> import torch
>>> encoder = FairseqHubertEncoder(input_size=256)
>>> xs_pad = torch.randn(2, 10, 256) # Example input tensor
>>> ilens = torch.tensor([10, 8]) # Lengths of input sequences
>>> outputs = encoder(xs_pad, ilens)
>>> print(outputs[0].shape) # Position embedded tensor shape
>>> print(outputs[1]) # Output lengths
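As a rough follow-up to the example above (assuming encoder, xs_pad, and ilens are defined as shown, and that fairseq is installed), the returned lengths can be turned into a boolean padding mask for downstream modules:
>>> encoded, olens, _ = encoder(xs_pad, ilens)
>>> # True for valid frames, False for padding, based on the returned lengths.
>>> frame_mask = torch.arange(encoded.size(1)).unsqueeze(0) < olens.unsqueeze(1)  # (B, T)
>>> pooled = (encoded * frame_mask.unsqueeze(-1)).sum(dim=1) / olens.unsqueeze(-1)  # (B, D) mean over valid frames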
output_size() → int
Returns the output size of the encoder.
This method returns the dimensionality of the encoder output, which downstream components (e.g., decoders or CTC layers) need in order to size their inputs.
- Returns: The output size of the encoder, typically representing the dimensionality of the final layer.
- Return type: int
#### Examples
>>> encoder = FairseqHubertEncoder(input_size=256, output_size=512)
>>> encoder.output_size()
512
#### NOTE
The output size is set by the output_size argument when the encoder is initialized.
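For illustration, the reported size can be used to dimension a downstream projection; the linear layer and vocabulary size below are purely hypothetical and not part of this class.
>>> import torch
>>> vocab_size = 5000  # hypothetical vocabulary size
>>> proj = torch.nn.Linear(encoder.output_size(), vocab_size)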
reload_pretrained_parameters()
Reloads the pretrained parameters into the Hubert model.
This method loads the parameters stored in self.pretrained_params back into the Hubert pretraining model. This can be useful when you want to reset the model to its initial state after fine-tuning or experimentation.
It performs the following steps:
1. Loads the state dictionary from self.pretrained_params.
2. Logs a message indicating that the pretrained parameters have been reloaded.
#### Examples
>>> encoder = FairseqHubertEncoder(input_size=128)
>>> encoder.reload_pretrained_parameters()
Pretrained Hubert model parameters reloaded!
#### NOTE
The strict argument is set to False to allow loading of parameters that may not match exactly, which can be useful if some layers were added or modified.
- Raises: RuntimeError – If the state dictionary cannot be loaded due to a mismatch in parameters.
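A minimal usage sketch, assuming the encoder was built from a pretrained checkpoint: restoring the stored weights lets each run in a hypothetical hyperparameter sweep start from the same initial state.
>>> for lr in (1e-5, 5e-5):  # hypothetical learning-rate sweep
...     encoder.reload_pretrained_parameters()  # start each run from the pretrained weights
...     # ... fine-tune with this learning rate ...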