espnet2.asr.encoder.hubert_encoder.FairseqHubertPretrainEncoder
class espnet2.asr.encoder.hubert_encoder.FairseqHubertPretrainEncoder(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)
Bases: AbsEncoder
FairSeq Hubert pretrain encoder module, used for the pretraining stage.
This class implements the pretraining encoder for the Hubert model, which is used to learn representations of audio data through masked prediction tasks. It is part of the FairSeq implementation of Hubert, allowing for various configurations including dropout rates, sample rates, and dictionary paths.
input_size
Input dimension.
- Type: int
output_size
Dimension of attention.
- Type: int
linear_units
Dimension of feedforward layers.
- Type: int
attention_heads
Number of heads in multi-head attention.
- Type: int
num_blocks
Number of encoder blocks.
- Type: int
dropout_rate
Dropout rate applied to layers.
- Type: float
attention_dropout_rate
Dropout rate applied to attention layers.
- Type: float
hubert_dict
Path to the target dictionary for Hubert pretraining.
- Type: str
label_rate
Frame rate for labels. -1 indicates sequence labels.
- Type: int
sample_rate
Target sample rate for audio data.
- Type: int
use_amp
Indicates whether to use automatic mixed precision.
- Type: bool
Parameters:
- input_size – Input dimension.
- output_size – Dimension of attention.
- linear_units – Dimension of feedforward layers.
- attention_heads – Number of heads in multi-head attention.
- num_blocks – Number of encoder blocks.
- dropout_rate – Dropout rate for layers.
- attention_dropout_rate – Dropout rate for attention layers.
- hubert_dict – Path to the target dictionary for Hubert pretraining.
- label_rate – Frame rate for labels. -1 indicates sequence labels.
- sample_rate – Target sample rate for audio data.
- use_amp – Whether to use automatic mixed precision.
############# Examples
Instantiate the encoder:

>>> encoder = FairseqHubertPretrainEncoder(
...     input_size=1,
...     output_size=1024,
...     linear_units=1024,
...     attention_heads=12,
...     num_blocks=12,
...     dropout_rate=0.1,
...     hubert_dict="./dict.txt",
... )

Forward pass with input tensors:

>>> xs_pad = torch.randn(32, 100, 1)  # (B, L, D)
>>> ilens = torch.tensor([100] * 32)  # Input lengths
>>> ys_pad = torch.randn(32, 50, 1024)  # Target labels
>>> ys_pad_length = torch.tensor([50] * 32)  # Lengths of targets
>>> outputs = encoder(xs_pad, ilens, ys_pad, ys_pad_length)
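Per the return type documented below, the result can be unpacked as a 3-tuple (the variable names here are illustrative, not part of the API):

>>> hs, olens, extras = outputs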
- Returns: Output tensor, output lengths, and optional additional information.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
######## NOTE Ensure that the required library (FairSeq) is properly installed before using this encoder. Check the documentation for detailed installation and setup instructions.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
cast_mask_emb()
Cast the mask embedding to half precision if using AMP.
This method checks if automatic mixed precision (AMP) is enabled and casts the mask embedding parameter of the encoder to half precision (float16) if it is not already in that format. This is particularly useful for improving performance and reducing memory usage during training on compatible hardware.
######## NOTE This method is typically called during the forward pass of the encoder to ensure that the mask embedding is in the correct format for mixed precision training.
- Raises: TypeError – If the encoder’s mask embedding is not a torch Parameter.
############# Examples
If self.use_amp is True and the mask embedding is not in half precision, this method will convert it:
>>> encoder = FairseqHubertPretrainEncoder(...)
>>> encoder.use_amp = True
>>> encoder.cast_mask_emb() # Mask embedding is cast to half precision.
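Beyond the doctest above, a minimal sketch of what the cast itself might look like, assuming the underlying fairseq model exposes its mask embedding as `encoder.mask_emb` (that attribute name is an assumption for illustration, not verified against a particular fairseq version):

import torch

def cast_mask_emb(self):
    # Cast only when AMP is enabled and the embedding is not already float16.
    if self.use_amp and self.encoder.mask_emb.dtype != torch.half:
        self.encoder.mask_emb = torch.nn.Parameter(self.encoder.mask_emb.half())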
forward(xs_pad: Tensor, ilens: Tensor, ys_pad: Tensor, ys_pad_length: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward pass for the Hubert Pretrain Encoder.
This method processes the input tensor through the Hubert encoder. Depending on whether the model is in finetuning or pretraining mode, it directs the input to the appropriate forward function.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (B, L, D), where B is the batch size, L is the sequence length, and D is the feature dimension.
- ilens (torch.Tensor) – Input lengths tensor of shape (B), containing the actual lengths of the input sequences.
- ys_pad (torch.Tensor) – Target tensor of shape (B, L_y, D), where L_y is the length of the target sequences. Used as the prediction targets in pretraining mode.
- ys_pad_length (torch.Tensor) – Lengths of the target sequences, used in pretraining mode.
- prev_states (torch.Tensor , optional) – Placeholder for previous states, not utilized in the current implementation. Default is None.
- Returns: A tuple containing:
  - the position-embedded tensor (B, T, D),
  - a mask tensor indicating the number of valid elements in the output (B),
  - an optional tensor for additional outputs, currently None.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
############# Examples
>>> encoder = FairseqHubertPretrainEncoder(...)
>>> xs_pad = torch.randn(2, 100, 768) # Example input
>>> ilens = torch.tensor([100, 80]) # Input lengths
>>> ys_pad = torch.randint(0, 10, (2, 50)) # Example targets
>>> ys_pad_length = torch.tensor([50, 50]) # Target lengths
>>> output = encoder.forward(xs_pad, ilens, ys_pad, ys_pad_length)
######## NOTE The method will return different outputs based on the finetuning state of the model.
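As a hedged illustration of the dispatch described above (the `finetuning` flag and the two helper names are assumptions for illustration, not the verified implementation):

def forward(self, xs_pad, ilens, ys_pad, ys_pad_length, prev_states=None):
    # Pretraining consumes the targets for masked prediction;
    # finetuning only encodes the padded inputs.
    if self.finetuning:  # assumed flag name
        return self._forward_finetuning(xs_pad, ilens)
    return self._forward_pretraining(xs_pad, ilens, ys_pad, ys_pad_length)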
output_size()
Get the output size of the encoder.
This method returns the dimension of the output from the encoder, which corresponds to the embedding dimension used in the model.
- Returns: The output size (embedding dimension) of the encoder.
- Return type: int
############# Examples
>>> encoder = FairseqHubertPretrainEncoder(output_size=256)
>>> encoder.output_size()
256
>>> encoder = FairseqHubertPretrainEncoder(output_size=512)
>>> encoder.output_size()
512
reload_pretrained_parameters()
Reload the pretrained parameters for the Hubert model.
This method loads the parameters from the previously stored state dictionary pretrained_params back into the Hubert model. It allows for restoring the model’s weights to a pretrained state, which is useful during fine-tuning or after training sessions.
Logging is performed to indicate the successful reloading of parameters.
############# Examples
>>> encoder = FairseqHubertPretrainEncoder(...)
>>> encoder.reload_pretrained_parameters()
Pretrained Hubert model parameters reloaded!
######## NOTE The method uses strict=False when loading the state dictionary, which means that it will ignore any keys that are not found in the current model. This is particularly useful if the model architecture has changed since the parameters were saved.
- Raises: RuntimeError – If there is an issue with loading the state dictionary.
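Based on the description above, the reload is essentially a non-strict state-dict load; a minimal sketch, assuming the snapshot is kept in a `pretrained_params` attribute (an assumed name for illustration):

import logging

def reload_pretrained_parameters(self):
    # strict=False ignores keys absent from the current model, which
    # tolerates architecture changes made after the snapshot was saved.
    self.encoder.load_state_dict(self.pretrained_params, strict=False)
    logging.info("Pretrained Hubert model parameters reloaded!")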