espnet2.asr.encoder.hubert_encoder.TorchAudioHuBERTPretrainEncoder
class espnet2.asr.encoder.hubert_encoder.TorchAudioHuBERTPretrainEncoder(input_size: int = None, extractor_mode: str = 'group_norm', extractor_conv_layer_config: List[List[int]] | None = [[512, 10, 5], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 2, 2], [512, 2, 2]], extractor_conv_bias: bool = False, encoder_embed_dim: int = 768, encoder_projection_dropout: float = 0.1, encoder_pos_conv_kernel: int = 128, encoder_pos_conv_groups: int = 16, encoder_num_layers: int = 12, encoder_num_heads: int = 12, encoder_attention_dropout: float = 0.1, encoder_ff_interm_features: int = 3072, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.1, encoder_layer_norm_first: bool = False, encoder_layer_drop: float = 0.05, mask_prob: float = 0.8, mask_selection: str = 'static', mask_other: float = 0.0, mask_length: int = 10, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0.0, mask_channel_length: int = 10, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, skip_masked: bool = False, skip_nomask: bool = False, num_classes: int = 100, final_dim: int = 256, feature_grad_mult: float | None = 0.1, finetuning: bool = False, freeze_encoder_updates: int = 0)
Bases: AbsEncoder
Torch Audio HuBERT encoder module for speech representation learning.
This class implements the HuBERT encoder, a model designed for self-supervised speech representation learning. The encoder uses convolutional layers followed by transformer blocks, allowing for efficient processing of audio input.
- Parameters:
- extractor_mode (str) – Operation mode of the feature extractor. Valid values are “group_norm” or “layer_norm”.
- extractor_conv_layer_config (List[List[int]]) – Configuration of convolution layers in feature extractor. List of convolution configurations, i.e. [[output_channel, kernel_size, stride], …].
- extractor_conv_bias (bool) – Whether to include a bias term for each convolution operation.
- encoder_embed_dim (int) – The dimension of embedding in the encoder.
- encoder_projection_dropout (float) – The dropout probability applied after projecting the input feature to “encoder_embed_dim”.
- encoder_pos_conv_kernel (int) – Kernel size of convolutional positional embeddings.
- encoder_pos_conv_groups (int) – Number of groups for convolutional positional embeddings.
- encoder_num_layers (int) – Number of self-attention layers in the transformer block.
- encoder_num_heads (int) – Number of heads in the self-attention layers.
- encoder_attention_dropout (float) – Dropout probability applied after softmax in the self-attention layer.
- encoder_ff_interm_features (int) – Dimension of hidden features in the feedforward layer.
- encoder_ff_interm_dropout (float) – Dropout probability applied in the feedforward layer.
- encoder_dropout (float) – Dropout probability applied at the end of the feedforward layer.
- encoder_layer_norm_first (bool) – Controls the order of layer normalization in the transformer layer and each encoder layer. If True, layer norm is applied before features are fed to the encoder layers.
- encoder_layer_drop (float) – Probability to drop each encoder layer during training.
- mask_prob (float) – Probability for each token to be chosen as the start of the span to be masked (see the masking sketch after this parameter list).
- mask_selection (str) – Method for choosing the mask length. Options: [static, uniform, normal, poisson].
- mask_other (float) – Secondary mask argument for more complex distributions.
- mask_length (int) – Lengths of the mask.
- no_mask_overlap (bool) – Whether to allow masks to overlap.
- mask_min_space (int) – Minimum space between spans if no overlap is enabled.
- mask_channel_prob (float) – Probability of replacing a feature with 0.
- mask_channel_selection (str) – Method for choosing the mask length for channel masking. Options: [static, uniform, normal, poisson].
- mask_channel_other (float) – Secondary mask argument for channel masking.
- mask_channel_length (int) – Lengths of the mask for channel masking.
- no_mask_channel_overlap (bool) – Whether to allow channel masks to overlap.
- mask_channel_min_space (int) – Minimum space between spans for channel masking if no overlap is enabled.
- skip_masked (bool) – If True, skip computing losses over masked frames.
- skip_nomask (bool) – If True, skip computing losses over unmasked frames.
- num_classes (int) – The number of classes in the labels.
- final_dim (int) – Dimension to project final representations and targets.
- feature_grad_mult (Optional[float]) – Factor to scale the convolutional feature extraction layer gradients. The scale factor does not affect the forward pass.
- finetuning (bool) – Whether to fine-tune the model with ASR or other tasks.
- freeze_encoder_updates (int) – Number of steps to freeze the encoder parameters in ASR fine-tuning.
HuBERT-specific arguments: please refer to https://pytorch.org/audio/stable/generated/torchaudio.models.hubert_pretrain_model.html#torchaudio.models.hubert_pretrain_model
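To make the span-masking parameters concrete, the following is a minimal, illustrative sketch of a "static" span-masking policy. It is a simplification loosely mirroring fairseq-style masking, not the exact torchaudio implementation, and the helper name is hypothetical:

>>> import torch
>>> def static_span_mask(batch, length, mask_prob=0.8, mask_length=10):
...     # Number of spans so that expected coverage is at most mask_prob.
...     num_spans = max(1, int(mask_prob * length / mask_length))
...     mask = torch.zeros(batch, length, dtype=torch.bool)
...     for b in range(batch):
...         # Random span starts; spans may overlap unless extra
...         # no_mask_overlap-style constraints are enforced.
...         starts = torch.randperm(length - mask_length)[:num_spans]
...         for s in starts.tolist():
...             mask[b, s : s + mask_length] = True
...     return mask
>>> static_span_mask(2, 100).shape
torch.Size([2, 100])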
Examples

>>> import torch
>>> encoder = TorchAudioHuBERTPretrainEncoder(finetuning=True).eval()
>>> waveforms = torch.randn(10, 16000)  # raw waveforms (B, num_samples)
>>> lengths = torch.full((10,), 16000)  # valid samples per utterance
>>> output, out_lens, _ = encoder(waveforms, lengths)
NOTE: Ensure that torchaudio is installed and properly configured for using the HuBERT model.
- Raises: ImportError – If torchaudio is not installed.
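The constructor arguments documented above can also configure a smaller encoder; the values below are purely illustrative, not recommended settings:

>>> encoder = TorchAudioHuBERTPretrainEncoder(
...     encoder_embed_dim=256,
...     encoder_num_layers=6,
...     encoder_num_heads=4,
...     encoder_ff_interm_features=1024,
... )
>>> encoder.output_size()  # matches encoder_embed_dim
256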
forward(xs_pad: Tensor, ilens: Tensor, ys_pad: Tensor | None = None, ys_pad_length: Tensor | None = None, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward pass for the Hubert Pretrain Encoder.
This method processes the input tensor through the Hubert pretraining model. It handles both pretraining and fine-tuning modes, depending on the state of the model.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of padded raw waveforms of shape (B, L), where B is the batch size and L is the number of samples.
- ilens (torch.Tensor) – A tensor of shape (B,) representing the lengths of each input sequence in the batch.
- ys_pad (torch.Tensor, optional) – Target pseudo-label tensor of shape (B, L_y), required for pretraining. Defaults to None.
- ys_pad_length (torch.Tensor , optional) – A tensor of shape (B,) representing the lengths of each target sequence. Defaults to None.
- prev_states (torch.Tensor , optional) – Not used in the current version. Defaults to None.
- Returns: A tuple containing:
- The position-embedded output tensor.
- A tensor representing the mask.
- An optional tensor, which is currently None.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
Examples

>>> encoder = TorchAudioHuBERTPretrainEncoder(finetuning=True).eval()
>>> waveforms = torch.randn(4, 16000)                     # (B, L) raw waveforms
>>> lengths = torch.tensor([16000, 15360, 14080, 12800])  # per-utterance lengths
>>> output, out_lens, _ = encoder(waveforms, lengths)
NOTE

- In the default pretraining mode (finetuning=False), ys_pad must be provided; in that mode the returned tuple contains the HuBERT pretraining outputs (masked-frame logits, unmasked-frame logits, and the feature penalty) rather than the encoded features.
- If the model is in fine-tuning mode, the method will skip certain computations based on the specified flags.
- Ensure that the input tensor and its lengths are correctly formatted to avoid runtime errors.
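For the default pretraining mode, a complete call looks roughly like the sketch below. The label length of 49 assumes the default conv extractor's roughly 320x downsampling of a 16000-sample input; in practice the pseudo-label length must match the extractor's output frame count exactly.

>>> encoder = TorchAudioHuBERTPretrainEncoder(num_classes=100)
>>> waveforms = torch.randn(2, 16000)
>>> lengths = torch.tensor([16000, 16000])
>>> labels = torch.randint(0, 100, (2, 49))  # hypothetical frame-level pseudo-labels
>>> label_lengths = torch.tensor([49, 49])
>>> logit_m, logit_u, penalty = encoder(waveforms, lengths, labels, label_lengths)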
output_size() → int
Get the output size of the encoder.
This method returns the dimension of the output from the encoder, which corresponds to the embedding dimension used in the model.
- Returns: The output size (dimension of the encoder output).
- Return type: int
Examples
>>> encoder = TorchAudioHuBERTPretrainEncoder()
>>> encoder.output_size()
768
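The value is typically used to size downstream modules, for example a hypothetical output projection:

>>> proj = torch.nn.Linear(encoder.output_size(), 500)  # 500 is an arbitrary vocabulary size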
reload_pretrained_parameters()
Reloads the pretrained parameters into the Hubert model.
This method loads the previously stored state dictionary of the pretrained Hubert model from self.pretrained_params and applies it to the current instance of the Hubert model. This is particularly useful when the model has been fine-tuned or modified and you want to revert to the original pretrained weights.
NOTE: The loading is done with strict=False, so keys in the state dict that are not found in the model are ignored.
Examples
>>> encoder = TorchAudioHuBERTPretrainEncoder()
>>> # Assume some fine-tuning has happened here
>>> encoder.reload_pretrained_parameters() # Reloads original weights
- Raises: RuntimeError – If the model fails to load the state dict.
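Based on the description above, the reload is roughly equivalent to the following sketch (the attribute names hubert_pretrain_model and pretrained_params are assumed from this class's description, with pretrained_params holding a copy of the pretrained state dict):

>>> keys = encoder.hubert_pretrain_model.load_state_dict(
...     encoder.pretrained_params, strict=False
... )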