espnet2.asr.encoder.hubert_encoder.FairseqHubertPretrainEncoder
class espnet2.asr.encoder.hubert_encoder.FairseqHubertPretrainEncoder(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)
Bases: AbsEncoder
FairSeq Hubert pretrain encoder module, used for the pretraining stage.
This class implements the pretraining encoder for the Hubert model, which is used to learn representations of audio data through masked prediction tasks. It is part of the FairSeq implementation of Hubert, allowing for various configurations including dropout rates, sample rates, and dictionary paths.
input_size
Input dimension.
- Type: int
output_size
Dimension of attention.
- Type: int
linear_units
Dimension of feedforward layers.
- Type: int
attention_heads
Number of heads in multi-head attention.
- Type: int
num_blocks
Number of encoder blocks.
- Type: int
dropout_rate
Dropout rate applied to layers.
- Type: float
attention_dropout_rate
Dropout rate applied to attention layers.
- Type: float
hubert_dict
Path to the target dictionary for Hubert pretraining.
- Type: str
label_rate
Frame rate for labels. -1 indicates sequence labels.
- Type: int
sample_rate
Target sample rate for audio data.
- Type: int
use_amp
Indicates whether to use automatic mixed precision.
- Type: bool
Parameters:
- input_size – Input dimension.
- output_size – Dimension of attention.
- linear_units – Dimension of feedforward layers.
- attention_heads – Number of heads in multi-head attention.
- num_blocks – Number of encoder blocks.
- dropout_rate – Dropout rate for layers.
- attention_dropout_rate – Dropout rate for attention layers.
- hubert_dict – Path to the target dictionary for Hubert pretraining.
- label_rate – Frame rate for labels. -1 indicates sequence labels.
- sample_rate – Target sample rate for audio data.
- use_amp – Whether to use automatic mixed precision.
############# Examples
Instantiate the encoder:

>>> encoder = FairseqHubertPretrainEncoder(
...     input_size=1,
...     output_size=1024,
...     linear_units=1024,
...     attention_heads=12,
...     num_blocks=12,
...     dropout_rate=0.1,
...     hubert_dict="./dict.txt",
... )

Forward pass with input tensors:

>>> xs_pad = torch.randn(32, 100, 1)  # (B, L, D)
>>> ilens = torch.tensor([100] * 32)  # Input lengths
>>> ys_pad = torch.randn(32, 50, 1024)  # Target labels
>>> ys_pad_length = torch.tensor([50] * 32)  # Lengths of targets
>>> outputs = encoder(xs_pad, ilens, ys_pad, ys_pad_length)
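Per the return type documented below, the result can be unpacked as a 3-tuple (the variable names here are illustrative, not part of the API):

>>> hs, olens, extras = outputs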
- Returns: Output tensor, output lengths, and optional additional information.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
######## NOTE Ensure that the required library (FairSeq) is properly installed before using this encoder. Check the documentation for detailed installation and setup instructions.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
cast_mask_emb()
Cast the mask embedding to half precision if using AMP.
This method checks if automatic mixed precision (AMP) is enabled and casts the mask embedding parameter of the encoder to half precision (float16) if it is not already in that format. This is particularly useful for improving performance and reducing memory usage during training on compatible hardware.
######## NOTE This method is typically called during the forward pass of the encoder to ensure that the mask embedding is in the correct format for mixed precision training.
- Raises: TypeError – If the encoder’s mask embedding is not a torch Parameter.
############# Examples
If self.use_amp is True and the mask embedding is not in half precision, this method will convert it:
>>> encoder = FairseqHubertPretrainEncoder(...)
>>> encoder.use_amp = True
>>> encoder.cast_mask_emb() # Mask embedding is cast to half precision.
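Beyond the doctest above, a minimal sketch of what the cast itself might look like, assuming the underlying fairseq model exposes its mask embedding as `encoder.mask_emb` (that attribute name is an assumption for illustration, not verified against a particular fairseq version):

import torch

def cast_mask_emb(self):
    # Cast only when AMP is enabled and the embedding is not already float16.
    if self.use_amp and self.encoder.mask_emb.dtype != torch.half:
        self.encoder.mask_emb = torch.nn.Parameter(self.encoder.mask_emb.half())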
forward(xs_pad: Tensor, ilens: Tensor, ys_pad: Tensor, ys_pad_length: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward pass for the Hubert Pretrain Encoder.
This method processes the input tensor through the Hubert encoder. Depending on whether the model is in finetuning or pretraining mode, it directs the input to the appropriate forward function.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (B, L, D), where B is the batch size, L is the sequence length, and D is the feature dimension.
- ilens (torch.Tensor) – Input lengths tensor of shape (B), containing the actual lengths of the input sequences.
- ys_pad (torch.Tensor) – Target tensor of shape (B, L_y, D), where L_y is the length of the target sequences. Used as the prediction targets in pretraining mode.
- ys_pad_length (torch.Tensor) – Lengths of the target sequences, used in pretraining mode.
- prev_states (torch.Tensor , optional) – Placeholder for previous states, not utilized in the current implementation. Default is None.
- Returns: A tuple containing:
  - the position-embedded tensor (B, T, D),
  - a mask tensor indicating the number of valid elements in the output (B),
  - an optional tensor for additional outputs, currently None.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
############# Examples
>>> encoder = FairseqHubertPretrainEncoder(...)
>>> xs_pad = torch.randn(2, 100, 768) # Example input
>>> ilens = torch.tensor([100, 80]) # Input lengths
>>> ys_pad = torch.randint(0, 10, (2, 50)) # Example targets
>>> ys_pad_length = torch.tensor([50, 50]) # Target lengths
>>> output = encoder.forward(xs_pad, ilens, ys_pad, ys_pad_length)
######## NOTE The method will return different outputs based on the finetuning state of the model.
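As a hedged illustration of the dispatch described above (the `finetuning` flag and the two helper names are assumptions for illustration, not the verified implementation):

def forward(self, xs_pad, ilens, ys_pad, ys_pad_length, prev_states=None):
    # Pretraining consumes the targets for masked prediction;
    # finetuning only encodes the padded inputs.
    if self.finetuning:  # assumed flag name
        return self._forward_finetuning(xs_pad, ilens)
    return self._forward_pretraining(xs_pad, ilens, ys_pad, ys_pad_length)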
output_size()
Get the output size of the encoder.
This method returns the dimension of the output from the encoder, which corresponds to the embedding dimension used in the model.
- Returns: The output size (embedding dimension) of the encoder.
- Return type: int
############# Examples
>>> encoder = FairseqHubertPretrainEncoder(output_size=256)
>>> encoder.output_size()
256
>>> encoder = FairseqHubertPretrainEncoder(output_size=512)
>>> encoder.output_size()
512
reload_pretrained_parameters()
Reload the pretrained parameters for the Hubert model.
This method loads the parameters from the previously stored state dictionary pretrained_params back into the Hubert model. It allows for restoring the model’s weights to a pretrained state, which is useful during fine-tuning or after training sessions.
Logging is performed to indicate the successful reloading of parameters.
############# Examples
>>> encoder = FairseqHubertPretrainEncoder(...)
>>> encoder.reload_pretrained_parameters()
Pretrained Hubert model parameters reloaded!
######## NOTE The method uses strict=False when loading the state dictionary, which means that it will ignore any keys that are not found in the current model. This is particularly useful if the model architecture has changed since the parameters were saved.
- Raises: RuntimeError – If there is an issue with loading the state dictionary.
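Based on the description above, the reload is essentially a non-strict state-dict load; a minimal sketch, assuming the snapshot is kept in a `pretrained_params` attribute (an assumed name for illustration):

import logging

def reload_pretrained_parameters(self):
    # strict=False ignores keys absent from the current model, which
    # tolerates architecture changes made after the snapshot was saved.
    self.encoder.load_state_dict(self.pretrained_params, strict=False)
    logging.info("Pretrained Hubert model parameters reloaded!")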