espnet2.asr.encoder.hubert_encoder.TorchAudioHuBERTPretrainEncoder
class espnet2.asr.encoder.hubert_encoder.TorchAudioHuBERTPretrainEncoder(input_size: int = None, extractor_mode: str = 'group_norm', extractor_conv_layer_config: List[List[int]] | None = [[512, 10, 5], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 3, 2], [512, 2, 2], [512, 2, 2]], extractor_conv_bias: bool = False, encoder_embed_dim: int = 768, encoder_projection_dropout: float = 0.1, encoder_pos_conv_kernel: int = 128, encoder_pos_conv_groups: int = 16, encoder_num_layers: int = 12, encoder_num_heads: int = 12, encoder_attention_dropout: float = 0.1, encoder_ff_interm_features: int = 3072, encoder_ff_interm_dropout: float = 0.0, encoder_dropout: float = 0.1, encoder_layer_norm_first: bool = False, encoder_layer_drop: float = 0.05, mask_prob: float = 0.8, mask_selection: str = 'static', mask_other: float = 0.0, mask_length: int = 10, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0.0, mask_channel_length: int = 10, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, skip_masked: bool = False, skip_nomask: bool = False, num_classes: int = 100, final_dim: int = 256, feature_grad_mult: float | None = 0.1, finetuning: bool = False, freeze_encoder_updates: int = 0)
Bases: AbsEncoder
Torch Audio Hubert encoder module.
- Parameters:
- extractor_mode – Operation mode of feature extractor. Valid values are "group_norm" or "layer_norm".
- extractor_conv_layer_config – Configuration of convolution layers in feature extractor. List of convolution configuration, i.e. [[output_channel, kernel_size, stride], …]
- extractor_conv_bias – Whether to include bias term to each convolution operation.
- encoder_embed_dim – The dimension of embedding in encoder.
- encoder_projection_dropout – The dropout probability applied after the input feature is projected to "encoder_embed_dim".
- encoder_pos_conv_kernel – Kernel size of convolutional positional embeddings.
- encoder_pos_conv_groups – Number of groups of convolutional positional embeddings.
- encoder_num_layers – Number of self attention layers in transformer block.
- encoder_num_heads – Number of heads in self attention layers.
- encoder_attention_dropout – Dropout probability applied after softmax in self-attention layer.
- encoder_ff_interm_features – Dimension of hidden features in feed forward layer.
- encoder_ff_interm_dropout – Dropout probability applied in feedforward layer.
- encoder_dropout – Dropout probability applied at the end of feed forward layer.
- encoder_layer_norm_first – Control the order of layer norm in transformer layer and each encoder layer. If True, in transformer layer, layer norm is applied before features are fed to encoder layers.
- encoder_layer_drop – Probability to drop each encoder layer during training.
- mask_prob – Probability for each token to be chosen as start of the span to be masked.
- mask_selection – How to choose the mask length. Options: [static, uniform, normal, poisson].
- mask_other – Secondary mask argument (used for more complex distributions).
- mask_length – The length of each mask span.
- no_mask_overlap – Whether to prevent masks from overlapping.
- mask_min_space – Minimum space between spans (used only when no_mask_overlap is True).
- mask_channel_prob – The probability of replacing a feature with 0.
- mask_channel_selection – How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].
- mask_channel_other – Secondary mask argument for channel masking (used for more complex distributions).
- mask_channel_length – The length of each mask span for channel masking.
- no_mask_channel_overlap – Whether to prevent channel masks from overlapping.
- mask_channel_min_space – Minimum space between spans for channel masking (used only when no_mask_channel_overlap is True).
- skip_masked – If True, skip computing losses over masked frames.
- skip_nomask – If True, skip computing losses over unmasked frames.
- num_classes – The number of classes in the labels.
- final_dim – Project final representations and targets to final_dim.
- feature_grad_mult – The factor to scale the convolutional feature extraction layer gradients by. The scale factor will not affect the forward pass.
- finetuning – Whether to fine-tune the model on ASR or other tasks.
- freeze_encoder_updates – The number of update steps during which the encoder parameters stay frozen in ASR fine-tuning.
HuBERT-specific arguments: please refer to https://pytorch.org/audio/stable/generated/torchaudio.models.hubert_pretrain_model.html#torchaudio.models.hubert_pretrain_model
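To make the `[output_channel, kernel_size, stride]` format of extractor_conv_layer_config more concrete, the sketch below estimates how many frames the feature extractor produces for a given number of input samples. It assumes unpadded 1-D convolutions with dilation 1, which is the usual setup for this kind of extractor but is not confirmed by this page; the helper names `num_output_frames` and `total_stride` are hypothetical and not part of ESPnet or torchaudio.

```python
from typing import List

# Default extractor_conv_layer_config from the class signature:
# each entry is [output_channel, kernel_size, stride].
DEFAULT_CONV_CONFIG: List[List[int]] = (
    [[512, 10, 5]] + [[512, 3, 2]] * 4 + [[512, 2, 2]] * 2
)


def num_output_frames(num_samples: int, conv_config: List[List[int]]) -> int:
    """Frames left after the conv stack (assumption: no padding, dilation 1)."""
    length = num_samples
    for _out_channels, kernel_size, stride in conv_config:
        # Standard output-length formula for an unpadded 1-D convolution.
        length = (length - kernel_size) // stride + 1
    return length


def total_stride(conv_config: List[List[int]]) -> int:
    """Product of all layer strides, i.e. input samples per output frame."""
    product = 1
    for _out_channels, _kernel_size, stride in conv_config:
        product *= stride
    return product


if __name__ == "__main__":
    print(total_stride(DEFAULT_CONV_CONFIG))              # 320
    print(num_output_frames(16000, DEFAULT_CONV_CONFIG))  # 49
```

Under these assumptions the default configuration downsamples by a factor of 320, so 16000 input samples yield 49 frames.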
Initialize internal Module state, shared by both nn.Module and ScriptModule.
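As an illustration of how mask_prob and mask_length interact when mask_selection is "static", here is a minimal sketch of span masking over a sequence of frames. It is loosely modeled on the span-masking scheme common to wav2vec 2.0 style models and is an assumption about the general idea, not the torchaudio implementation; the function `sample_static_span_mask` is hypothetical, and it ignores no_mask_overlap, mask_min_space and padding.

```python
import random
from typing import List, Optional


def sample_static_span_mask(
    num_frames: int,
    mask_prob: float,
    mask_length: int,
    rng: Optional[random.Random] = None,
) -> List[bool]:
    """Return a boolean mask with fixed-length spans set to True.

    Assumption: the number of spans is chosen so that roughly
    mask_prob * num_frames frames are covered before overlaps.
    """
    rng = rng or random.Random()
    mask = [False] * num_frames
    max_start = num_frames - mask_length
    if max_start < 0:
        # Sequence is shorter than one span: leave it unmasked in this sketch.
        return mask

    # Probabilistic rounding of the expected number of spans.
    num_spans = int(mask_prob * num_frames / mask_length + rng.random())
    num_spans = min(num_spans, max_start + 1)

    # Distinct start positions; the spans themselves may still overlap.
    starts = rng.sample(range(max_start + 1), num_spans)
    for start in starts:
        for index in range(start, start + mask_length):
            mask[index] = True
    return mask


if __name__ == "__main__":
    demo = sample_static_span_mask(100, mask_prob=0.8, mask_length=10,
                                   rng=random.Random(0))
    print(sum(demo), "of", len(demo), "frames masked")
```

Because spans may overlap, the fraction of masked frames in this sketch is typically lower than mask_prob.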
forward(xs_pad: Tensor, ilens: Tensor, ys_pad: Tensor = None, ys_pad_length: Tensor = None, prev_states: Tensor = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward Hubert Pretrain Encoder.
- Parameters:
- xs_pad – input tensor (B, L, D)
- ilens – input length (B)
- prev_states – Not to be used now.
- Returns: position embedded tensor and mask
output_size() → int
reload_pretrained_parameters()
