espnet2.asr.encoder.avhubert_encoder.AVHubertConfig
class espnet2.asr.encoder.avhubert_encoder.AVHubertConfig(sample_rate: int = 16000, label_rate: int = -1, encoder_layers: int = 12, encoder_embed_dim: int = 768, encoder_ffn_embed_dim: int = 3072, encoder_attention_heads: int = 12, activation_fn: str = 'gelu', dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.0, encoder_layerdrop: float = 0.0, dropout_input: float = 0.0, dropout_features: float = 0.0, final_dim: int = 0, untie_final_proj: bool = False, layer_norm_first: bool = False, conv_feature_layers: str = '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', conv_bias: bool = False, logit_temp: float = 0.1, target_glu: bool = False, feature_grad_mult: float = 1.0, mask_length_audio: int = 10, mask_prob_audio: float = 0.65, mask_length_image: int = 10, mask_prob_image: float = 0.65, mask_selection: str = 'static', mask_other: float = 0, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_length: int = 10, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, conv_pos: int = 128, conv_pos_groups: int = 16, latent_temp: Tuple[float, float, float] = (2, 0.5, 0.999995), skip_masked: bool = False, skip_nomask: bool = False, resnet_relu_type: str = 'prelu', resnet_weights: str | None = None, sim_type: str = 'cosine', sub_encoder_layers: int = 0, audio_feat_dim: int = -1, modality_dropout: float = 0, audio_dropout: float = 0, modality_fuse: str = 'concat', selection_type: str = 'same_other_seq', masking_type: str = 'input', decoder_embed_dim: int = 768, decoder_ffn_embed_dim: int = 3072, decoder_layers: int = 6, decoder_layerdrop: float = 0.0, decoder_attention_heads: int = 4, decoder_learned_pos: bool = False, decoder_normalize_before: bool = False, no_token_positional_embeddings: bool = False, decoder_dropout: float = 0.1, decoder_attention_dropout: float = 0.1, decoder_activation_dropout: float = 0.0, max_target_positions: int = 2048, share_decoder_input_output_embed: bool = False, audio_only: bool = False, no_scale_embedding: bool = True)
Bases: object
Configuration for AV-HuBERT model.
This class encapsulates the configuration settings required for the AV-HuBERT model. It includes parameters related to the audio and video modalities, as well as dropout rates and other model hyperparameters.
sample_rate
Target sample rate; audio files will be up- or down-sampled to this rate. Default is 16000.
- Type: int
label_rate
Label frame rate. Set to -1 for sequence labels. Default is -1.
- Type: int
encoder_layers
Number of encoder layers in the transformer. Default is 12.
- Type: int
encoder_embed_dim
Encoder embedding dimension. Default is 768.
- Type: int
encoder_ffn_embed_dim
Encoder embedding dimension for feedforward networks. Default is 3072.
- Type: int
encoder_attention_heads
Number of attention heads in the encoder. Default is 12.
- Type: int
activation_fn
Activation function to use. Default is “gelu”.
- Type: str
dropout
Dropout probability for the transformer. Default is 0.1.
- Type: float
attention_dropout
Dropout probability for attention weights. Default is 0.1.
- Type: float
activation_dropout
Dropout probability after activation in feedforward networks. Default is 0.0.
- Type: float
encoder_layerdrop
Probability of dropping a transformer layer. Default is 0.0.
- Type: float
dropout_input
Dropout applied to the input after feature extraction. Default is 0.0.
- Type: float
dropout_features
Dropout applied to the features after feature extraction. Default is 0.0.
- Type: float
final_dim
Project final representations and targets to this many dimensions. Set to encoder_embed_dim if <= 0. Default is 0.
- Type: int
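As a quick illustration of the fallback rule above, the effective projection width can be derived from a config instance. This is a minimal sketch, assuming the attributes are stored as plain fields as listed here:

```python
from espnet2.asr.encoder.avhubert_encoder import AVHubertConfig

cfg = AVHubertConfig()  # final_dim defaults to 0
# A non-positive final_dim falls back to the encoder embedding dimension.
effective_final_dim = cfg.final_dim if cfg.final_dim > 0 else cfg.encoder_embed_dim
print(effective_final_dim)  # 768 with the defaults
```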
untie_final_proj
Use separate projection for each target. Default is False.
- Type: bool
layer_norm_first
Apply layer normalization first in the transformer. Default is False.
- Type: bool
conv_feature_layers
Description of convolutional feature extraction layers in the form of a Python list. Default is “[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2”.
- Type: str
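Since conv_feature_layers is a Python-style expression rather than a parsed structure, it can be evaluated into a list of (dim, kernel_size, stride) tuples. A minimal sketch, assuming the default string shown above; fairseq-derived code commonly evaluates this field with eval:

```python
from espnet2.asr.encoder.avhubert_encoder import AVHubertConfig

cfg = AVHubertConfig()
# "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2" -> 7 (dim, kernel, stride) tuples
conv_layers = eval(cfg.conv_feature_layers)
print(conv_layers[0])    # (512, 10, 5)
print(len(conv_layers))  # 7
```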
conv_bias
Include bias in the convolutional encoder. Default is False.
- Type: bool
logit_temp
Temperature to divide logits by. Default is 0.1.
- Type: float
target_glu
Adds projection + GLU to targets. Default is False.
- Type: bool
feature_grad_mult
Multiply feature extractor variable gradients by this value. Default is 1.0.
- Type: float
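Scaling the feature extractor's gradients is typically done with a small custom autograd function that leaves the forward pass unchanged and multiplies gradients on the way back. The sketch below is illustrative (the class name is hypothetical), not the model's own implementation:

```python
import torch


class GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies incoming gradients by `scale`."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None


features = torch.randn(2, 512, requires_grad=True)
out = GradScale.apply(features, 0.1)  # e.g. feature_grad_mult = 0.1
out.sum().backward()
print(features.grad[0, 0])  # tensor(0.1000): gradient scaled by the multiplier
```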
mask_length_audio
Length of the mask for audio features. Default is 10.
- Type: int
mask_prob_audio
Probability of replacing a token with a mask for audio features. Default is 0.65.
- Type: float
mask_length_image
Length of the mask for image features. Default is 10.
- Type: int
mask_prob_image
Probability of replacing a token with a mask for image features. Default is 0.65.
- Type: float
mask_selection
Method for choosing mask length. Default is “static”.
- Type: str
mask_other
Secondary mask argument for more complex distributions. Default is 0.
- Type: float
no_mask_overlap
Whether to allow masks to overlap. Default is False.
- Type: bool
mask_min_space
Minimum space between spans if no overlap is enabled. Default is 1.
- Type: int
mask_channel_length
Length of the mask for features (channels). Default is 10.
- Type: int
mask_channel_prob
Probability of replacing a feature with 0 for channel masking. Default is 0.0.
- Type: float
mask_channel_selection
Method for choosing mask length for channel masking. Default is “static”.
- Type: str
mask_channel_other
Secondary mask argument for more complex distributions for channel masking. Default is 0.
- Type: float
no_mask_channel_overlap
Whether to allow channel masks to overlap. Default is False.
- Type: bool
mask_channel_min_space
Minimum space between spans if no overlap is enabled for channel masking. Default is 1.
- Type: int
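The masking parameters above follow the HuBERT-style span-masking scheme: spans of mask_length frames are chosen so that roughly mask_prob of the sequence ends up masked, with mask_selection controlling how span lengths are drawn. Below is a simplified sketch of the “static” case (fixed-length spans, overlap allowed), not the library's own implementation; the function name is hypothetical:

```python
import numpy as np


def static_span_mask(seq_len, mask_prob, mask_length, rng=None):
    """Simplified 'static' span masking: fixed-length spans covering
    roughly mask_prob * seq_len frames. Overlap between spans is allowed."""
    rng = rng or np.random.default_rng()
    num_spans = int(mask_prob * seq_len / float(mask_length) + rng.random())
    num_spans = min(num_spans, max(seq_len - mask_length, 1))
    mask = np.zeros(seq_len, dtype=bool)
    starts = rng.choice(max(seq_len - mask_length, 1), size=num_spans, replace=False)
    for start in starts:
        mask[start:start + mask_length] = True
    return mask


mask = static_span_mask(seq_len=100, mask_prob=0.65, mask_length=10)
print(mask.sum())  # roughly 65 frames masked (fewer if spans overlap)
```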
conv_pos
Number of filters for convolutional positional embeddings. Default is 128.
- Type: int
conv_pos_groups
Number of groups for convolutional positional embedding. Default is 16.
- Type: int
latent_temp
Legacy parameter (to be removed). Default is (2, 0.5, 0.999995).
- Type: Tuple[float, float, float]
skip_masked
Skip computing losses over masked frames. Default is False.
- Type: bool
skip_nomask
Skip computing losses over unmasked frames. Default is False.
- Type: bool
resnet_relu_type
ReLU type for ResNet. Default is “prelu”.
- Type: str
resnet_weights
Pretrained ResNet weights. Default is None.
- Type: Optional[str]
sim_type
Similarity type for loss computation. Default is “cosine”.
- Type: str
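sim_type and logit_temp work together when turning feature/target similarities into prediction logits: similarities are computed (cosine by default) and divided by the temperature. A minimal sketch with hypothetical tensor shapes, not the actual loss code:

```python
import torch
import torch.nn.functional as F

logit_temp = 0.1               # AVHubertConfig default
feats = torch.randn(4, 256)    # projected frame features (hypothetical shape)
targets = torch.randn(8, 256)  # candidate target embeddings (hypothetical shape)

# Cosine similarity between every frame and every target, sharpened by the temperature.
sims = F.normalize(feats, dim=-1) @ F.normalize(targets, dim=-1).t()
logits = sims / logit_temp
print(logits.shape)  # torch.Size([4, 8])
```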
sub_encoder_layers
Number of transformer layers for single modality. Default is 0.
- Type: int
audio_feat_dim
Audio feature dimension. Default is -1.
- Type: int
modality_dropout
Probability of dropping one entire modality. Default is 0.
- Type: float
audio_dropout
Probability of dropping the audio features. Default is 0.
- Type: float
modality_fuse
Method for fusing two modalities: “add” or “concat”. Default is “concat”.
- Type: str
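The fusion mode determines the width of the fused representation: “concat” concatenates the audio and video features along the channel dimension (doubling the width relative to a single modality), while “add” sums them element-wise. An illustrative sketch with hypothetical tensors:

```python
import torch

B, T, D = 2, 50, 768  # batch, frames, per-modality feature dim (hypothetical)
audio_feats = torch.randn(B, T, D)
video_feats = torch.randn(B, T, D)

fused_concat = torch.cat([audio_feats, video_feats], dim=-1)  # modality_fuse="concat"
fused_add = audio_feats + video_feats                         # modality_fuse="add"
print(fused_concat.shape)  # torch.Size([2, 50, 1536])
print(fused_add.shape)     # torch.Size([2, 50, 768])
```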
selection_type
Strategy for selecting images. Default is “same_other_seq”.
- Type: str
masking_type
Type of masking: “input” or “feature”. Default is “input”.
- Type: str
decoder_embed_dim
Decoder embedding dimension. Default is 768.
- Type: int
decoder_ffn_embed_dim
Decoder embedding dimension for FFN. Default is 3072.
- Type: int
decoder_layers
Number of decoder layers. Default is 6.
- Type: int
decoder_layerdrop
Decoder layer drop chance. Default is 0.0.
- Type: float
decoder_attention_heads
Number of decoder attention heads. Default is 4.
- Type: int
decoder_learned_pos
Use learned positional embeddings in the decoder. Default is False.
- Type: bool
decoder_normalize_before
Apply layer normalization before each decoder block. Default is False.
- Type: bool
no_token_positional_embeddings
If set, disables positional embeddings (outside self-attention). Default is False.
- Type: bool
decoder_dropout
Dropout probability in the decoder. Default is 0.1.
- Type: float
decoder_attention_dropout
Dropout probability for attention weights inside the decoder. Default is 0.1.
- Type: float
decoder_activation_dropout
Dropout probability after activation in FFN inside the decoder. Default is 0.0.
- Type: float
max_target_positions
Maximum target positions. Default is 2048.
- Type: int
share_decoder_input_output_embed
Share decoder input and output embeddings. Default is False.
- Type: bool
audio_only
Whether to use audio stream only. Default is False.
- Type: bool
no_scale_embedding
If True, do not scale embeddings. Default is True.
- Type: bool
Examples
config = AVHubertConfig(
    sample_rate=16000,
    encoder_layers=12,
    modality_fuse="concat",
    audio_only=True,
)
activation_dropout
activation_fn
attention_dropout
audio_dropout
audio_feat_dim
audio_only
conv_bias
conv_feature_layers
conv_pos
conv_pos_groups
decoder_activation_dropout
decoder_attention_dropout
decoder_attention_heads
decoder_dropout
decoder_embed_dim
decoder_ffn_embed_dim
decoder_layerdrop
decoder_layers
decoder_learned_pos
decoder_normalize_before
dropout
dropout_features
dropout_input
encoder_attention_heads
encoder_embed_dim
encoder_ffn_embed_dim
encoder_layerdrop
encoder_layers
feature_grad_mult
final_dim
label_rate
latent_temp
layer_norm_first
logit_temp
mask_channel_length
mask_channel_min_space
mask_channel_other
mask_channel_prob
mask_channel_selection
mask_length_audio
mask_length_image
mask_min_space
mask_other
mask_prob_audio
mask_prob_image
mask_selection
masking_type
max_target_positions
modality_dropout
modality_fuse
no_mask_channel_overlap
no_mask_overlap
no_scale_embedding
no_token_positional_embeddings
resnet_relu_type
resnet_weights
sample_rate
selection_type
share_decoder_input_output_embed
sim_type
skip_masked
skip_nomask
sub_encoder_layers
target_glu
untie_final_proj