espnet2.asr.encoder.avhubert_encoder.AVHubertConfig
class espnet2.asr.encoder.avhubert_encoder.AVHubertConfig(sample_rate: int = 16000, label_rate: int = -1, encoder_layers: int = 12, encoder_embed_dim: int = 768, encoder_ffn_embed_dim: int = 3072, encoder_attention_heads: int = 12, activation_fn: str = 'gelu', dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.0, encoder_layerdrop: float = 0.0, dropout_input: float = 0.0, dropout_features: float = 0.0, final_dim: int = 0, untie_final_proj: bool = False, layer_norm_first: bool = False, conv_feature_layers: str = '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', conv_bias: bool = False, logit_temp: float = 0.1, target_glu: bool = False, feature_grad_mult: float = 1.0, mask_length_audio: int = 10, mask_prob_audio: float = 0.65, mask_length_image: int = 10, mask_prob_image: float = 0.65, mask_selection: str = 'static', mask_other: float = 0, no_mask_overlap: bool = False, mask_min_space: int = 1, mask_channel_length: int = 10, mask_channel_prob: float = 0.0, mask_channel_selection: str = 'static', mask_channel_other: float = 0, no_mask_channel_overlap: bool = False, mask_channel_min_space: int = 1, conv_pos: int = 128, conv_pos_groups: int = 16, latent_temp: Tuple[float, float, float] = (2, 0.5, 0.999995), skip_masked: bool = False, skip_nomask: bool = False, resnet_relu_type: str = 'prelu', resnet_weights: str | None = None, sim_type: str = 'cosine', sub_encoder_layers: int = 0, audio_feat_dim: int = -1, modality_dropout: float = 0, audio_dropout: float = 0, modality_fuse: str = 'concat', selection_type: str = 'same_other_seq', masking_type: str = 'input', decoder_embed_dim: int = 768, decoder_ffn_embed_dim: int = 3072, decoder_layers: int = 6, decoder_layerdrop: float = 0.0, decoder_attention_heads: int = 4, decoder_learned_pos: bool = False, decoder_normalize_before: bool = False, no_token_positional_embeddings: bool = False, decoder_dropout: float = 0.1, decoder_attention_dropout: float = 0.1, decoder_activation_dropout: float = 0.0, max_target_positions: int = 2048, share_decoder_input_output_embed: bool = False, audio_only: bool = False, no_scale_embedding: bool = True)
Bases: object
Configuration for AV-HuBERT model.
This class encapsulates the configuration settings required for the AV-HuBERT model. It includes parameters related to the audio and video modalities, as well as dropout rates and other model hyperparameters.
sample_rate
Target sample rate; audio files will be up- or down-sampled to this rate. Default is 16000.
- Type: int
label_rate
Label frame rate. Set to -1 for sequence labels. Default is -1.
- Type: int
encoder_layers
Number of encoder layers in the transformer. Default is 12.
- Type: int
encoder_embed_dim
Encoder embedding dimension. Default is 768.
- Type: int
encoder_ffn_embed_dim
Encoder embedding dimension for feedforward networks. Default is 3072.
- Type: int
encoder_attention_heads
Number of attention heads in the encoder. Default is 12.
- Type: int
activation_fn
Activation function to use. Default is “gelu”.
- Type: str
dropout
Dropout probability for the transformer. Default is 0.1.
- Type: float
attention_dropout
Dropout probability for attention weights. Default is 0.1.
- Type: float
activation_dropout
Dropout probability after activation in feedforward networks. Default is 0.0.
- Type: float
encoder_layerdrop
Probability of dropping a transformer layer. Default is 0.0.
- Type: float
dropout_input
Dropout applied to the input after feature extraction. Default is 0.0.
- Type: float
dropout_features
Dropout applied to the features after feature extraction. Default is 0.0.
- Type: float
final_dim
Project final representations and targets to this many dimensions. Set to encoder_embed_dim if <= 0. Default is 0.
- Type: int
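As a quick illustration of the fallback rule above, the effective projection width can be derived from a config instance. This is a minimal sketch, assuming the attributes are stored as plain fields as listed here:

```python
from espnet2.asr.encoder.avhubert_encoder import AVHubertConfig

cfg = AVHubertConfig()  # final_dim defaults to 0
# A non-positive final_dim falls back to the encoder embedding dimension.
effective_final_dim = cfg.final_dim if cfg.final_dim > 0 else cfg.encoder_embed_dim
print(effective_final_dim)  # 768 with the defaults
```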
untie_final_proj
Use separate projection for each target. Default is False.
- Type: bool
layer_norm_first
Apply layer normalization first in the transformer. Default is False.
- Type: bool
conv_feature_layers
Description of convolutional feature extraction layers in the form of a Python list. Default is “[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2”.
- Type: str
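Since conv_feature_layers is a Python-style expression rather than a parsed structure, it can be evaluated into a list of (dim, kernel_size, stride) tuples. A minimal sketch, assuming the default string shown above; fairseq-derived code commonly evaluates this field with eval:

```python
from espnet2.asr.encoder.avhubert_encoder import AVHubertConfig

cfg = AVHubertConfig()
# "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2" -> 7 (dim, kernel, stride) tuples
conv_layers = eval(cfg.conv_feature_layers)
print(conv_layers[0])    # (512, 10, 5)
print(len(conv_layers))  # 7
```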
conv_bias
Include bias in the convolutional encoder. Default is False.
- Type: bool
logit_temp
Temperature to divide logits by. Default is 0.1.
- Type: float
target_glu
Adds projection + GLU to targets. Default is False.
- Type: bool
feature_grad_mult
Multiply feature extractor variable gradients by this value. Default is 1.0.
- Type: float
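Scaling the feature extractor's gradients is typically done with a small custom autograd function that leaves the forward pass unchanged and multiplies gradients on the way back. The sketch below is illustrative (the class name is hypothetical), not the model's own implementation:

```python
import torch


class GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies incoming gradients by `scale`."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None


features = torch.randn(2, 512, requires_grad=True)
out = GradScale.apply(features, 0.1)  # e.g. feature_grad_mult = 0.1
out.sum().backward()
print(features.grad[0, 0])  # tensor(0.1000): gradient scaled by the multiplier
```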
mask_length_audio
Length of the mask for audio features. Default is 10.
- Type: int
mask_prob_audio
Probability of replacing a token with a mask for audio features. Default is 0.65.
- Type: float
mask_length_image
Length of the mask for image features. Default is 10.
- Type: int
mask_prob_image
Probability of replacing a token with a mask for image features. Default is 0.65.
- Type: float
mask_selection
Method for choosing mask length. Default is “static”.
- Type: str
mask_other
Secondary mask argument for more complex distributions. Default is 0.
- Type: float
no_mask_overlap
Whether to allow masks to overlap. Default is False.
- Type: bool
mask_min_space
Minimum space between spans if no overlap is enabled. Default is 1.
- Type: int
mask_channel_length
Length of the mask for features (channels). Default is 10.
- Type: int
mask_channel_prob
Probability of replacing a feature with 0 for channel masking. Default is 0.0.
- Type: float
mask_channel_selection
Method for choosing mask length for channel masking. Default is “static”.
- Type: str
mask_channel_other
Secondary mask argument for more complex distributions for channel masking. Default is 0.
- Type: float
no_mask_channel_overlap
Whether to allow channel masks to overlap. Default is False.
- Type: bool
mask_channel_min_space
Minimum space between spans if no overlap is enabled for channel masking. Default is 1.
- Type: int
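The masking parameters above follow the HuBERT-style span-masking scheme: spans of mask_length frames are chosen so that roughly mask_prob of the sequence ends up masked, with mask_selection controlling how span lengths are drawn. Below is a simplified sketch of the “static” case (fixed-length spans, overlap allowed), not the library's own implementation; the function name is hypothetical:

```python
import numpy as np


def static_span_mask(seq_len, mask_prob, mask_length, rng=None):
    """Simplified 'static' span masking: fixed-length spans covering
    roughly mask_prob * seq_len frames. Overlap between spans is allowed."""
    rng = rng or np.random.default_rng()
    num_spans = int(mask_prob * seq_len / float(mask_length) + rng.random())
    num_spans = min(num_spans, max(seq_len - mask_length, 1))
    mask = np.zeros(seq_len, dtype=bool)
    starts = rng.choice(max(seq_len - mask_length, 1), size=num_spans, replace=False)
    for start in starts:
        mask[start:start + mask_length] = True
    return mask


mask = static_span_mask(seq_len=100, mask_prob=0.65, mask_length=10)
print(mask.sum())  # roughly 65 frames masked (fewer if spans overlap)
```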
conv_pos
Number of filters for convolutional positional embeddings. Default is 128.
- Type: int
conv_pos_groups
Number of groups for convolutional positional embedding. Default is 16.
- Type: int
latent_temp
Legacy parameter (to be removed). Default is (2, 0.5, 0.999995).
- Type: Tuple[float, float, float]
skip_masked
Skip computing losses over masked frames. Default is False.
- Type: bool
skip_nomask
Skip computing losses over unmasked frames. Default is False.
- Type: bool
resnet_relu_type
ReLU type for ResNet. Default is “prelu”.
- Type: str
resnet_weights
Pretrained ResNet weights. Default is None.
- Type: Optional[str]
sim_type
Similarity type for loss computation. Default is “cosine”.
- Type: str
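sim_type and logit_temp work together when turning feature/target similarities into prediction logits: similarities are computed (cosine by default) and divided by the temperature. A minimal sketch with hypothetical tensor shapes, not the actual loss code:

```python
import torch
import torch.nn.functional as F

logit_temp = 0.1               # AVHubertConfig default
feats = torch.randn(4, 256)    # projected frame features (hypothetical shape)
targets = torch.randn(8, 256)  # candidate target embeddings (hypothetical shape)

# Cosine similarity between every frame and every target, sharpened by the temperature.
sims = F.normalize(feats, dim=-1) @ F.normalize(targets, dim=-1).t()
logits = sims / logit_temp
print(logits.shape)  # torch.Size([4, 8])
```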
sub_encoder_layers
Number of transformer layers for single modality. Default is 0.
- Type: int
audio_feat_dim
Audio feature dimension. Default is -1.
- Type: int
modality_dropout
Probability of dropping one entire modality. Default is 0.
- Type: float
audio_dropout
Probability of dropping the audio features. Default is 0.
- Type: float
modality_fuse
Method for fusing two modalities: “add” or “concat”. Default is “concat”.
- Type: str
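The fusion mode determines the width of the fused representation: “concat” concatenates the audio and video features along the channel dimension (doubling the width relative to a single modality), while “add” sums them element-wise. An illustrative sketch with hypothetical tensors:

```python
import torch

B, T, D = 2, 50, 768  # batch, frames, per-modality feature dim (hypothetical)
audio_feats = torch.randn(B, T, D)
video_feats = torch.randn(B, T, D)

fused_concat = torch.cat([audio_feats, video_feats], dim=-1)  # modality_fuse="concat"
fused_add = audio_feats + video_feats                         # modality_fuse="add"
print(fused_concat.shape)  # torch.Size([2, 50, 1536])
print(fused_add.shape)     # torch.Size([2, 50, 768])
```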
selection_type
Strategy for selecting images. Default is “same_other_seq”.
- Type: str
masking_type
Type of masking: “input” or “feature”. Default is “input”.
- Type: str
decoder_embed_dim
Decoder embedding dimension. Default is 768.
- Type: int
decoder_ffn_embed_dim
Decoder embedding dimension for FFN. Default is 3072.
- Type: int
decoder_layers
Number of decoder layers. Default is 6.
- Type: int
decoder_layerdrop
Decoder layer drop chance. Default is 0.0.
- Type: float
decoder_attention_heads
Number of decoder attention heads. Default is 4.
- Type: int
decoder_learned_pos
Use learned positional embeddings in the decoder. Default is False.
- Type: bool
decoder_normalize_before
Apply layer normalization before each decoder block. Default is False.
- Type: bool
no_token_positional_embeddings
If set, disables positional embeddings (outside self-attention). Default is False.
- Type: bool
decoder_dropout
Dropout probability in the decoder. Default is 0.1.
- Type: float
decoder_attention_dropout
Dropout probability for attention weights inside the decoder. Default is 0.1.
- Type: float
decoder_activation_dropout
Dropout probability after activation in FFN inside the decoder. Default is 0.0.
- Type: float
max_target_positions
Maximum target positions. Default is 2048.
- Type: int
share_decoder_input_output_embed
Share decoder input and output embeddings. Default is False.
- Type: bool
audio_only
Whether to use audio stream only. Default is False.
- Type: bool
no_scale_embedding
If True, do not scale embeddings. Default is True.
- Type: bool
Examples
config = AVHubertConfig(
    sample_rate=16000,
    encoder_layers=12,
    modality_fuse="concat",
    audio_only=True,
)
activation_dropout
activation_fn
attention_dropout
audio_dropout
audio_feat_dim
audio_only
conv_bias
conv_feature_layers
conv_pos
conv_pos_groups
decoder_activation_dropout
decoder_attention_dropout
decoder_attention_heads
decoder_dropout
decoder_embed_dim
decoder_ffn_embed_dim
decoder_layerdrop
decoder_layers
decoder_learned_pos
decoder_normalize_before
dropout
dropout_features
dropout_input
encoder_attention_heads
encoder_embed_dim
encoder_ffn_embed_dim
encoder_layerdrop
encoder_layers
feature_grad_mult
final_dim
label_rate
latent_temp
layer_norm_first
logit_temp
mask_channel_length
mask_channel_min_space
mask_channel_other
mask_channel_prob
mask_channel_selection
mask_length_audio
mask_length_image
mask_min_space
mask_other
mask_prob_audio
mask_prob_image
mask_selection
masking_type
max_target_positions
modality_dropout
modality_fuse
no_mask_channel_overlap
no_mask_overlap
no_scale_embedding
no_token_positional_embeddings
resnet_relu_type
resnet_weights
sample_rate
selection_type
share_decoder_input_output_embed
sim_type
skip_masked
skip_nomask
sub_encoder_layers
target_glu
untie_final_proj