espnet2.asr.encoder.multiconvformer_encoder.MultiConvConformerEncoder
class espnet2.asr.encoder.multiconvformer_encoder.MultiConvConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, cgmlp_linear_units: int = 2048, multicgmlp_type: str = 'concat_fusion', multicgmlp_kernel_sizes: int | str = '7,15,23,31', multicgmlp_merge_conv_kernel: int = 31, multicgmlp_use_non_linear: bool = True, use_linear_after_conv: bool = False, gate_activation: str = 'identity', input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, stochastic_depth_rate: float | List[float] = 0.0, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000)
Bases: AbsEncoder
Multiconvformer encoder module for automatic speech recognition (ASR).
This encoder extends the Conformer architecture by using multiple convolution kernels, combined through gating (MultiCGMLP), inside each block alongside multi-head self-attention. It supports a variety of configurations for positional encoding, attention types, and feed-forward layers, making it versatile for different ASR tasks.
For detailed information, refer to the paper: https://arxiv.org/abs/2407.03718.
_output_size
The output dimension of the encoder.
Type: int
Parameters:
- input_size (int) – Input dimension.
- output_size (int) – Dimension of attention. Default is 256.
- attention_heads (int) – The number of heads for multi-head attention. Default is 4.
- linear_units (int) – Number of units for position-wise feed forward. Default is 2048.
- num_blocks (int) – Number of encoder blocks. Default is 6.
- dropout_rate (float) – Dropout rate. Default is 0.1.
- positional_dropout_rate (float) – Dropout rate after adding positional encoding. Default is 0.1.
- attention_dropout_rate (float) – Dropout rate in attention. Default is 0.0.
- cgmlp_linear_units (int) – Number of units in CGMLP block. Default is 2048.
- multicgmlp_type (str) – Type of CGMLP (“sum”, “weighted_sum”, “concat”, “concat_fusion”). Default is “concat_fusion”.
- multicgmlp_kernel_sizes (Union[int, str]) – Comma-separated list of kernel sizes. Default is “7,15,23,31”.
- multicgmlp_merge_conv_kernel (int) – Kernel size of the depthwise convolution used for fusion. Default is 31.
- multicgmlp_use_non_linear (bool) – Apply a non-linearity when merging the MultiCGMLP branches. Default is True.
- use_linear_after_conv (bool) – Use a linear layer after MultiCGMLP. Default is False.
- gate_activation (str) – Activation function for CGMLP gating. Default is “identity”.
- input_layer (Union[str, torch.nn.Module]) – Type of input layer. Default is “conv2d”.
- normalize_before (bool) – Use layer normalization before the first block. Default is True.
- concat_after (bool) – Concatenate attention input and output. Default is False.
- positionwise_layer_type (str) – Type of positionwise layer (“linear”, “conv1d”, or “conv1d-linear”). Default is “linear”.
- positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer. Default is 3.
- rel_pos_type (str) – Type of relative positional encoding (“legacy” or “latest”). Default is “legacy”.
- pos_enc_layer_type (str) – Encoder positional encoding layer type. Default is “rel_pos”.
- selfattention_layer_type (str) – Encoder attention layer type. Default is “rel_selfattn”.
- activation_type (str) – Activation function type. Default is “swish”.
- macaron_style (bool) – Use macaron style for positionwise layer. Default is False.
- use_cnn_module (bool) – Use convolution module. Default is True.
- zero_triu (bool) – Zero the upper triangular part of attention matrix. Default is False.
- padding_idx (int) – Padding index for input_layer=”embed”. Default is -1.
- interctc_layer_idx (List[int]) – Indices for intermediate CTC layers. Default is an empty list.
- interctc_use_conditioning (bool) – Use conditioning for intermediate CTC. Default is False.
- stochastic_depth_rate (Union[float, List[float]]) – Rate for stochastic depth. Default is 0.0.
- layer_drop_rate (float) – Drop rate for layers. Default is 0.0.
- max_pos_emb_len (int) – Maximum positional embedding length. Default is 5000.
######### Examples
Create an instance of the encoder and run a forward pass:
>>> import torch
>>> encoder = MultiConvConformerEncoder(input_size=128)
>>> xs_pad = torch.randn(32, 100, 128)  # (batch_size, seq_len, input_size)
>>> ilens = torch.tensor([100] * 32)  # input lengths
>>> output, olens, _ = encoder(xs_pad, ilens)
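The multi-convolution branch can also be configured explicitly at construction time. Below is a minimal sketch using the constructor arguments documented above (values are illustrative, not recommended settings):
>>> encoder = MultiConvConformerEncoder(
...     input_size=80,
...     output_size=256,
...     multicgmlp_type="concat_fusion",  # one of “sum”, “weighted_sum”, “concat”, “concat_fusion”
...     multicgmlp_kernel_sizes="7,15,23,31",  # comma-separated kernel sizes
...     multicgmlp_merge_conv_kernel=31,
... )
>>> encoder.output_size()
256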
####### NOTE This encoder is designed for ASR tasks and can be customized through the constructor arguments listed above.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None, ctc: CTC | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Calculate forward propagation through the MultiConvConformerEncoder.
This method applies the embedding/subsampling front-end selected by input_layer, processes the result through the encoder blocks, and returns the output tensor along with the corresponding output lengths.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (#batch, L, input_size).
- ilens (torch.Tensor) – Input lengths of shape (#batch).
- prev_states (torch.Tensor, optional) – Previous states. Not used currently.
- ctc (CTC, optional) – CTC module, used for intermediate CTC conditioning when interctc_use_conditioning is True.
- Returns:
- Output tensor of shape (#batch, L, output_size).
- Output lengths of shape (#batch).
- Placeholder for additional output, not used currently (None).
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises: TooShortUttError – If the input sequence length is shorter than the minimum required for subsampling.
######### Examples
>>> import torch
>>> encoder = MultiConvConformerEncoder(input_size=80, output_size=256)
>>> xs_pad = torch.randn(32, 100, 80) # Batch of 32, 100 time steps, 80 features
>>> ilens = torch.tensor([100] * 32) # All inputs have length 100
>>> output, olens, _ = encoder.forward(xs_pad, ilens)
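Since the conv2d input layer subsamples along time, very short utterances raise TooShortUttError. A minimal guard sketch, assuming the exception lives at its usual ESPnet location (espnet.nets.pytorch_backend.transformer.subsampling):
>>> from espnet.nets.pytorch_backend.transformer.subsampling import TooShortUttError
>>> short = torch.randn(1, 3, 80)  # 3 frames: too short for conv2d subsampling
>>> try:
...     encoder(short, torch.tensor([3]))
... except TooShortUttError as e:
...     print("skipped short utterance:", e)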
####### NOTE This method uses the embedding and encoder layers defined in the constructor; ensure the input features are properly preprocessed before calling it.
output_size() → int
Return the output size of the MultiConvConformerEncoder.
This method provides the dimension of the output from the encoder. The output size is typically used in subsequent layers of a neural network model, such as in decoders or for classification tasks.
- Returns: The output size of the encoder, defined during initialization.
- Return type: int
######### Examples
>>> encoder = MultiConvConformerEncoder(input_size=128, output_size=256)
>>> encoder.output_size()
256
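A common use is sizing downstream modules; for example (a sketch, with vocab_size as a hypothetical placeholder):
>>> import torch
>>> vocab_size = 5000  # hypothetical vocabulary size
>>> projection = torch.nn.Linear(encoder.output_size(), vocab_size)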
####### NOTE The output size is set during the initialization of the encoder and cannot be changed afterwards.