espnet2.asr.encoder.conformer_encoder.ConformerEncoder
class espnet2.asr.encoder.conformer_encoder.ConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str | None = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, ctc_trim: bool = False, stochastic_depth_rate: float | List[float] = 0.0, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, qk_norm: bool = False, use_flash_attn: bool = True)
Bases: AbsEncoder
Conformer encoder module for automatic speech recognition.
This class implements the Conformer encoder, which combines convolutional neural networks and self-attention mechanisms to process sequential data such as speech. It is designed to capture both local and global dependencies in the input data effectively.
output_size
Dimension of the output from the encoder.
- Type: int
embed
Input layer module for feature extraction.
- Type: torch.nn.Module
normalize_before
Flag indicating if layer normalization is applied before the first block.
- Type: bool
encoders
List of encoder layers comprising the main processing stack.
- Type: List[EncoderLayer]
after_norm
Layer normalization applied after the encoder stack (used when normalize_before is True).
- Type: LayerNorm
interctc_layer_idx
Indices of layers for intermediate CTC outputs.
- Type: List[int]
interctc_use_conditioning
Flag to indicate if conditioning on CTC outputs is used.
- Type: bool
conditioning_layer
Conditioning layer for intermediate CTC outputs.
- Type: Optional[torch.nn.Module]
ctc_trim
Flag indicating if CTC trimming is applied.
- Type: bool
Parameters:
- input_size (int) – Input dimension.
- output_size (int) – Dimension of attention (default: 256).
- attention_heads (int) – Number of heads in multi-head attention (default: 4).
- linear_units (int) – Number of units in position-wise feed-forward layers (default: 2048).
- num_blocks (int) – Number of encoder blocks (default: 6).
- dropout_rate (float) – Dropout rate for regularization (default: 0.1).
- positional_dropout_rate (float) – Dropout rate after positional encoding (default: 0.1).
- attention_dropout_rate (float) – Dropout rate in attention layers (default: 0.0).
- input_layer (Union[str, torch.nn.Module]) – Type of input layer (default: “conv2d”).
- normalize_before (bool) – Whether to use layer normalization before the first block (default: True).
- concat_after (bool) – Whether to concatenate input and output of the attention layer (default: False).
- positionwise_layer_type (str) – Type of position-wise layer (“linear”, “conv1d”, or “conv1d-linear”, default: “linear”).
- positionwise_conv_kernel_size (int) – Kernel size for position-wise convolution (default: 3).
- rel_pos_type (str) – Type of relative positional encoding (“legacy” or “latest”, default: “legacy”).
- pos_enc_layer_type (str) – Type of positional encoding layer (default: “rel_pos”).
- selfattention_layer_type (str) – Type of self-attention layer (default: “rel_selfattn”).
- activation_type (str) – Activation function type (default: “swish”).
- macaron_style (bool) – Whether to use Macaron style for position-wise layers (default: False).
- use_cnn_module (bool) – Whether to include convolutional modules (default: True).
- zero_triu (bool) – Whether to zero the upper triangular part of the attention matrix (default: False).
- cnn_module_kernel (int) – Kernel size for convolution modules (default: 31).
- padding_idx (int) – Padding index for embedding layers (default: -1).
- interctc_layer_idx (List[int]) – Indices of layers for intermediate CTC outputs (default: []).
- interctc_use_conditioning (bool) – Flag to use conditioning on CTC outputs (default: False).
- ctc_trim (bool) – Flag to enable CTC trimming (default: False).
- stochastic_depth_rate (Union[float, List[float]]) – Rate for stochastic depth (default: 0.0).
- layer_drop_rate (float) – Dropout rate for layers (default: 0.0).
- max_pos_emb_len (int) – Maximum length for positional embeddings (default: 5000).
- qk_norm (bool) – Flag to apply normalization on query-key pairs (default: False).
- use_flash_attn (bool) – Flag to use Flash Attention (default: True).
#### Examples
>>> import torch
>>> encoder = ConformerEncoder(input_size=80, output_size=256)
>>> xs_pad = torch.randn(32, 100, 80) # Batch of 32, 100 time steps, 80 features
>>> ilens = torch.tensor([100] * 32) # All inputs are of length 100
>>> output, olens, _ = encoder(xs_pad, ilens)
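When interctc_layer_idx is non-empty, forward additionally returns the hidden states of the listed blocks for intermediate CTC losses, nested with the final output. A minimal sketch, reusing the tensors from the example above:

>>> encoder = ConformerEncoder(input_size=80, output_size=256,
...                            interctc_layer_idx=[3])
>>> (output, intermediate_outs), olens, _ = encoder(xs_pad, ilens)
>>> layer_idx, hidden = intermediate_outs[0]  # hidden states after block 3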
- Raises:
- ValueError – If an unknown rel_pos_type or pos_enc_layer_type is provided.
- TooShortUttError – If the input sequence length is shorter than the required length for subsampling.
#### NOTE
This implementation supports several input-layer types (e.g., conv2d subsampling, linear projection, embedding) and encoder-block options (Macaron-style feed-forward, convolution modules, relative positional self-attention) that can be combined to suit different kinds of input data.
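For example, relative positional self-attention can be selected explicitly; with the default rel_pos_type="legacy", the "rel_pos" and "rel_selfattn" names are mapped to their legacy variants, whereas rel_pos_type="latest" uses them as given. A hedged configuration sketch:

>>> encoder = ConformerEncoder(
...     input_size=80,
...     rel_pos_type="latest",
...     pos_enc_layer_type="rel_pos",
...     selfattention_layer_type="rel_selfattn",
...     macaron_style=True,
...     use_cnn_module=True,
... )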
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None, ctc: CTC | None = None, return_all_hs: bool = False) → Tuple[Tensor, Tensor, Tensor | None]
Calculate forward propagation through the Conformer encoder.
This method performs the forward pass for the Conformer encoder, which includes embedding the input, applying several encoder layers, and optionally returning all hidden states.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (#batch, L, input_size).
- ilens (torch.Tensor) – Input lengths of shape (#batch).
- prev_states (torch.Tensor , optional) – Not currently used. Defaults to None.
- ctc (CTC , optional) – CTC module for intermediate CTC loss. Defaults to None.
- return_all_hs (bool , optional) – Flag to indicate if all hidden states should be returned. Defaults to False.
- Returns:
- Output tensor of shape (#batch, L, output_size).
- Output lengths of shape (#batch).
- Optional tensor, not currently used (None).
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises:TooShortUttError – If the input sequence length is too short for subsampling layers.
#### Examples
>>> import torch
>>> encoder = ConformerEncoder(input_size=80)
>>> xs_pad = torch.randn(32, 100, 80) # Batch of 32, 100 time steps
>>> ilens = torch.full((32,), 100) # All sequences are of length 100
>>> output, olens, _ = encoder.forward(xs_pad, ilens)
>>> print(output.shape) # Output shape will be (#batch, L, output_size)
>>> print(olens.shape) # Output lengths shape will be (#batch)
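Because the conv2d front end subsamples the time axis, utterances shorter than its receptive field raise TooShortUttError. A defensive sketch (the exception is defined in ESPnet's subsampling module):

>>> from espnet.nets.pytorch_backend.transformer.subsampling import TooShortUttError
>>> try:
...     output, olens, _ = encoder(xs_pad, ilens)
... except TooShortUttError:
...     pass  # skip utterances too short for the subsampling front end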
#### NOTE
The prev_states argument and the third element of the returned tuple are placeholders for interface compatibility and are not currently used.
output_size() → int
Get the output size of the encoder.
The output size is set at initialization and corresponds to the dimension of the attention mechanism.
- Returns: The output size of the encoder.
- Return type: int
#### Examples
>>> encoder = ConformerEncoder(input_size=128, output_size=256)
>>> encoder.output_size()
256
#### NOTE
This method is primarily used to obtain the output size when constructing subsequent layers or operations in a neural network pipeline.
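A typical use is sizing a downstream projection; a short sketch with a hypothetical vocab_size:

>>> import torch
>>> vocab_size = 5000  # hypothetical vocabulary size
>>> proj = torch.nn.Linear(encoder.output_size(), vocab_size)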