espnet2.asr.encoder.e_branchformer_encoder.EBranchformerEncoder
class espnet2.asr.encoder.e_branchformer_encoder.EBranchformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str | None = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, use_ffn: bool = False, macaron_ffn: bool = False, ffn_activation_type: str = 'swish', linear_units: int = 2048, positionwise_layer_type: str = 'linear', merge_conv_kernel: int = 3, interctc_layer_idx=None, interctc_use_conditioning: bool = False, qk_norm: bool = False, use_flash_attn: bool = True)
Bases: AbsEncoder
E-Branchformer encoder module for speech recognition.
This module implements the E-Branchformer architecture described in “E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition” (SLT 2022). Each encoder layer combines a multi-headed self-attention branch with a convolutional gating MLP (cgMLP) branch and merges the two; stacking num_blocks such layers encodes the input speech features.
_output_size
The dimensionality of the output features.
- Type: int
embed
The embedding layer for input features.
- Type: torch.nn.Module
encoders
A list of EBranchformerEncoderLayer instances that make up the encoder.
- Type: torch.nn.ModuleList
after_norm
Layer normalization applied to the final output.
- Type: LayerNorm
interctc_layer_idx
Indices of layers where intermediate CTC outputs are calculated.
- Type: list
interctc_use_conditioning
Flag indicating whether to use conditioning for CTC outputs.
- Type: bool
Parameters:
- input_size (int) – The dimensionality of input features.
- output_size (int , optional) – The dimensionality of output features. Defaults to 256.
- attention_heads (int , optional) – The number of attention heads. Defaults to 4.
- attention_layer_type (str , optional) – Type of attention layer to use. Options include ‘selfattn’, ‘rel_selfattn’, etc. Defaults to ‘rel_selfattn’.
- pos_enc_layer_type (str , optional) – Type of positional encoding. Defaults to ‘rel_pos’.
- rel_pos_type (str , optional) – Type of relative positional encoding. Defaults to ‘latest’.
- cgmlp_linear_units (int , optional) – The number of linear units in the Convolutional Gating MLP. Defaults to 2048.
- cgmlp_conv_kernel (int , optional) – Kernel size for the convolutional layers in CGMLP. Defaults to 31.
- use_linear_after_conv (bool , optional) – Whether to apply a linear transformation after convolution in CGMLP. Defaults to False.
- gate_activation (str , optional) – Activation function used in gating. Defaults to ‘identity’.
- num_blocks (int , optional) – Number of encoder blocks. Defaults to 12.
- dropout_rate (float , optional) – Dropout probability. Defaults to 0.1.
- positional_dropout_rate (float , optional) – Dropout probability for positional encodings. Defaults to 0.1.
- attention_dropout_rate (float , optional) – Dropout probability for attention. Defaults to 0.0.
- input_layer (str or None , optional) – Type of input layer. Options include ‘conv2d’, ‘linear’, etc. Defaults to ‘conv2d’.
- zero_triu (bool , optional) – Whether to zero out the upper triangular part of the attention matrix. Defaults to False.
- padding_idx (int , optional) – Padding index for embedding layers. Defaults to -1.
- layer_drop_rate (float , optional) – Dropout rate for individual layers. Defaults to 0.0.
- max_pos_emb_len (int , optional) – Maximum length for positional embeddings. Defaults to 5000.
- use_ffn (bool , optional) – Whether to use feed-forward networks. Defaults to False.
- macaron_ffn (bool , optional) – Whether to use macaron-style feed-forward networks. Defaults to False.
- ffn_activation_type (str , optional) – Activation function for feed-forward networks. Defaults to ‘swish’.
- linear_units (int , optional) – Number of linear units in the feed-forward networks. Defaults to 2048.
- positionwise_layer_type (str , optional) – Type of positionwise layer. Defaults to ‘linear’.
- merge_conv_kernel (int , optional) – Kernel size for merging convolutional layers. Defaults to 3.
- interctc_layer_idx (list , optional) – Indices for intermediate CTC layers. Defaults to None.
- interctc_use_conditioning (bool , optional) – Whether to use conditioning for intermediate CTC. Defaults to False.
- qk_norm (bool , optional) – Whether to use normalization for query-key vectors. Defaults to False.
- use_flash_attn (bool , optional) – Whether to use flash attention. Defaults to True.
Returns: None
######### Examples
>>> encoder = EBranchformerEncoder(input_size=80, output_size=256)
>>> xs_pad = torch.randn(32, 100, 80) # (batch_size, seq_length, input_size)
>>> ilens = torch.randint(1, 100, (32,)) # (batch_size)
>>> output, olens, _ = encoder(xs_pad, ilens)
>>> print(output.shape)  # (batch, subsampled_time, 256); the default 'conv2d' input layer subsamples time by ~4x
NOTE
The implementation includes various input layer types and attention mechanisms, providing flexibility in configuring the encoder for different tasks and datasets.
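For example, a configuration that enables the macaron-style feed-forward branches looks like the following (a sketch; the hyperparameter values here are illustrative, not recipe defaults):
>>> encoder = EBranchformerEncoder(
...     input_size=80,
...     output_size=256,
...     attention_heads=4,
...     cgmlp_linear_units=1024,
...     cgmlp_conv_kernel=31,
...     use_ffn=True,
...     macaron_ffn=True,
...     linear_units=1024,
...     merge_conv_kernel=31,
... )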
- Raises: ValueError – If an unknown type is provided for the positional encoding or attention layer.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None, ctc: CTC | None = None, max_layer: int | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Compute encoded features.
This method embeds the padded input features, passes them through the stack of E-Branchformer encoder layers, and applies the final layer normalization. When interctc_layer_idx is non-empty, intermediate CTC outputs are collected from the specified layers and returned together with the final output.
- Parameters:
  - xs_pad (torch.Tensor) – Padded input tensor of shape (#batch, time, input_size).
  - ilens (torch.Tensor) – Input lengths of shape (#batch,).
  - prev_states (torch.Tensor , optional) – Not used in the current implementation. Defaults to None.
  - ctc (CTC , optional) – CTC module used for intermediate CTC loss and self-conditioned CTC. Defaults to None.
  - max_layer (int , optional) – If given, encoding stops after the layer with this index. Defaults to None.
- Returns: Output tensor of shape (#batch, subsampled_time, output_size), output lengths of shape (#batch,), and None (reserved for encoder states). When intermediate CTC is enabled, the first element is a tuple of the final output and a list of (layer_idx, output) pairs.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises: TooShortUttError – If the input is too short for the subsampling performed by the chosen input layer.
######### Examples
>>> encoder = EBranchformerEncoder(input_size=80, output_size=256)
>>> xs_pad = torch.randn(4, 100, 80)  # (batch, time, input_size)
>>> ilens = torch.tensor([100, 98, 80, 60])  # (batch)
>>> output, olens, _ = encoder(xs_pad, ilens)
NOTE
Ensure that xs_pad and ilens are consistent (no length may exceed the padded time dimension) to avoid dimension mismatch errors.
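When intermediate CTC is enabled at construction time, the first return value also carries the per-layer outputs. A minimal sketch, assuming ctc is an espnet2.asr.ctc.CTC module built elsewhere and following ESPnet's intermediate-CTC return convention:
>>> encoder = EBranchformerEncoder(
...     input_size=80, output_size=256,
...     interctc_layer_idx=[6], interctc_use_conditioning=True)
>>> (output, intermediate_outs), olens, _ = encoder(xs_pad, ilens, ctc=ctc)
>>> layer_idx, layer_out = intermediate_outs[0]  # output taken after block 6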
output_size() → int
Retrieve the output size of the E-Branchformer encoder.
This method returns the size of the output features produced by the encoder, which is essential for configuring downstream tasks such as classification or sequence generation.
- Returns: The output size of the encoder.
- Return type: int
######### Examples
>>> encoder = EBranchformerEncoder(input_size=128, output_size=256)
>>> encoder.output_size()
256
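The returned size is typically used to build downstream layers, for example a CTC projection (vocab_size below is a hypothetical value):
>>> vocab_size = 500  # hypothetical vocabulary size
>>> ctc_head = torch.nn.Linear(encoder.output_size(), vocab_size)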