espnet2.asr.encoder.e_branchformer_ctc_encoder.EBranchformerCTCEncoder
class espnet2.asr.encoder.e_branchformer_ctc_encoder.EBranchformerCTCEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str | None = 'conv2d8', zero_triu: bool = False, padding_idx: int = -1, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, use_ffn: bool = False, macaron_ffn: bool = False, ffn_activation_type: str = 'swish', linear_units: int = 2048, positionwise_layer_type: str = 'linear', merge_conv_kernel: int = 3, interctc_layer_idx=None, interctc_use_conditioning: bool = False, use_cross_attention=True, use_flash_attn: bool = False)
Bases: AbsEncoder
E-Branchformer encoder module.
This module implements the E-Branchformer encoder, which extends the original E-Branchformer with support for additional cross-attention modules and extra prefix tokens for language and task conditioning.
_output_size
The output size of the encoder.
Type: int
Parameters:
- input_size (int) – The dimensionality of the input features.
- output_size (int , optional) – The dimensionality of the output features. Defaults to 256.
- attention_heads (int , optional) – The number of attention heads. Defaults to 4.
- attention_layer_type (str , optional) – The type of attention layer to use. Defaults to “rel_selfattn”.
- pos_enc_layer_type (str , optional) – The type of positional encoding. Defaults to “rel_pos”.
- rel_pos_type (str , optional) – The type of relative position encoding. Defaults to “latest”.
- cgmlp_linear_units (int , optional) – The number of linear units in CG-MLP. Defaults to 2048.
- cgmlp_conv_kernel (int , optional) – The convolutional kernel size in CG-MLP. Defaults to 31.
- use_linear_after_conv (bool , optional) – Whether to use a linear layer after the convolution. Defaults to False.
- gate_activation (str , optional) – The activation function for gating. Defaults to “identity”.
- num_blocks (int , optional) – The number of encoder blocks. Defaults to 12.
- dropout_rate (float , optional) – The dropout rate for the encoder. Defaults to 0.1.
- positional_dropout_rate (float , optional) – The dropout rate for positional encodings. Defaults to 0.1.
- attention_dropout_rate (float , optional) – The dropout rate for attention. Defaults to 0.0.
- input_layer (str or None , optional) – The type of input layer. Defaults to “conv2d8”.
- zero_triu (bool , optional) – Whether to zero out the upper triangular part of the attention matrix. Defaults to False.
- padding_idx (int , optional) – The index used for padding in embeddings. Defaults to -1.
- layer_drop_rate (float , optional) – The dropout rate for layers. Defaults to 0.0.
- max_pos_emb_len (int , optional) – The maximum length for positional embeddings. Defaults to 5000.
- use_ffn (bool , optional) – Whether to use feed-forward networks. Defaults to False.
- macaron_ffn (bool , optional) – Whether to use macaron-style feed-forward networks. Defaults to False.
- ffn_activation_type (str , optional) – The activation function for feed-forward networks. Defaults to “swish”.
- linear_units (int , optional) – The number of linear units in feed-forward networks. Defaults to 2048.
- positionwise_layer_type (str , optional) – The type of position-wise layer. Defaults to “linear”.
- merge_conv_kernel (int , optional) – The kernel size for merging convolutions. Defaults to 3.
- interctc_layer_idx (list , optional) – Indices of layers where intermediate CTC is applied. Defaults to None.
- interctc_use_conditioning (bool , optional) – Whether to use conditioning for intermediate CTC. Defaults to False.
- use_cross_attention (bool or list of bool , optional) – Whether to use cross attention. Defaults to True.
- use_flash_attn (bool , optional) – Whether to use flash attention. Defaults to False.
######### Examples
>>> import torch
>>> encoder = EBranchformerCTCEncoder(input_size=80)
>>> xs_pad = torch.randn(32, 100, 80)  # (batch, length, input_size)
>>> ilens = torch.tensor([100] * 32)  # (batch)
>>> output, olens, _ = encoder(xs_pad, ilens)
>>> print(output.shape)  # (32, T', 256), where T' is roughly 100 / 8 after conv2d8 subsampling
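A fuller configuration can enable the optional macaron-style feed-forward branches and intermediate CTC. The layer indices below are illustrative choices, not recommended values:
>>> encoder = EBranchformerCTCEncoder(
...     input_size=80,
...     output_size=256,
...     use_ffn=True,
...     macaron_ffn=True,
...     interctc_layer_idx=[3, 6, 9],
...     interctc_use_conditioning=True,
... )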
NOTE
Ensure that input_size matches the feature dimensionality of the input, i.e. the last dimension of xs_pad.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None, ctc: CTC | None = None, max_layer: int | None = None, prefix_embeds: Tensor | None = None, memory=None, memory_mask=None) → Tuple[Tensor, Tensor, Tensor | None]
Calculate forward propagation.
This method computes the forward pass of the E-Branchformer CTC encoder, processing the input tensor through multiple encoder layers and applying any specified cross-attention mechanisms. The method handles padding, applies dropout, and manages various input configurations.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (#batch, L, input_size).
- ilens (torch.Tensor) – Input lengths of shape (#batch).
- prev_states (torch.Tensor , optional) – Not currently used.
- ctc (CTC , optional) – Intermediate CTC module for connectionist temporal classification.
- max_layer (int , optional) – Maximum layer depth below which InterCTC is applied.
- prefix_embeds (torch.Tensor , optional) – Additional embeddings for input conditioning, shape (batch, 2, output_size).
- memory (torch.Tensor , optional) – Memory tensor for cross-attention, if applicable.
- memory_mask (torch.Tensor , optional) – Mask for the memory tensor.
- Returns: A tuple containing:
- Output tensor of shape (#batch, L, output_size).
- Output lengths of shape (#batch).
- Placeholder for encoder states, currently always None.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises: TooShortUttError – If the input tensor is too short for the chosen subsampling layer.
######### Examples
>>> import torch
>>> encoder = EBranchformerCTCEncoder(input_size=256, output_size=512)
>>> input_tensor = torch.randn(8, 100, 256)  # batch of 8, 100 timesteps
>>> input_lengths = torch.tensor([100] * 8)  # all inputs are 100 frames long
>>> output, output_lengths, _ = encoder(input_tensor, input_lengths)
>>> # output_lengths reflect the default conv2d8 subsampling
NOTE
The method supports multiple input configurations, including prefix embeddings for enhanced language and task conditioning.
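When interctc_layer_idx is non-empty and a ctc module is supplied, ESPnet encoders conventionally bundle the intermediate predictions with the final output. A minimal sketch of unpacking that structure, assuming this encoder follows the same convention and was built with a non-empty interctc_layer_idx (ctc_module stands in for an espnet2.asr.ctc.CTC instance):
>>> out, olens, _ = encoder(input_tensor, input_lengths, ctc=ctc_module)
>>> if isinstance(out, tuple):
...     final_out, intermediate_outs = out  # intermediate_outs: [(layer_idx, tensor), ...]
...     for layer_idx, inter_out in intermediate_outs:
...         print(layer_idx, inter_out.shape)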
output_size() → int
Get the output size of the encoder.
This method returns the size of the output tensor produced by the encoder. The value is fixed at initialization by the output_size argument and corresponds to the feature dimension of the encoder output.
- Returns: The output size of the encoder.
- Return type: int
######### Examples
>>> encoder = EBranchformerCTCEncoder(input_size=128, output_size=256)
>>> encoder.output_size()
256
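The reported size can be used to dimension downstream modules; a minimal sketch, where the 500-class projection head is purely illustrative:
>>> import torch
>>> proj = torch.nn.Linear(encoder.output_size(), 500)  # illustrative output head
>>> logits = proj(torch.randn(8, 50, encoder.output_size()))
>>> logits.shape
torch.Size([8, 50, 500])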