espnet2.asr.encoder.longformer_encoder.LongformerEncoder
class espnet2.asr.encoder.longformer_encoder.LongformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'abs_pos', selfattention_layer_type: str = 'lf_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, attention_windows: list = [100, 100, 100, 100, 100, 100], attention_dilation: list = [1, 1, 1, 1, 1, 1], attention_mode: str = 'sliding_chunks')
Bases: ConformerEncoder
Longformer Self-Attention Conformer Encoder Module.
This class implements a Longformer-based Conformer encoder for automatic speech recognition (ASR). It replaces full self-attention with Longformer's windowed (and optionally dilated) self-attention, which lets it process long input sequences efficiently.
_output_size
The output dimension of the encoder.
- Type: int
embed
The embedding layer that processes input.
- Type: torch.nn.Module
normalize_before
Flag indicating if normalization is applied before the first block.
- Type: bool
encoders
A list of encoder layers.
- Type: List[EncoderLayer]
after_norm
Layer normalization applied after encoding if normalize_before is True.
- Type: LayerNorm
interctc_layer_idx
Indices of layers used for intermediate CTC loss.
- Type: List[int]
interctc_use_conditioning
Flag indicating if conditioning is used for intermediate CTC outputs.
- Type: bool
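These attributes are ordinary module members and can be inspected after construction; a minimal sketch (assuming the default configuration, in which num_blocks is 6 and output_size is 256):
>>> encoder = LongformerEncoder(input_size=80)
>>> encoder.output_size()  # backed by _output_size
256
>>> len(encoder.encoders)  # one encoder layer per block
6
>>> encoder.normalize_before
True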
Parameters:
- input_size (int) – Input dimension.
- output_size (int) – Dimension of attention. Default is 256.
- attention_heads (int) – The number of heads in multi-head attention. Default is 4.
- linear_units (int) – Number of units in position-wise feed forward. Default is 2048.
- num_blocks (int) – Number of encoder blocks. Default is 6.
- dropout_rate (float) – Dropout rate. Default is 0.1.
- positional_dropout_rate (float) – Dropout rate for positional encoding. Default is 0.1.
- attention_dropout_rate (float) – Dropout rate in attention. Default is 0.0.
- input_layer (Union[str, torch.nn.Module]) – Type of input layer. Default is “conv2d”.
- normalize_before (bool) – Whether to apply layer normalization before the first block. Default is True.
- concat_after (bool) – Whether to concatenate input and output of attention layers. Default is False.
- positionwise_layer_type (str) – Type of position-wise layer. Default is “linear”.
- positionwise_conv_kernel_size (int) – Kernel size for position-wise conv1d. Default is 3.
- rel_pos_type (str) – Type of relative positional encoding. Default is “legacy”.
- pos_enc_layer_type (str) – Encoder positional encoding layer type. Default is “abs_pos”.
- selfattention_layer_type (str) – Encoder attention layer type. Default is “lf_selfattn”.
- activation_type (str) – Activation function type. Default is “swish”.
- macaron_style (bool) – Whether to use Macaron style for position-wise layers. Default is False.
- use_cnn_module (bool) – Whether to use a convolution module. Default is True.
- zero_triu (bool) – Whether to zero the upper triangular part of the attention matrix. Default is False.
- cnn_module_kernel (int) – Kernel size of the convolution module. Default is 31.
- padding_idx (int) – Padding index for embedding layer. Default is -1.
- attention_windows (list) – Layer-wise attention window sizes for Longformer self-attention. Must contain num_blocks entries. Default is [100, 100, 100, 100, 100, 100].
- attention_dilation (list) – Layer-wise attention dilation sizes for Longformer self-attention. Must contain num_blocks entries. Default is [1, 1, 1, 1, 1, 1].
- attention_mode (str) – Implementation mode for Longformer self-attention. Default is “sliding_chunks”. See https://github.com/allenai/longformer for details.
Raises: ValueError – If attention_windows or attention_dilation does not contain num_blocks entries, or if another parameter value is invalid.
######### Examples
>>> import torch
>>> from espnet2.asr.encoder.longformer_encoder import LongformerEncoder
>>> encoder = LongformerEncoder(
... input_size=80,
... output_size=256,
... attention_heads=4,
... linear_units=2048,
... num_blocks=6,
... dropout_rate=0.1,
... attention_windows=[100]*6,
... attention_dilation=[1]*6,
... attention_mode="sliding_chunks"
... )
>>> xs_pad = torch.randn(32, 100, 80) # Example input tensor
>>> ilens = torch.tensor([100]*32) # Example input lengths
>>> output, olens, _ = encoder(xs_pad, ilens)
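The layer-wise lists must line up with num_blocks; a hedged sketch of the mismatch case (assuming, per the Raises note above, that the constructor validates the list lengths):
>>> try:
...     LongformerEncoder(
...         input_size=80,
...         num_blocks=6,
...         attention_windows=[100] * 4,  # 4 entries for 6 blocks
...         attention_dilation=[1] * 6,
...     )
... except ValueError as err:
...     print("invalid configuration:", err)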
NOTE
The Longformer architecture is particularly effective for handling long sequences due to its efficient attention mechanism.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None, ctc: CTC | None = None, return_all_hs: bool = False) → Tuple[Tensor, Tensor, Tensor | None]
Calculate forward propagation through the Longformer encoder.
This method processes the input tensor through the Longformer encoder layers and returns the output tensor along with the output lengths and optional intermediate hidden states.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (#batch, L, input_size).
- ilens (torch.Tensor) – Tensor of input lengths with shape (#batch).
- prev_states (torch.Tensor) – Previous states (not used currently).
- ctc (CTC) – CTC module for intermediate CTC loss computation.
- return_all_hs (bool) – Flag indicating whether to return all hidden states.
- Returns:
- Output tensor of shape (#batch, L, output_size).
- Output lengths tensor of shape (#batch).
- Optional tensor (currently not used).
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises: TooShortUttError – If the input sequence is too short for the subsampling layer.
######### Examples
>>> import torch
>>> model = LongformerEncoder(input_size=128, output_size=256)
>>> xs_pad = torch.randn(32, 50, 128) # 32 batches, 50 length, 128 features
>>> ilens = torch.tensor([50] * 32) # All sequences are of length 50
>>> output, olens, _ = model.forward(xs_pad, ilens)
NOTE
The prev_states parameter is reserved for future use and currently does not influence the forward pass.
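When interctc_layer_idx is non-empty at construction time, intermediate hidden states are collected during the forward pass; a hedged sketch (assuming, as in other ESPnet encoders, that the first return value then becomes a tuple of the final output and a list of (layer_index, hidden_state) pairs):
>>> encoder = LongformerEncoder(input_size=80, output_size=256, interctc_layer_idx=[3])
>>> xs_pad = torch.randn(8, 200, 80)
>>> ilens = torch.full((8,), 200)
>>> (out, intermediate_outs), olens, _ = encoder(xs_pad, ilens)
>>> layer_idx, hidden = intermediate_outs[0]  # hidden has shape (8, T', 256)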
output_size() → int
Get the output size of the Longformer encoder.
This method returns the output dimension of the Longformer encoder, which is defined during the initialization of the encoder. The output size is crucial for ensuring that the subsequent layers in the model receive the correct input dimensions.
- Returns: The output dimension of the encoder.
- Return type: int
######### Examples
>>> encoder = LongformerEncoder(input_size=512, output_size=256)
>>> encoder.output_size()
256