espnet2.asr.layers.fastformer.FastSelfAttention
class espnet2.asr.layers.fastformer.FastSelfAttention(size, attention_heads, dropout_rate)
Bases: Module
Fast self-attention mechanism used in Fastformer.
This class implements the fast self-attention mechanism described in "Fastformer: Additive Attention Can Be All You Need" (Wu et al., 2021). Rather than computing pairwise query-key interactions, it summarizes queries and keys into global vectors with additive attention, which lets the layer compute attention scores and update sequence representations in time linear in the sequence length.
attention_head_size
The size of each attention head.
- Type: int
num_attention_heads
The number of attention heads.
- Type: int
query
Linear layer for query transformation.
- Type: torch.nn.Linear
query_att
Linear layer for query attention scores.
- Type: torch.nn.Linear
key
Linear layer for key transformation.
- Type: torch.nn.Linear
key_att
Linear layer for key attention scores.
- Type: torch.nn.Linear
transform
Linear layer for final transformation.
- Type: torch.nn.Linear
dropout
Dropout layer for regularization.
- Type: torch.nn.Dropout
Parameters:
- size (int) – Total size of the input features.
- attention_heads (int) – Number of attention heads to use.
- dropout_rate (float) – Dropout rate to apply to the outputs.
Raises: ValueError – If size is not an integer multiple of attention_heads.
########### Examples
>>> fast_attention = FastSelfAttention(size=64, attention_heads=8,
... dropout_rate=0.1)
>>> xs_pad = torch.randn(32, 10, 64) # (batch, time, size)
>>> mask = torch.ones(32, 1, 10) # Non-padding mask
>>> output = fast_attention(xs_pad, mask)
>>> print(output.shape)  # torch.Size([32, 10, 64])
####### NOTE The implementation uses query-key-value parameter sharing for computational efficiency, treating the value as equal to the query.
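The snippet below is a simplified, single-head sketch of the additive attention steps summarized above: queries are pooled into one global query with learned additive scores, mixed element-wise into the keys, pooled again into a global key, and combined with the (shared) query vectors. It is illustrative only; the real layer splits size into attention_heads heads, applies the padding mask before each softmax, and interleaves dropout, so details such as the residual placement may differ.

```python
import torch

def additive_attention_sketch(x, query, key, query_att, key_att, transform, attn_dim):
    # Single-head sketch: `size` is not split into heads here.
    q = query(x)                                              # (batch, time, size)
    k = key(x)                                                # (batch, time, size)

    # 1) Pool all query vectors into one global query with additive scores.
    alpha = torch.softmax(query_att(q).squeeze(-1) / attn_dim**0.5, dim=-1)  # (batch, time)
    global_q = torch.einsum("bt,btd->bd", alpha, q).unsqueeze(1)             # (batch, 1, size)

    # 2) Mix the global query into every key element-wise.
    p = k * global_q                                          # (batch, time, size)

    # 3) Pool the mixed keys into one global key the same way.
    beta = torch.softmax(key_att(p).squeeze(-1) / attn_dim**0.5, dim=-1)     # (batch, time)
    global_k = torch.einsum("bt,btd->bd", beta, p).unsqueeze(1)              # (batch, 1, size)

    # 4) The value is shared with the query (see NOTE above); combine, project,
    #    and add a residual on the query (placement approximate).
    return transform(global_k * q) + q                        # (batch, time, size)

# Hypothetical single-head setup (size == attn_dim because there is one head).
size = 64
query, key, transform = (torch.nn.Linear(size, size) for _ in range(3))
query_att, key_att = (torch.nn.Linear(size, 1) for _ in range(2))
x = torch.randn(2, 10, size)
out = additive_attention_sketch(x, query, key, query_att, key_att, transform, attn_dim=size)
print(out.shape)  # torch.Size([2, 10, 64])
```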
Initialize internal Module state, shared by both nn.Module and ScriptModule.
espnet_initialization_fn()
Initializes the weights of the FastSelfAttention module.
This method applies the weight initialization strategy defined in the init_weights method to all submodules of the FastSelfAttention instance. The initialization is done using a normal distribution for the weights and zeros for the biases of linear layers.
####### NOTE This function should be called after the model has been instantiated to ensure that all weights are initialized correctly.
########### Examples
>>> attention_layer = FastSelfAttention(size=128, attention_heads=8,
... dropout_rate=0.1)
>>> attention_layer.espnet_initialization_fn() # Initialize weights
- Raises: ValueError – If the weight initialization process fails or if there are no linear layers in the module.
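A minimal sketch of the initialization strategy described above, assuming a plain per-module initializer applied with Module.apply; the std value here is an illustrative assumption, not necessarily what ESPnet uses:

```python
import torch
from espnet2.asr.layers.fastformer import FastSelfAttention

def init_weights_sketch(module, mean=0.0, std=0.02):
    # Normal weights and zero biases for every Linear submodule, as described
    # above. The std of 0.02 is an assumption for illustration.
    if isinstance(module, torch.nn.Linear):
        module.weight.data.normal_(mean=mean, std=std)
        if module.bias is not None:
            module.bias.data.zero_()

layer = FastSelfAttention(size=64, attention_heads=8, dropout_rate=0.1)
layer.apply(init_weights_sketch)  # roughly what espnet_initialization_fn() does
```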
forward(xs_pad, mask)
Compute the forward pass for the FastSelfAttention layer.
This method performs the forward computation for the FastSelfAttention layer. It takes input embeddings and computes attention weights to produce the output embeddings.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (batch, time, size = n_heads * attn_dim), where ‘batch’ is the number of sequences, ‘time’ is the sequence length, and ‘size’ is the dimensionality of the input embeddings.
- mask (torch.Tensor) – A binary tensor of shape (batch, 1, time), where non-padding positions are represented by 1 and padding positions by 0. This mask is used to ignore padding tokens during attention calculation.
- Returns: Output tensor of shape (batch, time, size), representing the attention-weighted output embeddings.
- Return type: torch.Tensor
########### Examples
>>> model = FastSelfAttention(size=64, attention_heads=8,
... dropout_rate=0.1)
>>> xs_pad = torch.randn(32, 10, 64) # batch of 32, seq_len of 10
>>> mask = torch.ones(32, 1, 10) # no padding
>>> output = model(xs_pad, mask)
>>> output.shape
torch.Size([32, 10, 64])
####### NOTE The attention mechanism used here is based on the Fastformer architecture which leverages additive attention.
- Raises: ValueError – If the input size is not an integer multiple of the number of attention heads.
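For batches with real padding, the (batch, 1, time) non-padding mask can be built from the per-sequence lengths. A small usage sketch (the lengths here are hypothetical):

```python
import torch
from espnet2.asr.layers.fastformer import FastSelfAttention

# Hypothetical variable-length batch: the true length of each sequence.
lengths = torch.tensor([10, 7, 4])
batch, max_time, size = len(lengths), int(lengths.max()), 64

xs_pad = torch.randn(batch, max_time, size)

# Non-padding mask of shape (batch, 1, time): 1/True for real frames, 0/False for padding.
mask = (torch.arange(max_time).unsqueeze(0) < lengths.unsqueeze(1)).unsqueeze(1)

model = FastSelfAttention(size=size, attention_heads=8, dropout_rate=0.1)
output = model(xs_pad, mask)
print(output.shape)  # torch.Size([3, 10, 64])
```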
transpose_for_scores(x)
Reshape and transpose input tensor for attention score computation.
This method reshapes the input tensor x from a shape of (batch, time, size) to (batch, n_heads, time, attn_dim) by splitting the last dimension into the number of attention heads and the size of each attention head. This transformation is essential for computing attention scores across multiple heads.
- Parameters:x (torch.Tensor) – Input tensor of shape (batch, time, size), where size is equal to n_heads * attn_dim.
- Returns: Reshaped tensor of shape (batch, n_heads, time, attn_dim).
- Return type: torch.Tensor
########### Examples
>>> attention = FastSelfAttention(size=64, attention_heads=4,
... dropout_rate=0.1)
>>> x = torch.randn(2, 10, 64) # (batch, time, size)
>>> transposed_x = attention.transpose_for_scores(x)
>>> transposed_x.shape
torch.Size([2, 4, 10, 16]) # (batch, n_heads, time, attn_dim)
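Equivalently, the reshape can be written out directly with view and transpose; a minimal sketch of what the method does, using the same shapes as the example above:

```python
import torch

batch, time, n_heads, attn_dim = 2, 10, 4, 16
x = torch.randn(batch, time, n_heads * attn_dim)   # (batch, time, size)

y = x.view(batch, time, n_heads, attn_dim)          # split the last dimension per head
y = y.transpose(1, 2)                                # (batch, n_heads, time, attn_dim)
print(y.shape)  # torch.Size([2, 4, 10, 16])
```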