espnet2.asr.layers.fastformer.FastSelfAttention
class espnet2.asr.layers.fastformer.FastSelfAttention(size, attention_heads, dropout_rate)
Bases: Module
Fast self-attention mechanism used in Fastformer.
This class implements the fast self-attention mechanism described in "Fastformer: Additive Attention Can Be All You Need" (Wu et al., 2021). Rather than computing pairwise query-key interactions, it summarizes queries and keys into global vectors with additive attention, which lets the layer compute attention scores and update sequence representations in time linear in the sequence length.
attention_head_size
The size of each attention head.
- Type: int
num_attention_heads
The number of attention heads.
- Type: int
query
Linear layer for query transformation.
- Type: torch.nn.Linear
query_att
Linear layer for query attention scores.
- Type: torch.nn.Linear
key
Linear layer for key transformation.
- Type: torch.nn.Linear
key_att
Linear layer for key attention scores.
- Type: torch.nn.Linear
transform
Linear layer for final transformation.
- Type: torch.nn.Linear
dropout
Dropout layer for regularization.
- Type: torch.nn.Dropout
Parameters:
- size (int) – Total size of the input features.
- attention_heads (int) – Number of attention heads to use.
- dropout_rate (float) – Dropout rate to apply to the outputs.
Raises: ValueError – If size is not an integer multiple of attention_heads.
########### Examples
>>> fast_attention = FastSelfAttention(size=64, attention_heads=8,
... dropout_rate=0.1)
>>> xs_pad = torch.randn(32, 10, 64) # (batch, time, size)
>>> mask = torch.ones(32, 1, 10) # Non-padding mask
>>> output = fast_attention(xs_pad, mask)
>>> print(output.shape)  # torch.Size([32, 10, 64])
####### NOTE The implementation uses query-key-value parameter sharing for computational efficiency, treating the value as equal to the query.
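The snippet below is a simplified, single-head sketch of the additive attention steps summarized above: queries are pooled into one global query with learned additive scores, mixed element-wise into the keys, pooled again into a global key, and combined with the (shared) query vectors. It is illustrative only; the real layer splits size into attention_heads heads, applies the padding mask before each softmax, and interleaves dropout, so details such as the residual placement may differ.

```python
import torch

def additive_attention_sketch(x, query, key, query_att, key_att, transform, attn_dim):
    # Single-head sketch: `size` is not split into heads here.
    q = query(x)                                              # (batch, time, size)
    k = key(x)                                                # (batch, time, size)

    # 1) Pool all query vectors into one global query with additive scores.
    alpha = torch.softmax(query_att(q).squeeze(-1) / attn_dim**0.5, dim=-1)  # (batch, time)
    global_q = torch.einsum("bt,btd->bd", alpha, q).unsqueeze(1)             # (batch, 1, size)

    # 2) Mix the global query into every key element-wise.
    p = k * global_q                                          # (batch, time, size)

    # 3) Pool the mixed keys into one global key the same way.
    beta = torch.softmax(key_att(p).squeeze(-1) / attn_dim**0.5, dim=-1)     # (batch, time)
    global_k = torch.einsum("bt,btd->bd", beta, p).unsqueeze(1)              # (batch, 1, size)

    # 4) The value is shared with the query (see NOTE above); combine, project,
    #    and add a residual on the query (placement approximate).
    return transform(global_k * q) + q                        # (batch, time, size)

# Hypothetical single-head setup (size == attn_dim because there is one head).
size = 64
query, key, transform = (torch.nn.Linear(size, size) for _ in range(3))
query_att, key_att = (torch.nn.Linear(size, 1) for _ in range(2))
x = torch.randn(2, 10, size)
out = additive_attention_sketch(x, query, key, query_att, key_att, transform, attn_dim=size)
print(out.shape)  # torch.Size([2, 10, 64])
```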
Initialize internal Module state, shared by both nn.Module and ScriptModule.
espnet_initialization_fn()
Initializes the weights of the FastSelfAttention module.
This method applies the weight initialization strategy defined in the init_weights method to all submodules of the FastSelfAttention instance. The initialization is done using a normal distribution for the weights and zeros for the biases of linear layers.
####### NOTE This function should be called after the model has been instantiated to ensure that all weights are initialized correctly.
########### Examples
>>> attention_layer = FastSelfAttention(size=128, attention_heads=8,
... dropout_rate=0.1)
>>> attention_layer.espnet_initialization_fn() # Initialize weights
- Raises: ValueError – If the weight initialization process fails or if there are no linear layers in the module.
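A minimal sketch of the initialization strategy described above, assuming a plain per-module initializer applied with Module.apply; the std value here is an illustrative assumption, not necessarily what ESPnet uses:

```python
import torch
from espnet2.asr.layers.fastformer import FastSelfAttention

def init_weights_sketch(module, mean=0.0, std=0.02):
    # Normal weights and zero biases for every Linear submodule, as described
    # above. The std of 0.02 is an assumption for illustration.
    if isinstance(module, torch.nn.Linear):
        module.weight.data.normal_(mean=mean, std=std)
        if module.bias is not None:
            module.bias.data.zero_()

layer = FastSelfAttention(size=64, attention_heads=8, dropout_rate=0.1)
layer.apply(init_weights_sketch)  # roughly what espnet_initialization_fn() does
```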
forward(xs_pad, mask)
Compute the forward pass for the FastSelfAttention layer.
This method performs the forward computation for the FastSelfAttention layer. It takes input embeddings and computes attention weights to produce the output embeddings.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (batch, time, size = n_heads * attn_dim), where ‘batch’ is the number of sequences, ‘time’ is the sequence length, and ‘size’ is the dimensionality of the input embeddings.
- mask (torch.Tensor) – A binary tensor of shape (batch, 1, time), where non-padding positions are represented by 1 and padding positions by 0. This mask is used to ignore padding tokens during attention calculation.
- Returns: Output tensor of shape (batch, time, size), representing the attention-weighted output embeddings.
- Return type: torch.Tensor
########### Examples
>>> model = FastSelfAttention(size=64, attention_heads=8,
... dropout_rate=0.1)
>>> xs_pad = torch.randn(32, 10, 64) # batch of 32, seq_len of 10
>>> mask = torch.ones(32, 1, 10) # no padding
>>> output = model(xs_pad, mask)
>>> output.shape
torch.Size([32, 10, 64])
####### NOTE The attention mechanism used here is based on the Fastformer architecture which leverages additive attention.
- Raises: ValueError – If the input size is not an integer multiple of the number of attention heads.
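For batches with real padding, the (batch, 1, time) non-padding mask can be built from the per-sequence lengths. A small usage sketch (the lengths here are hypothetical):

```python
import torch
from espnet2.asr.layers.fastformer import FastSelfAttention

# Hypothetical variable-length batch: the true length of each sequence.
lengths = torch.tensor([10, 7, 4])
batch, max_time, size = len(lengths), int(lengths.max()), 64

xs_pad = torch.randn(batch, max_time, size)

# Non-padding mask of shape (batch, 1, time): 1/True for real frames, 0/False for padding.
mask = (torch.arange(max_time).unsqueeze(0) < lengths.unsqueeze(1)).unsqueeze(1)

model = FastSelfAttention(size=size, attention_heads=8, dropout_rate=0.1)
output = model(xs_pad, mask)
print(output.shape)  # torch.Size([3, 10, 64])
```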
transpose_for_scores(x)
Reshape and transpose input tensor for attention score computation.
This method reshapes the input tensor x from a shape of (batch, time, size) to (batch, n_heads, time, attn_dim) by splitting the last dimension into the number of attention heads and the size of each attention head. This transformation is essential for computing attention scores across multiple heads.
- Parameters:x (torch.Tensor) – Input tensor of shape (batch, time, size), where size is equal to n_heads * attn_dim.
- Returns: Reshaped tensor of shape (batch, n_heads, time, attn_dim).
- Return type: torch.Tensor
########### Examples
>>> attention = FastSelfAttention(size=64, attention_heads=4,
... dropout_rate=0.1)
>>> x = torch.randn(2, 10, 64) # (batch, time, size)
>>> transposed_x = attention.transpose_for_scores(x)
>>> transposed_x.shape
torch.Size([2, 4, 10, 16]) # (batch, n_heads, time, attn_dim)
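Equivalently, the reshape can be written out directly with view and transpose; a minimal sketch of what the method does, using the same shapes as the example above:

```python
import torch

batch, time, n_heads, attn_dim = 2, 10, 4, 16
x = torch.randn(batch, time, n_heads * attn_dim)   # (batch, time, size)

y = x.view(batch, time, n_heads, attn_dim)          # split the last dimension per head
y = y.transpose(1, 2)                                # (batch, n_heads, time, attn_dim)
print(y.shape)  # torch.Size([2, 4, 10, 16])
```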