espnet2.asr.encoder.beats_encoder.TransformerSentenceEncoderLayer
class espnet2.asr.encoder.beats_encoder.TransformerSentenceEncoderLayer(embedding_dim: float = 768, ffn_embedding_dim: float = 3072, num_attention_heads: float = 8, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, activation_fn: str = 'relu', layer_norm_first: bool = False, deep_norm: bool = False, has_relative_attention_bias: bool = False, num_buckets: int = 0, max_distance: int = 0, rescale_init: bool = False, gru_rel_pos: bool = False, encoder_layers: int = 0)
Bases: Module
Transformer encoder layer for sentence encoding.
This class implements a single layer of the Transformer encoder: multi-headed self-attention followed by a position-wise feedforward network. Dropout rates, the activation function, and the normalization strategy (pre-norm, post-norm, or deep norm) are configurable.
embedding_dim
The dimension of the input embeddings.
- Type: float
dropout
The dropout probability applied to the output.
- Type: float
activation_dropout
The dropout probability applied after the activation function in the feedforward network.
- Type: float
activation_fn
The activation function used in the feedforward network.
- Type: callable
self_attn
The multi-headed attention mechanism.
- Type: MultiheadAttention
fc1
The first linear layer in the feedforward network.
- Type: nn.Linear
fc2
The second linear layer in the feedforward network.
- Type: nn.Linear
layer_norm_first
If True, applies layer normalization before the attention mechanism.
- Type: bool
final_layer_norm
The layer normalization applied at the end of the layer.
- Type: LayerNorm
Parameters:
- embedding_dim (float) – Dimension of the input embeddings. Default is 768.
- ffn_embedding_dim (float) – Dimension of the feedforward network. Default is 3072.
- num_attention_heads (float) – Number of attention heads in the multi-head attention mechanism. Default is 8.
- dropout (float) – Dropout probability for the output. Default is 0.1.
- attention_dropout (float) – Dropout probability for attention weights. Default is 0.1.
- activation_dropout (float) – Dropout probability after activation in the feedforward network. Default is 0.1.
- activation_fn (str) – Activation function to use. Default is “relu”.
- layer_norm_first (bool) – If True, applies layer normalization before the attention. Default is False.
- deep_norm (bool) – If True, applies deep normalization. Default is False.
- has_relative_attention_bias (bool) – If True, enables relative attention bias. Default is False.
- num_buckets (int) – Number of buckets for relative position encoding. Default is 0.
- max_distance (int) – Maximum distance for relative position encoding. Default is 0.
- rescale_init (bool) – If True, rescales initialization. Default is False.
- gru_rel_pos (bool) – If True, uses gated relative position encoding. Default is False.
- encoder_layers (int) – Total number of encoder layers. Default is 0.
####### Examples
>>> import torch
>>> layer = TransformerSentenceEncoderLayer()
>>> input_tensor = torch.rand(10, 32, 768)  # (seq_len, batch_size, embedding_dim)
>>> output, attn, pos_bias = layer(input_tensor)
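The layer can also be configured for pre-norm operation with gated relative position bias. A minimal sketch, assuming the forward return convention (x, attn, pos_bias) of the BEATs backbone; the bucket and distance values here are illustrative, not defaults of this class:
>>> layer = TransformerSentenceEncoderLayer(
...     embedding_dim=768,
...     ffn_embedding_dim=3072,
...     num_attention_heads=8,
...     layer_norm_first=True,             # pre-norm: LayerNorm before each sub-layer
...     has_relative_attention_bias=True,  # this layer computes the position bias
...     num_buckets=320,                   # illustrative bucket count
...     max_distance=800,                  # illustrative maximum relative distance
...     gru_rel_pos=True,                  # gated relative position encoding
... )
>>> x = torch.rand(10, 32, 768)  # (T, B, C)
>>> out, attn, pos_bias = layer(x)
>>> out.shape
torch.Size([10, 32, 768])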
NOTE
The input to the forward method should be of shape (T, B, C) where T is the sequence length, B is the batch size, and C is the embedding dimension.
- Raises: AssertionError – If the input tensor does not match the expected dimensions.
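Since ESPnet features typically arrive batch-first as (B, T, C), a quick transpose puts them in the layout this layer expects (a usage sketch, not part of the class):
>>> feats = torch.rand(4, 100, 768)  # (B, T, C), batch-first
>>> x = feats.transpose(0, 1)        # (T, B, C), as expected by forward
>>> x.shape
torch.Size([100, 4, 768])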
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x: Tensor, self_attn_mask: Tensor | None = None, self_attn_padding_mask: Tensor | None = None, need_weights: bool = False, pos_bias=None)
Compute a forward pass through the encoder layer.
This method applies multi-headed self-attention followed by a position-wise feedforward network, with residual connections and layer normalization around both sub-layers. The layer_norm_first flag selects pre-norm or post-norm ordering, and deep_norm additionally rescales the residual branches when enabled.
Parameters:
- x (torch.Tensor) – Input tensor of shape (T, B, C), where T is the sequence length, B is the batch size, and C is the embedding dimension.
- self_attn_mask (torch.Tensor, optional) – Mask applied to the self-attention weights. Default is None.
- self_attn_padding_mask (torch.Tensor, optional) – Padding mask of shape (B, T) marking positions to be ignored by attention. Default is None.
- need_weights (bool) – If True, the attention weights are also returned. Default is False.
- pos_bias (torch.Tensor, optional) – Relative position bias, typically computed by the first layer in a stack and passed on to subsequent layers. Default is None.
Returns:
- x (torch.Tensor): The transformed tensor of shape (T, B, C).
- attn (Optional[torch.Tensor]): The attention weights, or None when need_weights is False.
- pos_bias (Optional[torch.Tensor]): The relative position bias to forward to the next layer.
Return type: Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]
####### Examples
>>> import torch
>>> layer = TransformerSentenceEncoderLayer()
>>> x = torch.rand(10, 32, 768)  # (T, B, C)
>>> pad_mask = torch.zeros(32, 10, dtype=torch.bool)  # (B, T), True marks padding
>>> out, attn, pos_bias = layer(x, self_attn_padding_mask=pad_mask)
>>> out.shape
torch.Size([10, 32, 768])
NOTE
The input must be of shape (T, B, C); batch-first tensors should be transposed first, as shown in the shape sketch above.
- Raises: AssertionError – If the input tensor does not match the expected dimensions.
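When several of these layers are stacked, the relative position bias computed by the first layer is threaded through the rest, mirroring the loop in the BEATs encoder. A minimal sketch under the assumptions above; the two-layer stack and bias settings are illustrative:
>>> import torch
>>> first = TransformerSentenceEncoderLayer(
...     has_relative_attention_bias=True,  # only the first layer owns the bias module
...     num_buckets=320,
...     max_distance=800,
... )
>>> second = TransformerSentenceEncoderLayer()
>>> x = torch.rand(10, 32, 768)               # (T, B, C)
>>> x, _, pos_bias = first(x)                 # first layer computes the position bias
>>> x, _, _ = second(x, pos_bias=pos_bias)    # later layers reuse it
>>> x.shape
torch.Size([10, 32, 768])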