espnet2.asr.encoder.beats_encoder.TransformerSentenceEncoderLayer
class espnet2.asr.encoder.beats_encoder.TransformerSentenceEncoderLayer(embedding_dim: float = 768, ffn_embedding_dim: float = 3072, num_attention_heads: float = 8, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, activation_fn: str = 'relu', layer_norm_first: bool = False, deep_norm: bool = False, has_relative_attention_bias: bool = False, num_buckets: int = 0, max_distance: int = 0, rescale_init: bool = False, gru_rel_pos: bool = False, encoder_layers: int = 0)
Bases: Module
Transformer encoder layer for sentence encoding.
This class implements a single layer of the Transformer encoder: multi-headed self-attention followed by a position-wise feedforward network. Dropout rates, the activation function, and the normalization strategy (pre-norm, post-norm, or deep norm) are configurable.
embedding_dim
The dimension of the input embeddings.
- Type: float
dropout
The dropout probability applied to the output.
- Type: float
activation_dropout
The dropout probability applied after the activation function in the feedforward network.
- Type: float
activation_fn
The activation function used in the feedforward network.
- Type: callable
self_attn
The multi-headed attention mechanism.
- Type: MultiheadAttention
fc1
The first linear layer in the feedforward network.
- Type: nn.Linear
fc2
The second linear layer in the feedforward network.
- Type: nn.Linear
layer_norm_first
If True, applies layer normalization before the attention mechanism.
- Type: bool
final_layer_norm
The layer normalization applied at the end of the layer.
- Type: LayerNorm
Parameters:
- embedding_dim (float) – Dimension of the input embeddings. Default is 768.
- ffn_embedding_dim (float) – Dimension of the feedforward network. Default is 3072.
- num_attention_heads (float) – Number of attention heads in the multi-head attention mechanism. Default is 8.
- dropout (float) – Dropout probability for the output. Default is 0.1.
- attention_dropout (float) – Dropout probability for attention weights. Default is 0.1.
- activation_dropout (float) – Dropout probability after activation in the feedforward network. Default is 0.1.
- activation_fn (str) – Activation function to use. Default is “relu”.
- layer_norm_first (bool) – If True, applies layer normalization before the attention. Default is False.
- deep_norm (bool) – If True, applies deep normalization. Default is False.
- has_relative_attention_bias (bool) – If True, enables relative attention bias. Default is False.
- num_buckets (int) – Number of buckets for relative position encoding. Default is 0.
- max_distance (int) – Maximum distance for relative position encoding. Default is 0.
- rescale_init (bool) – If True, rescales initialization. Default is False.
- gru_rel_pos (bool) – If True, uses gated relative position encoding. Default is False.
- encoder_layers (int) – Total number of encoder layers. Default is 0.
####### Examples
>>> import torch
>>> layer = TransformerSentenceEncoderLayer()
>>> input_tensor = torch.rand(10, 32, 768)  # (seq_len, batch_size, embedding_dim)
>>> output, attn, pos_bias = layer(input_tensor)
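The layer can also be configured for pre-norm operation with gated relative position bias. A minimal sketch, assuming the forward return convention (x, attn, pos_bias) of the BEATs backbone; the bucket and distance values here are illustrative, not defaults of this class:
>>> layer = TransformerSentenceEncoderLayer(
...     embedding_dim=768,
...     ffn_embedding_dim=3072,
...     num_attention_heads=8,
...     layer_norm_first=True,             # pre-norm: LayerNorm before each sub-layer
...     has_relative_attention_bias=True,  # this layer computes the position bias
...     num_buckets=320,                   # illustrative bucket count
...     max_distance=800,                  # illustrative maximum relative distance
...     gru_rel_pos=True,                  # gated relative position encoding
... )
>>> x = torch.rand(10, 32, 768)  # (T, B, C)
>>> out, attn, pos_bias = layer(x)
>>> out.shape
torch.Size([10, 32, 768])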
NOTE
The input to the forward method should be of shape (T, B, C) where T is the sequence length, B is the batch size, and C is the embedding dimension.
- Raises: AssertionError – If the input tensor does not match the expected dimensions.
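Since ESPnet features typically arrive batch-first as (B, T, C), a quick transpose puts them in the layout this layer expects (a usage sketch, not part of the class):
>>> feats = torch.rand(4, 100, 768)  # (B, T, C), batch-first
>>> x = feats.transpose(0, 1)        # (T, B, C), as expected by forward
>>> x.shape
torch.Size([100, 4, 768])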
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x: Tensor, self_attn_mask: Tensor | None = None, self_attn_padding_mask: Tensor | None = None, need_weights: bool = False, pos_bias=None)
Compute a forward pass through the encoder layer.
This method applies multi-headed self-attention followed by a position-wise feedforward network, with residual connections and layer normalization around both sub-layers. The layer_norm_first flag selects pre-norm or post-norm ordering, and deep_norm additionally rescales the residual branches when enabled.
Parameters:
- x (torch.Tensor) – Input tensor of shape (T, B, C), where T is the sequence length, B is the batch size, and C is the embedding dimension.
- self_attn_mask (torch.Tensor, optional) – Mask applied to the self-attention weights. Default is None.
- self_attn_padding_mask (torch.Tensor, optional) – Padding mask of shape (B, T) marking positions to be ignored by attention. Default is None.
- need_weights (bool) – If True, the attention weights are also returned. Default is False.
- pos_bias (torch.Tensor, optional) – Relative position bias, typically computed by the first layer in a stack and passed on to subsequent layers. Default is None.
Returns:
- x (torch.Tensor): The transformed tensor of shape (T, B, C).
- attn (Optional[torch.Tensor]): The attention weights, or None when need_weights is False.
- pos_bias (Optional[torch.Tensor]): The relative position bias to forward to the next layer.
Return type: Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]
####### Examples
>>> import torch
>>> layer = TransformerSentenceEncoderLayer()
>>> x = torch.rand(10, 32, 768)  # (T, B, C)
>>> pad_mask = torch.zeros(32, 10, dtype=torch.bool)  # (B, T), True marks padding
>>> out, attn, pos_bias = layer(x, self_attn_padding_mask=pad_mask)
>>> out.shape
torch.Size([10, 32, 768])
NOTE
The input must be of shape (T, B, C); batch-first tensors should be transposed first, as shown in the shape sketch above.
- Raises: AssertionError – If the input tensor does not match the expected dimensions.
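When several of these layers are stacked, the relative position bias computed by the first layer is threaded through the rest, mirroring the loop in the BEATs encoder. A minimal sketch under the assumptions above; the two-layer stack and bias settings are illustrative:
>>> import torch
>>> first = TransformerSentenceEncoderLayer(
...     has_relative_attention_bias=True,  # only the first layer owns the bias module
...     num_buckets=320,
...     max_distance=800,
... )
>>> second = TransformerSentenceEncoderLayer()
>>> x = torch.rand(10, 32, 768)               # (T, B, C)
>>> x, _, pos_bias = first(x)                 # first layer computes the position bias
>>> x, _, _ = second(x, pos_bias=pos_bias)    # later layers reuse it
>>> x.shape
torch.Size([10, 32, 768])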