espnet2.asr.encoder.beats_encoder.GLU_Linear


class espnet2.asr.encoder.beats_encoder.GLU_Linear(input_dim, output_dim, glu_type='sigmoid', bias_in_glu=True)

Bases: Module

GLU Linear layer.

This class implements a Gated Linear Unit (GLU) layer, a linear layer with a gating mechanism. The input is projected by a single linear layer, and the projection is split into two parts of size output_dim: one part is kept as is, and the other is passed through the non-linear activation selected by glu_type to form the gate. The output is the element-wise product of the two parts.

glu_type

The type of activation function to use for gating. Options are ‘sigmoid’, ‘swish’, ‘relu’, and ‘gelu’.

Type: str

output_dim

The dimension of the output from the GLU layer.

Type: int

linear

The linear transformation applied to the input.

Type: nn.Linear

Parameters:
- input_dim (int) – The number of input features.
- output_dim (int) – The number of output features.
- glu_type (str) – The activation function used for the GLU gate. Defaults to “sigmoid”.
- bias_in_glu (bool) – Whether to include a bias term in the linear transformation. Defaults to True.

####### Examples

>>> import torch
>>> from espnet2.asr.encoder.beats_encoder import GLU_Linear
>>> glu_layer = GLU_Linear(input_dim=128, output_dim=64, glu_type='swish')
>>> input_tensor = torch.randn(32, 10, 128)  # (batch, time, features)
>>> output_tensor = glu_layer(input_tensor)
>>> output_tensor.shape
torch.Size([32, 10, 64])
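
The gating arithmetic described in the class summary can be illustrated without any dependencies. The following is a minimal sketch, not the ESPnet implementation: the names `GATES` and `glu_gate` are hypothetical, and it assumes that the first half of the projected vector carries the values while the second half feeds the gate activation.

```python
import math

# Hypothetical lookup table mirroring the four documented glu_type options.
GATES = {
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
    "swish": lambda v: v / (1.0 + math.exp(-v)),  # v * sigmoid(v)
    "relu": lambda v: max(0.0, v),
    "gelu": lambda v: 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))),
}


def glu_gate(projected, output_dim, glu_type="sigmoid"):
    """Multiply the first half of `projected` by the activated second half.

    Assumption: values come first, gate inputs second.
    """
    if len(projected) != 2 * output_dim:
        raise ValueError("projected must hold 2 * output_dim values")
    gate = GATES[glu_type]
    values = projected[:output_dim]
    gate_inputs = projected[output_dim:]
    return [v * gate(g) for v, g in zip(values, gate_inputs)]


# output_dim = 2: values are [1.0, -2.0], gate inputs are [0.0, 3.0].
print(glu_gate([1.0, -2.0, 0.0, 3.0], output_dim=2, glu_type="relu"))
# [0.0, -6.0]
```

With the relu gate, a gate input of 0.0 suppresses the first value entirely, while a gate input of 3.0 scales the second value by 3.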

NOTE

The GLU mechanism is particularly useful in tasks such as natural language processing and speech processing, where controlling the flow of information can improve performance.

forward(x)

Applies the gated linear transformation to the input.

The input is projected by linear, the projection is split along the last dimension into two parts of size output_dim, and one part is multiplied element-wise by the glu_type activation of the other.

Parameters:
- x (torch.Tensor) – Input tensor of shape (B, T, input_dim), where B is the batch size and T is the sequence length.
Returns:
- Gated output tensor of shape (B, T, output_dim).
Return type: torch.Tensor

####### Examples

>>> glu_layer = GLU_Linear(input_dim=128, output_dim=64)
>>> x = torch.randn(10, 20, 128)  # (B, T, D)
>>> y = glu_layer.forward(x)
>>> y.shape
torch.Size([10, 20, 64])
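
For the GLU_Linear layer itself, the forward computation can be sketched in PyTorch as below. This is an illustrative stand-in rather than the ESPnet source: the class name `SketchGLULinear` is hypothetical, only the sigmoid gate is shown, and the choice of gating the first half of the projection with the second half is an assumption. Under that assumption the result matches PyTorch's built-in `torch.nn.functional.glu`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchGLULinear(nn.Module):
    """Illustrative stand-in for a sigmoid-gated linear layer (not the ESPnet class)."""

    def __init__(self, input_dim, output_dim, bias=True):
        super().__init__()
        self.output_dim = output_dim
        # A single projection produces both the values and the gate inputs.
        self.linear = nn.Linear(input_dim, 2 * output_dim, bias=bias)

    def forward(self, x):
        projected = self.linear(x)  # (..., 2 * output_dim)
        # Assumption: first half holds the values, second half the gate inputs.
        values, gate_inputs = projected.split(self.output_dim, dim=-1)
        return values * torch.sigmoid(gate_inputs)  # (..., output_dim)


torch.manual_seed(0)
layer = SketchGLULinear(input_dim=128, output_dim=64)
x = torch.randn(10, 20, 128)  # (B, T, D)
y = layer(x)
print(y.shape)

# For the sigmoid gate, the sketch agrees with PyTorch's built-in GLU.
reference = F.glu(layer.linear(x), dim=-1)
print(torch.allclose(y, reference))
```

Using one projection of width 2 * output_dim, rather than two separate linear layers, keeps the values and the gate in a single matrix multiplication.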