espnet2.slu.postencoder.conformer_postencoder.ConformerPostEncoder


class espnet2.slu.postencoder.conformer_postencoder.ConformerPostEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'linear', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1)

Bases: AbsPostEncoder

Conformer PostEncoder for sequence-to-sequence models.

This class implements a Conformer encoder module that processes input sequences with multi-head self-attention, convolution modules, and position-wise feed-forward networks. The attention layers capture global context across the sequence, while the convolution modules capture local context, which suits speech and language processing tasks.

output_size

The output dimension of the encoder.

Type: int

embed

The embedding layer that includes positional encoding.

Type: torch.nn.Sequential

normalize_before

Flag to indicate if layer normalization is applied before the first block.

Type: bool

encoders

A list of encoder layers.

Type: torch.nn.ModuleList

after_norm

Layer normalization applied after the encoder if normalize_before is True.

Type: torch.nn.LayerNorm

Parameters:
- input_size (int) – Input dimension.
- output_size (int) – Dimension of attention (default: 256).
- attention_heads (int) – Number of heads in multi-head attention (default: 4).
- linear_units (int) – Number of units in position-wise feed forward (default: 2048).
- num_blocks (int) – Number of encoder blocks (default: 6).
- dropout_rate (float) – Dropout rate (default: 0.1).
- attention_dropout_rate (float) – Dropout rate in attention (default: 0.0).
- positional_dropout_rate (float) – Dropout rate after adding positional encoding (default: 0.1).
- input_layer (Union[str, torch.nn.Module]) – Input layer type (default: “linear”).
- normalize_before (bool) – Whether to use layer normalization before the first block (default: True).
- concat_after (bool) – Whether to concatenate input and output of attention layer (default: False).
- positionwise_layer_type (str) – Type of position-wise layer (“linear”, “conv1d”, or “conv1d-linear”, default: “linear”).
- positionwise_conv_kernel_size (int) – Kernel size for position-wise convolution (default: 3).
- rel_pos_type (str) – Type of relative positional encoding (“legacy” or “latest”, default: “legacy”).
- pos_enc_layer_type (str) – Type of encoder positional encoding layer (default: “rel_pos”).
- selfattention_layer_type (str) – Type of encoder attention layer (default: “rel_selfattn”).
- activation_type (str) – Activation function type (default: “swish”).
- macaron_style (bool) – Whether to use Macaron style for position-wise layer (default: False).
- use_cnn_module (bool) – Whether to use convolution module (default: True).
- zero_triu (bool) – Whether to zero the upper triangular part of the attention matrix (default: False).
- cnn_module_kernel (int) – Kernel size of convolution module (default: 31).
- padding_idx (int) – Padding index for input_layer=embed (default: -1).
Raises: ValueError – If unknown values are provided for rel_pos_type, pos_enc_layer_type, or input_layer.
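
Several of the size parameters above constrain one another. The sketch below is a small pre-flight check, not part of ESPnet: `check_conformer_config` is a hypothetical helper, and both rules are assumptions based on how multi-head attention and symmetrically padded convolutions are usually built (the model dimension is split evenly across heads, and the convolution kernel size is odd).

```python
def check_conformer_config(output_size=256, attention_heads=4, cnn_module_kernel=31):
    """Hypothetical helper: sanity-check sizes before building the encoder.

    Returns the per-head attention dimension if the sizes are consistent.
    """
    # Multi-head attention splits the model dimension evenly across heads.
    if output_size % attention_heads != 0:
        raise ValueError(
            f"output_size={output_size} is not divisible by "
            f"attention_heads={attention_heads}"
        )
    # Assumption: the convolution module pads symmetrically, so the kernel is odd.
    if cnn_module_kernel % 2 == 0:
        raise ValueError(f"cnn_module_kernel={cnn_module_kernel} should be odd")
    return output_size // attention_heads


# With the documented defaults, each of the 4 heads works on 64 dimensions.
print(check_conformer_config())  # 64
```

Running a check like this before constructing the encoder surfaces a size mismatch as a readable message rather than as an error raised deep inside a submodule.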

Examples

>>> import torch
>>> from espnet2.slu.postencoder.conformer_postencoder import ConformerPostEncoder
>>> encoder = ConformerPostEncoder(input_size=128)
>>> input_tensor = torch.randn(10, 20, 128)  # (batch_size, seq_len, feature_dim)
>>> input_lengths = torch.tensor([20] * 10)  # All sequences are of length 20
>>> output, output_lengths = encoder(input_tensor, input_lengths)
>>> print(output.shape)  # Should match (batch_size, seq_len, output_size)
>>> print(output_lengths.shape)  # Should match (batch_size,)


forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]

Forward pass through the ConformerPostEncoder.

This method processes the input tensor through the embedding layer, followed by multiple encoder layers. It applies masking to handle padded sequences and normalizes the output if specified.

Parameters:
- input (torch.Tensor) – The input tensor of shape (batch_size, sequence_length, feature_dim).
- input_lengths (torch.Tensor) – A tensor of shape (batch_size,) containing the lengths of each input sequence before padding.
Returns: A tuple containing:
- output (torch.Tensor): The processed output tensor of shape (batch_size, sequence_length, output_dim).
- olens (torch.Tensor): A tensor of shape (batch_size,) representing the lengths of the outputs after processing.
Return type: Tuple[torch.Tensor, torch.Tensor]

Examples

>>> import torch
>>> from espnet2.slu.postencoder.conformer_postencoder import ConformerPostEncoder
>>> encoder = ConformerPostEncoder(input_size=128)
>>> input_tensor = torch.randn(32, 100, 128)  # Batch of 32
>>> input_lengths = torch.tensor([100] * 32)  # All sequences have length 100
>>> output, output_lengths = encoder(input_tensor, input_lengths)
>>> output.shape  # Last dimension is output_size, which defaults to 256
torch.Size([32, 100, 256])

NOTE

The input tensor should be appropriately padded and the input_lengths should reflect the actual lengths of the sequences to ensure correct masking.
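
To make the relationship between padding, lengths, and masking concrete, here is a minimal pure-Python sketch. The helper `pad_and_mask` is hypothetical and only illustrates the idea; it is not how ESPnet builds its masks internally, and each frame is a single number here, where real inputs would carry a feature vector per frame.

```python
def pad_and_mask(sequences, pad_value=0.0):
    """Hypothetical helper: pad variable-length sequences and build a mask."""
    # True lengths, recorded before any padding is added.
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    # Right-pad every sequence to the length of the longest one.
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # True marks a real frame, False marks a padded position.
    mask = [[t < n for t in range(max_len)] for n in lengths]
    return padded, lengths, mask


padded, lengths, mask = pad_and_mask([[0.1, 0.2, 0.3], [0.4]])
print(lengths)    # [3, 1]
print(padded[1])  # [0.4, 0.0, 0.0]
print(mask[1])    # [True, False, False]
```

In tensor form the same mask is commonly computed as `torch.arange(max_len)[None, :] < lengths[:, None]`. If `input_lengths` overstated a sequence's true length, padded frames would be treated as real input and could influence the attention output.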

Raises: ValueError – If the input tensor’s shape does not match the expected dimensions or if the input_lengths tensor has an incompatible size.

output_size() → int

Get the output size of the ConformerPostEncoder.

This method returns the output size that was specified during the initialization of the ConformerPostEncoder. The output size is the dimension of the attention layer, which is used to shape the output of the encoder.

Returns: The output size of the encoder.
Return type: int

Examples

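>>> from espnet2.slu.postencoder.conformer_postencoder import ConformerPostEncoder
>>> # input_size is required; 128 is an arbitrary value chosen for this example
>>> conformer_post_encoder = ConformerPostEncoder(input_size=128, output_size=512)
>>> conformer_post_encoder.output_size()
512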