espnet2.spk.encoder.conformer_encoder.MfaConformerEncoder
class espnet2.spk.encoder.conformer_encoder.MfaConformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str | None = 'conv2d2', normalize_before: bool = True, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, stochastic_depth_rate: float | List[float] = 0.0, layer_drop_rate: float = 0.0, max_pos_emb_len: int = 5000, padding_idx: int | None = None)
Bases: AbsEncoder
Conformer encoder module for MFA-Conformer.
This module implements a Conformer encoder as described in: Y. Zhang et al., “MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification,” in Proc. INTERSPEECH, 2022.
_output_size
The output size of the encoder, calculated as output_size multiplied by num_blocks.
Type: int
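The name reflects the multi-scale feature aggregation (MFA) design: the outputs of all num_blocks encoder blocks are concatenated along the feature dimension, which is why _output_size equals output_size * num_blocks. A minimal sketch of that aggregation step in plain PyTorch (the per-block outputs below are random stand-ins, not taken from the encoder):
>>> import torch
>>> num_blocks, batch, seq_len, output_size = 6, 32, 100, 256
>>> block_outputs = [torch.randn(batch, seq_len, output_size) for _ in range(num_blocks)]
>>> aggregated = torch.cat(block_outputs, dim=-1)  # concatenate along feature dim
>>> aggregated.shape
torch.Size([32, 100, 1536])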
Parameters:
- input_size (int) – Input dimension.
- output_size (int) – Dimension of attention.
- attention_heads (int) – The number of heads in multi-head attention.
- linear_units (int) – The number of units in the position-wise feed-forward layer.
- num_blocks (int) – The number of encoder blocks.
- dropout_rate (float) – Dropout rate.
- attention_dropout_rate (float) – Dropout rate in attention.
- positional_dropout_rate (float) – Dropout rate after adding positional encoding.
- input_layer (Union[str, torch.nn.Module]) – Input layer type.
- normalize_before (bool) – Whether to use layer_norm before the first block.
- positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
- positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
- rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More details can be found in https://github.com/espnet/espnet/pull/2816.
- pos_enc_layer_type (str) – Encoder positional encoding layer type.
- selfattention_layer_type (str) – Encoder attention layer type.
- activation_type (str) – Encoder activation function type.
- macaron_style (bool) – Whether to use macaron style for positionwise layer.
- use_cnn_module (bool) – Whether to use convolution module.
- zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
- cnn_module_kernel (int) – Kernel size of convolution module.
- padding_idx (int) – Padding index for input_layer=embed.
- stochastic_depth_rate (Union[float, List[float]]) – Stochastic depth rate, given either as a single value or per encoder layer.
- layer_drop_rate (float) – Probability of dropping an entire encoder layer during training (LayerDrop).
- max_pos_emb_len (int) – Maximum length of the positional encoding.
######### Examples
>>> import torch
>>> from espnet2.spk.encoder.conformer_encoder import MfaConformerEncoder
>>> encoder = MfaConformerEncoder(input_size=80, output_size=256)
>>> input_tensor = torch.randn(32, 100, 80)  # (batch, seq_len, input_size)
>>> output_tensor = encoder(input_tensor)
>>> output_tensor.shape[-1]  # output_size * num_blocks = 256 * 6
1536
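A sketch with non-default settings (the values below are illustrative assumptions, not recommended hyperparameters); the output feature dimension is still output_size * num_blocks:
>>> encoder = MfaConformerEncoder(
...     input_size=80,
...     output_size=144,
...     num_blocks=4,
...     macaron_style=True,
...     positionwise_layer_type="conv1d",
...     positionwise_conv_kernel_size=3,
... )
>>> encoder.output_size()
576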
- Raises:
- ValueError – If an unknown value is provided for rel_pos_type, pos_enc_layer_type, input_layer, or selfattention_layer_type.
- NotImplementedError – If the positionwise_layer_type is not supported.
- ValueError – If the length of stochastic_depth_rate does not match num_blocks.
forward(x: Tensor) → Tensor
Calculate forward propagation.
This method computes the forward pass of the MfaConformerEncoder: the input is passed through the embedding (input) layer and a series of Conformer encoder blocks, and the outputs of all blocks are concatenated along the feature dimension.
- Parameters:x (torch.Tensor) – Input tensor of shape (#batch, L, input_size).
- Returns: Output tensor of shape (#batch, L’, output_size * num_blocks), where L’ is the sequence length after input-layer subsampling and the last dimension concatenates the outputs of all encoder blocks.
- Return type: torch.Tensor
- Raises:NotImplementedError – If the input layer type is not supported.
######### Examples
>>> encoder = MfaConformerEncoder(input_size=128)
>>> input_tensor = torch.randn(32, 100, 128)  # 32 batches, length 100
>>> output = encoder(input_tensor)
>>> output.shape[-1]  # output_size (256) * num_blocks (6)
1536
NOTE
The method assumes the input tensor is compatible with the configured input layer type. Convolutional input layers (e.g., the default conv2d2) subsample the time dimension, so the output sequence is shorter than the input, as the sketch below illustrates.
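A sketch of that effect, continuing from the examples above (the exact subsampled length depends on the convolution kernels and strides, so only the batch and feature dimensions are asserted):
>>> encoder = MfaConformerEncoder(input_size=80, output_size=256, num_blocks=6)
>>> x = torch.randn(8, 200, 80)
>>> y = encoder(x)
>>> (y.shape[0], y.shape[-1])
(8, 1536)
>>> y.shape[1] < 200  # time axis shortened by conv2d2 subsampling
True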
output_size() → int
Return the output size of the encoder, computed as the per-block output size multiplied by the number of encoder blocks (output_size * num_blocks). This is the feature dimension of the tensor produced by forward().
- Returns: The computed output size of the encoder.
- Return type: int
######### Examples
>>> encoder = MfaConformerEncoder(input_size=128, output_size=256,
... num_blocks=6)
>>> encoder.output_size()
1536 # 256 * 6
NOTE
This value is useful for determining the feature dimension of the encoder’s output tensor, e.g., when sizing downstream layers.
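As a hypothetical usage sketch, output_size() can size downstream layers; the pooling and projection below are illustrative and not part of this class:
>>> import torch
>>> import torch.nn as nn
>>> encoder = MfaConformerEncoder(input_size=80, output_size=256, num_blocks=6)
>>> projection = nn.Linear(encoder.output_size(), 192)  # 1536 -> speaker-embedding dim
>>> feats = encoder(torch.randn(4, 300, 80))
>>> embedding = projection(feats.mean(dim=1))  # simple temporal average pooling
>>> embedding.shape
torch.Size([4, 192])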