espnet2.spk.encoder.ska_tdnn_encoder.SkaTdnnEncoder


class espnet2.spk.encoder.ska_tdnn_encoder.SkaTdnnEncoder(input_size: int, block: str = 'Bottle2neck', ndim: int = 1024, model_scale: int = 8, skablock: str = 'ResBlock', ska_dim: int = 128, output_size: int = 1536, **kwargs)

Bases: AbsEncoder

SKA-TDNN encoder. Extracts frame-level SKA-TDNN embeddings from features.

Paper: S. Mun, J. Jung et al., “Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification,” in Proc. IEEE SLT 2022.

Parameters:
- input_size – Input feature dimension.
- block – Type of encoder block class to use. Defaults to “Bottle2neck”.
- ndim – Dimensionality of the hidden representation. Defaults to 1024.
- model_scale – Scale value of the Res2Net architecture. Defaults to 8.
- skablock – Type of SKA block to use. Defaults to “ResBlock”.
- ska_dim – Dimension of the SKA block. Defaults to 128.
- output_size – Output embedding dimension. Defaults to 1536.
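
To make the model_scale parameter concrete: Res2Net-style bottleneck blocks typically divide their hidden channels into `scale` equal-width groups that are processed one after another. The sketch below only does that arithmetic for the default values listed above. The helper name `res2net_group_width` is hypothetical, and the floor-division rule is an assumption about the block internals rather than something stated on this page.

```python
import math


def res2net_group_width(channels: int, scale: int) -> int:
    """Width of each channel group when `channels` is split into `scale` groups.

    Assumption: the split uses floor division, as is common in Res2Net-style blocks.
    """
    if scale <= 0:
        raise ValueError(f"scale must be positive, got: {scale}")
    return int(math.floor(channels / scale))


# Defaults from the parameter list: ndim=1024, model_scale=8
width = res2net_group_width(1024, 8)
print(width)  # 128
```

With these defaults each group is 128 channels wide; a larger model_scale gives more, narrower groups for the same ndim.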

_output_size

The output size of the encoder.

######### Examples

>>> import torch
>>> from espnet2.spk.encoder.ska_tdnn_encoder import SkaTdnnEncoder
>>> encoder = SkaTdnnEncoder(input_size=40)
>>> output = encoder(torch.randn(1, 40, 100))  # input of shape (B, D, S)
>>> print(output.shape)  # expected: (1, 1536, T)

Raises: ValueError – If an unsupported block type is provided.
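
The ValueError above concerns the string-valued options block and skablock. As a rough illustration of how such names can be validated and mapped to classes, here is a minimal sketch. It is not the ESPnet implementation: the registry dictionaries, the placeholder classes, and the function `resolve_block` are all hypothetical.

```python
class Bottle2neck:
    """Placeholder standing in for the real encoder block class."""


class ResBlock:
    """Placeholder standing in for the real SKA block class."""


# Hypothetical registries mapping option strings to classes
BLOCKS = {"Bottle2neck": Bottle2neck}
SKA_BLOCKS = {"ResBlock": ResBlock}


def resolve_block(name: str, registry: dict) -> type:
    """Look up a block class by name, raising ValueError for unknown names."""
    try:
        return registry[name]
    except KeyError:
        raise ValueError(f"unsupported block, got: {name}") from None


block_cls = resolve_block("Bottle2neck", BLOCKS)
ska_cls = resolve_block("ResBlock", SKA_BLOCKS)
```

Failing at lookup time means a misspelled option is reported before any layers are built.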


forward(x)

Forward function for the SkaTdnnEncoder.

This method processes the input tensor through a series of convolutional layers and residual blocks, ultimately producing a tensor of embeddings.

Parameters: x (torch.Tensor) – Input tensor of shape (B, D, S), where:
- B: Batch size
- D: Input feature dimension
- S: Sequence length
Returns: Output tensor of shape (B, output_size, T), where:
- output_size: Dimensionality of the output embeddings
- T: Output sequence length after processing through layers
Return type: torch.Tensor

######### Examples

>>> import torch
>>> from espnet2.spk.encoder.ska_tdnn_encoder import SkaTdnnEncoder
>>> encoder = SkaTdnnEncoder(input_size=40)
>>> input_tensor = torch.randn(8, 40, 100)  # batch of 8, 40 features, 100 time steps
>>> output = encoder.forward(input_tensor)
>>> output.shape  # (8, 1536, T), where T depends on the processing

NOTE

The input tensor is permuted to match the expected shape for the convolutional layers and is reshaped appropriately throughout the forward pass.
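
The note above does not say which axes are permuted. As a shape-bookkeeping aid, the sketch below mirrors how `torch.Tensor.permute` reorders a shape, using plain tuples. The helper `permute_shape` is hypothetical, and the choice of swapping the last two axes is an assumption made only for illustration.

```python
def permute_shape(shape: tuple, order: tuple) -> tuple:
    """Return the shape that results from reordering axes by `order`."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError(f"order must be a permutation of the axes, got: {order}")
    return tuple(shape[i] for i in order)


# Assumed example: swap the last two axes of a 3-D batch
original = (8, 40, 100)
swapped = permute_shape(original, (0, 2, 1))
print(swapped)  # (8, 100, 40)
```

Tracking shapes this way is a quick check that the feature axis lines up with input_size before running the real encoder.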

output_size() → int

Return the output embedding dimension of the encoder. This is the value of the output_size constructor argument (1536 by default), stored in _output_size.

Return type: int

######### Examples

>>> from espnet2.spk.encoder.ska_tdnn_encoder import SkaTdnnEncoder
>>> encoder = SkaTdnnEncoder(input_size=40)
>>> encoder.output_size()
1536
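
A common use of output_size() is to size whatever layer consumes the encoder's output. The sketch below illustrates the pattern with a stand-in class so that it runs without ESPnet. `StubEncoder` and `pooled_dim` are hypothetical names, and the assumption is a pooling step that concatenates per-channel statistics such as mean and standard deviation.

```python
class StubEncoder:
    """Hypothetical stand-in exposing the same output_size() contract."""

    def __init__(self, output_size: int = 1536):
        self._output_size = output_size

    def output_size(self) -> int:
        return self._output_size


def pooled_dim(encoder, num_stats: int = 2) -> int:
    """Feature width after concatenating `num_stats` per-channel statistics."""
    return encoder.output_size() * num_stats


encoder = StubEncoder()
print(pooled_dim(encoder))  # 3072
```

Querying the encoder rather than hard-coding 1536 keeps downstream layers consistent if output_size is changed.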