espnet2.spk.encoder.ska_tdnn_encoder.SkaTdnnEncoder
class espnet2.spk.encoder.ska_tdnn_encoder.SkaTdnnEncoder(input_size: int, block: str = 'Bottle2neck', ndim: int = 1024, model_scale: int = 8, skablock: str = 'ResBlock', ska_dim: int = 128, output_size: int = 1536, **kwargs)
Bases: AbsEncoder
SKA-TDNN encoder. Extracts frame-level SKA-TDNN embeddings from features.
Paper: S. Mun, J. Jung et al., “Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification,” in Proc. IEEE SLT 2022.
- Parameters:
- input_size – Input feature dimension.
- block – Type of encoder block class to use. Defaults to “Bottle2neck”.
- ndim – Dimensionality of the hidden representation. Defaults to 1024.
- model_scale – Scale value of the Res2Net architecture. Defaults to 8.
- skablock – Type of SKA block to use. Defaults to “ResBlock”.
- ska_dim – Dimension of the SKA block. Defaults to 128.
- output_size – Output embedding dimension. Defaults to 1536.
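To make `model_scale` concrete: in the usual Res2Net design, the hidden channels of a block are split into `scale` equal-width groups that are processed hierarchically. The sketch below only does that arithmetic for the documented defaults. The helper name `res2net_group_width` is hypothetical, and whether `Bottle2neck` rounds in exactly this way is an assumption, not something this page states.

```python
def res2net_group_width(ndim: int, model_scale: int) -> int:
    """Channels per group when ndim channels are split into model_scale groups.

    Hypothetical helper: assumes floor division, as in common Res2Net code.
    """
    if model_scale <= 0:
        raise ValueError(f"model_scale must be positive, got {model_scale}")
    return ndim // model_scale


# Documented defaults: ndim=1024, model_scale=8.
default_width = res2net_group_width(1024, 8)
print(default_width)  # 128
```

Under this assumption, a larger `model_scale` at a fixed `ndim` gives more, narrower groups per block.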
_output_size
The output size of the encoder.
######### Examples
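>>> import torch
>>> from espnet2.spk.encoder.ska_tdnn_encoder import SkaTdnnEncoder
>>> encoder = SkaTdnnEncoder(input_size=40)
>>> output = encoder(torch.randn(1, 40, 100))  # input of shape (B, D, S)
>>> print(output.shape)  # expected (1, 1536, T), where T is the output sequence length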
- Raises: ValueError – If an unsupported block type is provided.
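The constructor defaults and the `ValueError` behaviour above can be mirrored in a small configuration object that fails early, before any model is built. This is a minimal sketch, not part of ESPnet: `SkaTdnnConfig`, `SUPPORTED_BLOCKS`, and `SUPPORTED_SKA_BLOCKS` are hypothetical names, and the sets of supported values are assumed to contain only the defaults listed on this page.

```python
from dataclasses import asdict, dataclass

# Assumption: only the default block names documented on this page are supported.
SUPPORTED_BLOCKS = ("Bottle2neck",)
SUPPORTED_SKA_BLOCKS = ("ResBlock",)


@dataclass(frozen=True)
class SkaTdnnConfig:
    """Hypothetical container mirroring the documented constructor defaults."""

    input_size: int
    block: str = "Bottle2neck"
    ndim: int = 1024
    model_scale: int = 8
    skablock: str = "ResBlock"
    ska_dim: int = 128
    output_size: int = 1536

    def __post_init__(self) -> None:
        # Reject unknown block names up front, as the encoder is documented to do.
        if self.block not in SUPPORTED_BLOCKS:
            raise ValueError(f"unsupported block: {self.block!r}")
        if self.skablock not in SUPPORTED_SKA_BLOCKS:
            raise ValueError(f"unsupported skablock: {self.skablock!r}")


config = SkaTdnnConfig(input_size=40)
kwargs = asdict(config)  # plain dict of keyword arguments
```

With ESPnet installed, `kwargs` could then be passed on as `SkaTdnnEncoder(**kwargs)`, since the field names match the constructor signature shown at the top of this page.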
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Forward function for the SkaTdnnEncoder.
This method processes the input tensor through a series of convolutional layers and residual blocks, ultimately producing a tensor of embeddings.
- Parameters: x (torch.Tensor) – Input tensor of shape (B, D, S) where:
  - B: Batch size
  - D: Input feature dimension
  - S: Sequence length
- Returns: Output tensor of shape (B, output_size, T) where:
  - output_size: Dimensionality of the output embeddings
  - T: Output sequence length after processing through layers
- Return type: torch.Tensor
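The documented return shape can be checked mechanically after a forward call. The helper below is a hedged sketch: `check_encoder_output_shape` is a hypothetical name, it operates on a plain shape tuple (for example `tuple(output.shape)`), and it makes no claim about how T relates to S, because this page does not state that relationship.

```python
from typing import Sequence


def check_encoder_output_shape(
    shape: Sequence[int], batch_size: int, output_size: int = 1536
) -> int:
    """Validate a (B, output_size, T) shape and return T.

    Hypothetical helper; pass tuple(output.shape) from a real forward call.
    """
    if len(shape) != 3:
        raise ValueError(
            f"expected 3 dimensions (B, output_size, T), got {len(shape)}"
        )
    b, d, t = shape
    if b != batch_size:
        raise ValueError(f"batch size mismatch: expected {batch_size}, got {b}")
    if d != output_size:
        raise ValueError(f"embedding size mismatch: expected {output_size}, got {d}")
    return t


# Illustrative tuple only, not a measured encoder output.
frames = check_encoder_output_shape((8, 1536, 100), batch_size=8)
```

A check like this is useful in unit tests, where a pooled (B, output_size) tensor passed by mistake would otherwise go unnoticed until a later layer fails.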
######### Examples
>>> import torch
>>> from espnet2.spk.encoder.ska_tdnn_encoder import SkaTdnnEncoder
>>> encoder = SkaTdnnEncoder(input_size=40)
>>> input_tensor = torch.randn(8, 40, 100)  # batch of 8, 40 features, 100 time steps
>>> output = encoder.forward(input_tensor)
>>> output.shape  # (8, 1536, T), where T is the output sequence length
NOTE
- The input tensor is permuted to match the expected shape for the convolutional layers and is reshaped appropriately throughout the forward pass.
output_size() → int
SKA-TDNN encoder. Extracts frame-level SKA-TDNN embeddings from features.
Paper: S. Mun, J. Jung et al., “Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification,” in Proc. IEEE SLT 2022.
- Parameters:
- input_size (int) – Input feature dimension.
- block (str) – Type of encoder block class to use. Default is “Bottle2neck”.
- model_scale (int) – Scale value of the Res2Net architecture. Default is 8.
- ndim (int) – Dimensionality of the hidden representation. Default is 1024.
- skablock (str) – Type of SKA block class to use. Default is “ResBlock”.
- ska_dim (int) – Dimension for the SKA block. Default is 128.
- output_size (int) – Output embedding dimension. Default is 1536.
_output_size
The output embedding dimension of the encoder.
- Type: int
######### Examples
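>>> import torch
>>> from espnet2.spk.encoder.ska_tdnn_encoder import SkaTdnnEncoder
>>> encoder = SkaTdnnEncoder(input_size=40)
>>> output = encoder(torch.randn(10, 40, 100))
>>> output.shape  # (10, 1536, T), matching the (B, output_size, T) shape documented for forward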