espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling
class espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling(input_size: int = 1536)
Bases: AbsPooling
Aggregates frame-level features to a single utterance-level feature.
This pooling method is proposed in B. Desplanques et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification”.
_output_size
The output dimensionality, which is double the input size.
Type: int
attention
A sequential container that applies convolution, activation, batch normalization, and softmax layers to compute the channel attention weights.
Type: nn.Sequential
Parameters: input_size (int) – Dimensionality of the input frame-level embeddings, determined by the encoder hyperparameters. The output dimensionality will be double the input_size.
Returns: The pooled utterance-level feature.
Return type: torch.Tensor
Raises: ValueError – If task_tokens is not None, as ChannelAttentiveStatisticsPooling does not support task tokens.
######### Examples
>>> import torch
>>> from espnet2.spk.pooling.chn_attn_stat_pooling import ChnAttnStatPooling
>>> pooling_layer = ChnAttnStatPooling(input_size=1536)
>>> frame_level_features = torch.randn(10, 1536, 20)  # (batch_size, input_size, time_steps)
>>> pooled_features = pooling_layer(frame_level_features)
>>> pooled_features.shape
torch.Size([10, 3072])
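For context, the sketch below shows how the layer might sit behind a frame-level feature extractor. The Conv1d front end here is a hypothetical stand-in for an actual ESPnet encoder, which is assumed to emit features in (batch_size, channels, time) layout:
>>> import torch
>>> import torch.nn as nn
>>> from espnet2.spk.pooling.chn_attn_stat_pooling import ChnAttnStatPooling
>>> frame_encoder = nn.Conv1d(80, 1536, kernel_size=3, padding=1)  # hypothetical stand-in for an encoder
>>> pooling_layer = ChnAttnStatPooling(input_size=1536)
>>> feats = torch.randn(4, 80, 200)  # e.g. 80-dim filterbank features over 200 frames
>>> frame_level = frame_encoder(feats)  # (4, 1536, 200)
>>> utterance_level = pooling_layer(frame_level)  # mean and std concatenated
>>> utterance_level.shape
torch.Size([4, 3072])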
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x, task_tokens: Tensor | None = None)
Performs the forward pass of the ChnAttnStatPooling layer, aggregating
frame-level features into a single utterance-level feature representation.
This method computes a weighted combination of the input features and statistical summaries (mean and standard deviation) to produce a compact representation. It uses channel attention to emphasize important features.
- Parameters:
- x (torch.Tensor) – Input tensor of shape (batch_size, input_size, time).
- task_tokens (torch.Tensor, optional) – An optional tensor for task-specific tokens. If provided, a ValueError will be raised, as this pooling method does not support task tokens.
- Returns: Output tensor of shape (batch_size, output_size), where output_size is double the input_size.
- Return type: torch.Tensor
- Raises:
- ValueError – If task_tokens is provided, as this pooling method does not support task tokens.
######### Examples
>>> pooling_layer = ChnAttnStatPooling(input_size=1536)
>>> input_tensor = torch.randn(10, 1536, 100)  # (batch_size, input_size, time)
>>> output = pooling_layer(input_tensor)
>>> output.shape
torch.Size([10, 3072])
NOTE
The output size of this pooling layer is always double the input size, which is achieved by concatenating the mean and standard deviation of the input features.
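To make the doubling concrete, here is a minimal sketch of the weighted-statistics step, assuming attention logits of the same shape as the input. In the actual layer these logits are produced by the internal attention module rather than passed in, so this is a simplification of the real forward pass, not the ESPnet implementation:
>>> import torch
>>> def attentive_stats(x, attn_logits):
...     # Illustrative only: x and attn_logits are both (batch, channels, time).
...     w = torch.softmax(attn_logits, dim=2)            # attention weights over time
...     mu = torch.sum(x * w, dim=2)                     # weighted mean, (batch, channels)
...     var = torch.sum((x ** 2) * w, dim=2) - mu ** 2   # weighted variance
...     sg = torch.sqrt(var.clamp(min=1e-4))             # weighted std, clamped for stability
...     return torch.cat([mu, sg], dim=1)                # (batch, 2 * channels)
>>> attentive_stats(torch.randn(2, 1536, 50), torch.randn(2, 1536, 50)).shape
torch.Size([2, 3072])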
output_size()
Returns the output dimensionality of the pooling layer.
Returns: The output size, which is double the input size.
Return type: int
######### Examples
>>> pooling_layer = ChnAttnStatPooling(input_size=512)
>>> output_size = pooling_layer.output_size()
>>> print(output_size)
1024