espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling
class espnet2.spk.pooling.chn_attn_stat_pooling.ChnAttnStatPooling(input_size: int = 1536)
Bases: AbsPooling
Aggregates frame-level features to a single utterance-level feature.
This pooling method is proposed in B. Desplanques et al., “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification”.
_output_size
The output dimensionality, which is double the input size.
Type: int
attention
A sequential container that applies convolution, activation, batch normalization, and softmax layers to compute the channel attention weights.
Type: nn.Sequential
Parameters: input_size (int) – Dimensionality of the input frame-level embeddings, determined by the encoder hyperparameters. The output dimensionality will be double the input_size.
Returns: The pooled utterance-level feature.
Return type: torch.Tensor
Raises: ValueError – If task_tokens is not None, as ChannelAttentiveStatisticsPooling does not support task tokens.
######### Examples
>>> import torch
>>> from espnet2.spk.pooling.chn_attn_stat_pooling import ChnAttnStatPooling
>>> pooling_layer = ChnAttnStatPooling(input_size=1536)
>>> frame_level_features = torch.randn(10, 1536, 20)  # (batch_size, input_size, time_steps)
>>> pooled_features = pooling_layer(frame_level_features)
>>> pooled_features.shape
torch.Size([10, 3072])
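For context, the sketch below shows how the layer might sit behind a frame-level feature extractor. The Conv1d front end here is a hypothetical stand-in for an actual ESPnet encoder, which is assumed to emit features in (batch_size, channels, time) layout:
>>> import torch
>>> import torch.nn as nn
>>> from espnet2.spk.pooling.chn_attn_stat_pooling import ChnAttnStatPooling
>>> frame_encoder = nn.Conv1d(80, 1536, kernel_size=3, padding=1)  # hypothetical stand-in for an encoder
>>> pooling_layer = ChnAttnStatPooling(input_size=1536)
>>> feats = torch.randn(4, 80, 200)  # e.g. 80-dim filterbank features over 200 frames
>>> frame_level = frame_encoder(feats)  # (4, 1536, 200)
>>> utterance_level = pooling_layer(frame_level)  # mean and std concatenated
>>> utterance_level.shape
torch.Size([4, 3072])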
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x, task_tokens: Tensor | None = None)
Performs the forward pass of the ChnAttnStatPooling layer, aggregating
frame-level features into a single utterance-level feature representation.
This method computes a weighted combination of the input features and statistical summaries (mean and standard deviation) to produce a compact representation. It uses channel attention to emphasize important features.
- Parameters:
- x (torch.Tensor) – Input tensor of shape (batch_size, input_size, time).
- task_tokens (torch.Tensor, optional) – An optional tensor for task-specific tokens. If provided, a ValueError will be raised, as this pooling method does not support task tokens.
- Returns: Output tensor of shape (batch_size, output_size), where output_size is double the input_size.
- Return type: torch.Tensor
- Raises:
- ValueError – If task_tokens is provided, as this pooling method does not support task tokens.
######### Examples
>>> pooling_layer = ChnAttnStatPooling(input_size=1536)
>>> input_tensor = torch.randn(10, 1536, 100)  # (batch_size, input_size, time)
>>> output = pooling_layer(input_tensor)
>>> output.shape
torch.Size([10, 3072])
NOTE
The output size of this pooling layer is always double the input size, which is achieved by concatenating the mean and standard deviation of the input features.
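To make the doubling concrete, here is a minimal sketch of the weighted-statistics step, assuming attention logits of the same shape as the input. In the actual layer these logits are produced by the internal attention module rather than passed in, so this is a simplification of the real forward pass, not the ESPnet implementation:
>>> import torch
>>> def attentive_stats(x, attn_logits):
...     # Illustrative only: x and attn_logits are both (batch, channels, time).
...     w = torch.softmax(attn_logits, dim=2)            # attention weights over time
...     mu = torch.sum(x * w, dim=2)                     # weighted mean, (batch, channels)
...     var = torch.sum((x ** 2) * w, dim=2) - mu ** 2   # weighted variance
...     sg = torch.sqrt(var.clamp(min=1e-4))             # weighted std, clamped for stability
...     return torch.cat([mu, sg], dim=1)                # (batch, 2 * channels)
>>> attentive_stats(torch.randn(2, 1536, 50), torch.randn(2, 1536, 50)).shape
torch.Size([2, 3072])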
output_size()
Returns the output dimensionality of the pooling layer.
Returns: The output size, which is double the input size.
Return type: int
######### Examples
>>> pooling_layer = ChnAttnStatPooling(input_size=512)
>>> output_size = pooling_layer.output_size()
>>> print(output_size)
1024