espnet2.spk.pooling.stat_pooling.StatsPooling
class espnet2.spk.pooling.stat_pooling.StatsPooling(input_size: int = 1536)
Bases: AbsPooling
Aggregates frame-level features into a single utterance-level feature.
This class implements the statistical pooling method proposed in D. Snyder et al., “X-vectors: Robust DNN embeddings for speaker recognition”. It takes frame-level embeddings, computes their mean and standard deviation across the time dimension, and concatenates the two to form a single utterance-level feature.
_output_size
The dimensionality of the output feature, which is double the input size.
Type: int
Parameters: input_size (int) – Dimensionality of the input frame-level embeddings, as determined by the encoder's hyperparameters. The output dimensionality is double the input_size.
Raises: ValueError – If task_tokens is provided during the forward pass, since this pooling method does not support task-specific tokens.
Examples
>>> import torch
>>> from espnet2.spk.pooling.stat_pooling import StatsPooling
>>> pooling_layer = StatsPooling(input_size=512)
>>> frame_level_embeddings = torch.randn(4, 10, 512)  # (batch, time, features)
>>> output = pooling_layer(frame_level_embeddings)
>>> print(output.shape)  # (4, 1024), where 1024 = 2 * 512
forward(x, task_tokens: Tensor | None = None)
Parameters:
- x (torch.Tensor) – Input tensor of shape (batch_size, num_frames, input_size), where num_frames is the number of frames and input_size is the dimensionality of the input embeddings.
- task_tokens (torch.Tensor, optional) – A tensor of task tokens. If provided, a ValueError is raised, since this pooling method does not support task tokens.
Returns: Output tensor of shape (batch_size, output_size), where output_size is double the input_size. The output contains the mean and standard deviation of the input embeddings across the frame dimension.
Return type: torch.Tensor
Raises: ValueError – If task_tokens is not None, since this pooling method does not support task-specific tokens.
Examples
>>> pooling = StatsPooling(input_size=1536)
>>> input_tensor = torch.rand(10, 5, 1536)  # batch of 10, 5 frames each
>>> output = pooling.forward(input_tensor)
>>> output.shape  # second dimension is 2 * input_size
torch.Size([10, 3072])
NOTE
This method aggregates frame-level features into a single utterance-level feature by calculating the mean and standard deviation across the frame dimension.
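As a dependency-free illustration of the arithmetic described in this note, the following minimal sketch pools a single utterance in plain Python. The helper name `stats_pool` and the toy frame values are hypothetical and not part of ESPnet. The sketch assumes the sample (n - 1) standard deviation, which this page does not specify, and it is not the library's implementation.

```python
import statistics


def stats_pool(frames, task_tokens=None):
    """Pool one utterance given as a list of equal-length frame vectors."""
    # Mirror the documented behaviour: task tokens are not supported.
    if task_tokens is not None:
        raise ValueError("statistics pooling does not support task_tokens")
    # Transpose so each tuple holds one feature dimension across all frames.
    per_dim = list(zip(*frames))
    means = [statistics.fmean(dim) for dim in per_dim]
    # Sample standard deviation (n - 1 denominator): an assumption, see above.
    stds = [statistics.stdev(dim) for dim in per_dim]
    # Concatenate, so the output is twice as long as one frame.
    return means + stds


frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]  # 3 frames, input_size = 2
pooled = stats_pool(frames)
print(len(pooled))  # 4, i.e. 2 * input_size
print(pooled)       # [2.0, 10.0, 1.0, 0.0]
```

The first half of the result holds the per-dimension means and the second half the per-dimension standard deviations. The second feature is constant across frames, so its standard deviation is zero. The sketch needs at least two frames, because the sample standard deviation is undefined for a single value.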
output_size()
Returns the dimensionality of the utterance-level feature produced by this pooling layer.
The pooling method was proposed in D. Snyder et al., “X-vectors: Robust DNN embeddings for speaker recognition”.
input_size
Dimensionality of the input frame-level embeddings, as determined by the encoder's hyperparameters. For this pooling layer, the output dimensionality is double the input_size.
Type: int
Parameters: input_size – Dimensionality of the input frame-level embeddings, as passed to the constructor.
Returns: The output size, which is double the input_size.
Return type: int
Examples
>>> pooling = StatsPooling(input_size=128)
>>> pooling.output_size()
256
NOTE
The output size is fixed at double the input size for this pooling layer.