espnet2.spk.encoder.xvector_encoder.XvectorEncoder
class espnet2.spk.encoder.xvector_encoder.XvectorEncoder(input_size: int, ndim: int = 512, output_size: int = 1500, kernel_sizes: List = [5, 3, 3, 1, 1], paddings: List = [2, 1, 1, 0, 0], dilations: List = [1, 2, 3, 1, 1], **kwargs)
Bases: AbsEncoder
X-vector encoder. Extracts frame-level x-vector embeddings from features.
This class implements the X-vector model for speaker recognition as described in D. Snyder et al., “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in Proc. IEEE ICASSP, 2018. The model takes input features and processes them through a series of convolutional layers to produce frame-level speaker embeddings.
layers
A list of convolutional layers, ReLU activations, and batch normalization layers.
- Type: nn.ModuleList
_output_size
The output embedding dimension.
Type: int
Parameters:
- input_size (int) – Input feature dimension.
- ndim (int, optional) – Dimensionality of the hidden representation. Defaults to 512.
- output_size (int, optional) – Output embedding dimension. Defaults to 1500.
- kernel_sizes (List, optional) – List of kernel sizes for each convolutional layer. Defaults to [5, 3, 3, 1, 1].
- paddings (List, optional) – List of padding sizes for each convolutional layer. Defaults to [2, 1, 1, 0, 0].
- dilations (List, optional) – List of dilation rates for each convolutional layer. Defaults to [1, 2, 3, 1, 1].
- **kwargs – Additional keyword arguments.
######### Examples
>>> import torch
>>> encoder = XvectorEncoder(input_size=40)
>>> input_tensor = torch.randn(10, 100, 40) # (Batch, Sequence, Features)
>>> output = encoder(input_tensor)
>>> print(output.shape) # Output shape will depend on the configuration
NOTE
This implementation is adapted for ESPnet-SPK by Jee-weon Jung, and cross-checked with the SpeechBrain implementation: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/lobes/models/Xvector.py
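The layer stack described above (Conv1d, ReLU, BatchNorm1d repeated with the documented default kernel sizes, paddings, and dilations) can be sketched as a minimal stand-alone module. This is an illustrative reconstruction from the parameter table, not the ESPnet source; the class name `TinyXvectorEncoder` is made up for the example.

```python
import torch
import torch.nn as nn

# Minimal sketch of the frame-level x-vector (TDNN) encoder documented above.
# Hyperparameters mirror the documented defaults; the real ESPnet class may
# differ in detail.
class TinyXvectorEncoder(nn.Module):
    def __init__(self, input_size, ndim=512, output_size=1500,
                 kernel_sizes=(5, 3, 3, 1, 1),
                 paddings=(2, 1, 1, 0, 0),
                 dilations=(1, 2, 3, 1, 1)):
        super().__init__()
        in_channels = [input_size] + [ndim] * 4
        out_channels = [ndim] * 4 + [output_size]
        layers = []
        for i in range(5):
            layers += [
                nn.Conv1d(in_channels[i], out_channels[i],
                          kernel_size=kernel_sizes[i],
                          padding=paddings[i],
                          dilation=dilations[i]),
                nn.ReLU(),
                nn.BatchNorm1d(out_channels[i]),
            ]
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        x = x.transpose(1, 2)  # (B, S, D) -> (B, D, S) for Conv1d
        for layer in self.layers:
            x = layer(x)
        return x  # (B, output_size, new_S)

enc = TinyXvectorEncoder(input_size=40)
out = enc(torch.randn(10, 100, 40))
print(out.shape)  # channel dim is output_size; time dim shrinks with dilation
```

Note the transpose before the convolutions: `nn.Conv1d` expects channels in dimension 1, while the encoder accepts `(B, S, D)` input.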
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Forward pass of the X-vector encoder.
This method processes the input tensor through the layers of the encoder, transforming the input features into frame-level x-vector embeddings. The input tensor should have the shape (B, S, D), where B is the batch size, S is the sequence length, and D is the feature dimension. The output will have the shape (B, output_size, new_S), where new_S is determined by the convolutional layers.
- Parameters:x (torch.Tensor) – Input tensor of shape (B, S, D).
- Returns: Output tensor after passing through the encoder layers, with shape (B, output_size, new_S).
- Return type: torch.Tensor
######### Examples
>>> import torch
>>> encoder = XvectorEncoder(input_size=40)
>>> input_tensor = torch.randn(10, 100, 40) # (B, S, D)
>>> output_tensor = encoder.forward(input_tensor)
>>> print(output_tensor.shape)
torch.Size([10, 1500, new_S]) # new_S depends on the kernel/padding/dilation settings
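The `new_S` in the example above follows the standard stride-1 Conv1d output-length formula, `L_out = L + 2*padding - dilation*(kernel - 1)`, applied once per layer. The helper below is illustrative, not part of the ESPnet API:

```python
# For a stride-1 Conv1d: output length = L + 2*padding - dilation*(kernel - 1).
# Applying the documented default settings in order shows how a 100-frame
# input shrinks layer by layer (helper function is hypothetical).
def conv_out_length(length, kernel, padding, dilation):
    return length + 2 * padding - dilation * (kernel - 1)

kernel_sizes = [5, 3, 3, 1, 1]
paddings = [2, 1, 1, 0, 0]
dilations = [1, 2, 3, 1, 1]

new_s = 100
for k, p, d in zip(kernel_sizes, paddings, dilations):
    new_s = conv_out_length(new_s, k, p, d)

print(new_s)  # 94: the dilated layers 2 and 3 each trim frames at the edges
```

With the defaults, only the second and third layers (dilations 2 and 3 with kernel 3 and padding 1) shorten the sequence; the other layers preserve its length.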
output_size() → int
Returns the output embedding dimension of the encoder.
- Returns: The output embedding dimension.
- Return type: int
######### Examples
>>> encoder = XvectorEncoder(input_size=40, ndim=512, output_size=1500)
>>> print(encoder.output_size())
1500