espnet2.spk.encoder.rawnet3_encoder.RawNet3Encoder


class espnet2.spk.encoder.rawnet3_encoder.RawNet3Encoder(input_size: int, block: str = 'Bottle2neck', model_scale: int = 8, ndim: int = 1024, output_size: int = 1536, **kwargs)

Bases: AbsEncoder

RawNet3 encoder. Extracts frame-level RawNet embeddings from raw waveform.

This encoder is designed for speaker recognition tasks and is based on the architecture presented in the paper by J. Jung et al., “Pushing the limits of raw waveform speaker recognition”, in Proc. INTERSPEECH, 2022.

_output_size

The output embedding dimension.

Type: int

Parameters:
- input_size (int) – Input feature dimension.
- block (str, optional) – Type of encoder block class to use. Default is “Bottle2neck”.
- model_scale (int, optional) – Scale value of the Res2Net architecture. Default is 8.
- ndim (int, optional) – Dimensionality of the hidden representation. Default is 1024.
- output_size (int, optional) – Output embedding dimension. Default is 1536.

Examples

>>> import torch
>>> from espnet2.spk.encoder.rawnet3_encoder import RawNet3Encoder
>>> encoder = RawNet3Encoder(input_size=128)
>>> features = torch.randn(1, 128, 16000)  # (batch_size, input_size, sequence_length)
>>> embeddings = encoder(features)
>>> print(embeddings.shape)
torch.Size([1, 1536, <sequence_length>])  # Output sequence length may vary

Raises: ValueError – If an unsupported block type is provided.
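The block argument is a string that selects the encoder block class, and an unrecognized name raises ValueError. A minimal sketch of that pattern in plain Python follows. The registry `_BLOCKS` and the helper `resolve_block` are hypothetical names used for illustration only and are not part of ESPnet; the placeholder class stands in for the real block implementation.

```python
class Bottle2neck:
    """Placeholder standing in for the real Res2Net-style block class."""


# Hypothetical registry mapping block names to classes (an assumption for illustration).
_BLOCKS = {"Bottle2neck": Bottle2neck}


def resolve_block(name: str) -> type:
    """Return the block class registered under `name`, or raise ValueError."""
    try:
        return _BLOCKS[name]
    except KeyError:
        raise ValueError(f"unsupported block, got: {name}") from None


print(resolve_block("Bottle2neck").__name__)
try:
    resolve_block("ResBlock")  # not in the registry, so this is rejected
except ValueError as err:
    print(err)
```

Validating the name up front means a misspelled block type fails at construction time rather than later, during the forward pass.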

NOTE

The encoder expects a 3D tensor as input with the shape (batch_size, input_size, sequence_length).

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x: Tensor)

Perform forward propagation through the RawNet3 encoder.

This method takes a batch of input tensors, processes them through several layers of the encoder, and outputs the frame-level embeddings.

Parameters: x (torch.Tensor) – Input tensor of shape (batch_size, input_size, seq_len).
Returns: Output tensor of shape (batch_size, output_size, seq_len).
Return type: torch.Tensor

Examples

>>> import torch
>>> from espnet2.spk.encoder.rawnet3_encoder import RawNet3Encoder
>>> encoder = RawNet3Encoder(input_size=128)
>>> input_tensor = torch.randn(10, 128, 16000)  # (batch_size, input_size, seq_len)
>>> output_tensor = encoder.forward(input_tensor)
>>> print(output_tensor.shape)  # Should print (10, 1536, seq_len)

NOTE

Ensure that the input is a 3D tensor of shape (batch_size, input_size, seq_len) and is normalized as the model expects.

Raises: ValueError – If the input tensor does not have the expected shape.
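The shape requirement above can be checked before calling forward. The sketch below is a hypothetical helper, `check_encoder_input`, which is not part of ESPnet. It assumes the (batch_size, input_size, seq_len) layout stated in this documentation and works on a plain shape tuple, so it runs without torch; with a real tensor, `tuple(x.shape)` could be passed instead.

```python
from typing import Sequence


def check_encoder_input(shape: Sequence[int], input_size: int) -> None:
    """Validate a (batch_size, input_size, seq_len) shape; raise ValueError otherwise."""
    if len(shape) != 3:
        raise ValueError(
            f"expected a 3D input, got {len(shape)}D shape {tuple(shape)}"
        )
    if shape[1] != input_size:
        raise ValueError(
            f"expected dimension 1 to equal input_size={input_size}, got {shape[1]}"
        )


check_encoder_input((10, 128, 16000), input_size=128)  # passes silently
try:
    check_encoder_input((1, 16000), input_size=128)  # 2D input is rejected
except ValueError as err:
    print(err)
```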

output_size() → int

Return the output embedding dimension of the encoder, i.e. the value stored in _output_size (1536 by default).

Return type: int

Examples

>>> encoder = RawNet3Encoder(input_size=128)
>>> encoder.output_size()
1536
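The relationship between the _output_size attribute and the output_size() method can be sketched in plain Python. `ToyEncoder` is a hypothetical stand-in, not the ESPnet class: it only mirrors the pattern of storing the dimension at construction and returning it from an accessor, on the assumption that the real encoder does the same.

```python
class ToyEncoder:
    """Hypothetical stand-in; only mirrors the _output_size / output_size() pairing."""

    def __init__(self, input_size: int, output_size: int = 1536) -> None:
        self.input_size = input_size
        self._output_size = output_size  # stored once at construction

    def output_size(self) -> int:
        """Return the output embedding dimension."""
        return self._output_size


encoder = ToyEncoder(input_size=128)
print(encoder.output_size())  # 1536, the documented default
```

Code that consumes the embeddings can query output_size() instead of hard-coding 1536, so it keeps working if a non-default output_size is passed to the constructor.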