espnet2.spk.encoder.rawnet3_encoder.RawNet3Encoder
class espnet2.spk.encoder.rawnet3_encoder.RawNet3Encoder(input_size: int, block: str = 'Bottle2neck', model_scale: int = 8, ndim: int = 1024, output_size: int = 1536, **kwargs)
Bases: AbsEncoder
RawNet3 encoder. Extracts frame-level RawNet embeddings from raw waveform.
This encoder is designed for speaker recognition tasks and is based on the architecture presented in the paper by J. Jung et al., “Pushing the limits of raw waveform speaker recognition”, in Proc. INTERSPEECH, 2022.
_output_size
The output embedding dimension.
Type: int
Parameters:
- input_size (int) – Input feature dimension.
- block (str, optional) – Type of encoder block class to use. Default is “Bottle2neck”.
- model_scale (int, optional) – Scale value of the Res2Net architecture. Default is 8.
- ndim (int, optional) – Dimensionality of the hidden representation. Default is 1024.
- output_size (int, optional) – Output embedding dimension. Default is 1536.
Examples
>>> import torch
>>> from espnet2.spk.encoder.rawnet3_encoder import RawNet3Encoder
>>> encoder = RawNet3Encoder(input_size=128)
>>> features = torch.randn(1, 128, 100)  # (batch_size, input_size, sequence_length)
>>> embeddings = encoder(features)
>>> print(embeddings.shape)  # (1, 1536, <sequence_length>); the exact length may vary
- Raises: ValueError – If an unsupported block type is provided.
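The block argument is a string that selects the block class, and an unsupported name raises ValueError. A minimal sketch of that kind of name-based lookup follows. `Bottle2neckStub`, `SUPPORTED_BLOCKS`, and `resolve_block` are hypothetical stand-ins for illustration, not ESPnet's actual code, and the error message wording is an assumption.

```python
class Bottle2neckStub:
    """Hypothetical placeholder for the real Bottle2neck block class."""


# Hypothetical lookup table; only the documented default name is listed.
SUPPORTED_BLOCKS = {"Bottle2neck": Bottle2neckStub}


def resolve_block(block: str) -> type:
    """Map a block name to its class, raising ValueError for unknown names."""
    try:
        return SUPPORTED_BLOCKS[block]
    except KeyError:
        # Assumed message format, for illustration only.
        raise ValueError(f"unsupported block, got: {block}") from None


print(resolve_block("Bottle2neck").__name__)  # Bottle2neckStub
```

Resolving the name once, at construction time, means a misspelled block type fails immediately instead of surfacing later during the forward pass.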
NOTE
The encoder expects a 3D tensor as input with the shape (batch_size, input_size, sequence_length).
forward(x: Tensor)
Perform forward propagation through the RawNet3 encoder.
This method takes a batch of input tensors, processes them through several layers of the encoder, and outputs the frame-level embeddings.
- Parameters: x (torch.Tensor) – Input tensor of shape (batch_size, input_size, seq_len).
- Returns: Frame-level embeddings of shape (batch_size, output_size, out_seq_len). The output length out_seq_len can differ from the input seq_len.
- Return type: torch.Tensor
Examples
>>> encoder = RawNet3Encoder(input_size=128)
>>> input_tensor = torch.randn(10, 128, 16000)  # (batch_size, input_size, seq_len)
>>> output_tensor = encoder(input_tensor)  # calling the module invokes forward()
>>> print(output_tensor.shape)  # (10, 1536, <output length>)
NOTE
The input must be a 3D tensor of shape (batch_size, input_size, seq_len), with any normalization the model expects already applied.
- Raises: ValueError – If the input tensor does not have the expected shape.
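The shape requirement above can be checked before calling the encoder. The sketch below is a hypothetical helper (`check_input_shape` is not part of ESPnet) that works on a plain shape tuple, assuming the (batch_size, input_size, seq_len) layout stated in this section.

```python
from typing import Sequence


def check_input_shape(shape: Sequence[int], input_size: int) -> None:
    """Raise ValueError unless shape is (batch_size, input_size, seq_len)."""
    if len(shape) != 3:
        raise ValueError(f"expected a 3D input, got {len(shape)}D: {tuple(shape)}")
    if shape[1] != input_size:
        raise ValueError(
            f"expected dimension 1 to equal input_size={input_size}, got {shape[1]}"
        )


check_input_shape((10, 128, 16000), input_size=128)  # passes silently
```

For a torch.Tensor `x`, the same check could be called as `check_input_shape(tuple(x.shape), input_size)`.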
output_size() → int
Return the output embedding dimension, i.e. the value stored in _output_size (1536 by default).
- Return type: int
Examples
>>> encoder = RawNet3Encoder(input_size=128)
>>> output = encoder.forward(torch.randn(1, 128, 100))
>>> print(output.shape)
torch.Size([1, 1536, 98])
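Because output_size() reports the embedding dimension, downstream components can be sized from it instead of hard-coding 1536. The sketch below uses a hypothetical stand-in class, `EncoderStub`, that mirrors only the documented `_output_size` attribute and `output_size()` method. It is not the real encoder, and `downstream_input_dim` is likewise an illustrative name.

```python
class EncoderStub:
    """Hypothetical stand-in exposing only the documented size interface."""

    def __init__(self, output_size: int = 1536):
        self._output_size = output_size

    def output_size(self) -> int:
        return self._output_size


def downstream_input_dim(encoder) -> int:
    """Return the channel count a layer consuming the embeddings should expect."""
    return encoder.output_size()


print(downstream_input_dim(EncoderStub()))     # 1536
print(downstream_input_dim(EncoderStub(512)))  # 512
```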