espnet2.asvspoof.espnet_model.ESPnetASVSpoofModel
class espnet2.asvspoof.espnet_model.ESPnetASVSpoofModel(frontend: AbsFrontend | None, specaug: AbsSpecAug | None, normalize: AbsNormalize | None, encoder: AbsEncoder, preencoder: AbsPreEncoder | None, decoder: AbsDecoder, losses: Dict[str, AbsASVSpoofLoss])
Bases: AbsESPnetModel
ASV Spoofing Model for Audio Signal Verification
This class implements a model for Automatic Speaker Verification (ASV) Spoofing detection. The model processes audio input through a series of components including a frontend, encoder, decoder, and loss calculation mechanisms.
preencoder
An optional pre-encoder for raw input data.
- Type: Optional[AbsPreEncoder]
encoder
The encoder component that processes features.
- Type: AbsEncoder
normalize
An optional normalization layer for feature scaling.
- Type: Optional[AbsNormalize]
frontend
An optional frontend for feature extraction.
- Type: Optional[AbsFrontend]
specaug
An optional SpecAugment (spectrogram augmentation) component.
- Type: Optional[AbsSpecAug]
decoder
The decoder component that predicts outcomes based on encoded features.
- Type: AbsDecoder
losses
A dictionary containing various loss functions for training.
- Type: Dict[str, AbsASVSpoofLoss]
Parameters:
- frontend (Optional[AbsFrontend]) – An optional frontend for feature extraction.
- specaug (Optional[AbsSpecAug]) – An optional SpecAugment (spectrogram augmentation) component.
- normalize (Optional[AbsNormalize]) – An optional normalization layer for feature scaling.
- encoder (AbsEncoder) – The encoder component that processes features.
- preencoder (Optional[AbsPreEncoder]) – An optional pre-encoder for raw input data.
- decoder (AbsDecoder) – The decoder component that predicts outcomes based on encoded features.
- losses (Dict[str, AbsASVSpoofLoss]) – A dictionary containing various loss functions for training.
Returns: None
#### Examples
>>> model = ESPnetASVSpoofModel(frontend=None, specaug=None, normalize=None,
... encoder=my_encoder, preencoder=None,
... decoder=my_decoder, losses=my_losses)
>>> speech_tensor = torch.randn(2, 16000)  # Example input
>>> labels = torch.randint(0, 2, (2,))     # Example binary labels
>>> loss, stats, weight = model.forward(speech_tensor, label=labels)
#### NOTE
Ensure that the input audio tensor is correctly shaped and matches the expected dimensions for processing.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
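The component order described above (frontend → specaug → normalize → preencoder → encoder → decoder → losses) can be sketched with minimal stand-in modules. The `Dummy*` classes below are hypothetical placeholders that only mimic the tensor shapes, not the real ESPnet `Abs*` implementations:

```python
import torch
import torch.nn as nn

class DummyFrontend(nn.Module):
    """Maps raw waveforms (Batch, Samples) to features (Batch, NFrames, Dim)."""
    def forward(self, speech, speech_lengths):
        # Frame the waveform with a 400-sample window and 160-sample hop,
        # then fake an 80-dimensional feature per frame.
        frames = speech.unfold(1, 400, 160)          # (B, NFrames, 400)
        feats = frames.mean(-1, keepdim=True).repeat(1, 1, 80)
        feats_lengths = (speech_lengths - 400) // 160 + 1
        return feats, feats_lengths

class DummyEncoder(nn.Module):
    def __init__(self, in_dim=80, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, feats, feats_lengths):
        return self.proj(feats), feats_lengths

class DummyDecoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.head = nn.Linear(dim, 2)  # bonafide vs. spoof logits
    def forward(self, enc_out, enc_out_lens):
        # Utterance-level mean pooling before classification.
        return self.head(enc_out.mean(dim=1))

speech = torch.randn(4, 16000)
lengths = torch.full((4,), 16000)
feats, feats_lens = DummyFrontend()(speech, lengths)
enc_out, enc_lens = DummyEncoder()(feats, feats_lens)
logits = DummyDecoder()(enc_out, enc_lens)
print(logits.shape)  # torch.Size([4, 2])
```

This illustrates only the data flow; the real model additionally applies the optional specaug, normalize, and preencoder stages between frontend and encoder.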
collect_feats(speech: Tensor, speech_lengths: Tensor, **kwargs) → Dict[str, Tensor]
Extracts features from the input speech tensor.
This method processes the input speech tensor and its corresponding lengths to extract features using the model’s frontend. It returns a dictionary containing the extracted features and their lengths.
- Parameters:
- speech – A tensor of shape (Batch, Samples) representing the input speech signals.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of each speech signal in the batch.
- kwargs – Additional keyword arguments for future extensibility.
- Returns: A dictionary with the following keys:
  - 'feats': A tensor containing the extracted features of shape (Batch, NFrames, Dim).
  - 'feats_lengths': A tensor of shape (Batch,) containing the lengths of the extracted features.
- Return type: Dict[str, Tensor]
#### Examples
>>> model = ESPnetASVSpoofModel(...)
>>> speech = torch.randn(32, 16000) # Batch of 32 audio samples
>>> speech_lengths = torch.tensor([16000] * 32) # All samples are 1 sec
>>> features = model.collect_feats(speech, speech_lengths)
>>> print(features['feats'].shape) # Expected output: (32, NFrames, Dim)
#### NOTE
Ensure that the input tensors are correctly shaped to avoid assertion errors during processing.
- Raises:
  - AssertionError – If the input speech lengths do not have the correct dimension.
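The documented contract of `collect_feats` — `(Batch, Samples)` in, a `feats`/`feats_lengths` dictionary out — can be sketched in isolation. The function below is a hypothetical stand-in (the window, hop, and dimension values are illustrative, not the model's actual frontend settings):

```python
import torch

def collect_feats_sketch(speech: torch.Tensor, speech_lengths: torch.Tensor,
                         win: int = 400, hop: int = 160, dim: int = 80):
    """Hypothetical sketch of the collect_feats input/output contract."""
    assert speech_lengths.dim() == 1, speech_lengths.shape   # (Batch,)
    frames = speech.unfold(1, win, hop)                      # (B, NFrames, win)
    # Fake a per-frame feature vector of size `dim`.
    feats = frames.abs().mean(-1, keepdim=True).expand(-1, -1, dim)
    feats_lengths = (speech_lengths - win) // hop + 1
    return {"feats": feats, "feats_lengths": feats_lengths}

out = collect_feats_sketch(torch.randn(32, 16000), torch.full((32,), 16000))
print(out["feats"].shape)   # torch.Size([32, 98, 80])
```

Passing a `speech_lengths` tensor of the wrong rank (e.g. shape `(Batch, 1)`) trips the assertion, which mirrors the AssertionError documented above.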
encode(speech: Tensor, speech_lengths: Tensor) → Tuple[Tensor, Tensor]
Processes the input speech through the frontend and encoder.
This method extracts features from the input speech using the specified frontend and applies data augmentation, normalization, and pre-encoding steps if applicable. Finally, it forwards the processed features through the encoder to obtain the encoded outputs.
- Parameters:
- speech – A tensor of shape (Batch, Length, …), representing the input speech waveforms.
- speech_lengths – A tensor of shape (Batch,), indicating the lengths of the input speech signals.
- Returns: A tuple containing:
  - encoder_out: A tensor of shape (Batch, Length2, Dim), where Length2 is the output length after encoding.
  - encoder_out_lens: A tensor of shape (Batch,) that contains the lengths of the encoded outputs.
- Return type: Tuple[Tensor, Tensor]
- Raises:
  - AssertionError – If the output sizes do not match the expected dimensions.
#### Examples
>>> model = ESPnetASVSpoofModel(...)
>>> speech = torch.randn(8, 16000) # 8 samples of 1 second
>>> speech_lengths = torch.tensor([16000] * 8)
>>> encoder_out, encoder_out_lens = model.encode(speech, speech_lengths)
>>> print(encoder_out.shape) # Should be (8, Length2, Dim)
>>> print(encoder_out_lens.shape) # Should be (8,)
#### NOTE
This method assumes that the model is initialized with appropriate frontend and encoder components.
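The control flow of `encode` — each optional stage applied only when the corresponding component was supplied — can be sketched as plain Python. This is a hypothetical outline of the documented behavior, not the actual ESPnet source:

```python
import torch
import torch.nn as nn

def encode_sketch(speech, speech_lengths, frontend=None, specaug=None,
                  normalize=None, preencoder=None, encoder=None,
                  training=True):
    """Hypothetical sketch of encode(): optional stages are skipped if None."""
    if frontend is not None:
        feats, feats_lengths = frontend(speech, speech_lengths)
    else:
        feats, feats_lengths = speech, speech_lengths
    if specaug is not None and training:      # augmentation only in training
        feats, feats_lengths = specaug(feats, feats_lengths)
    if normalize is not None:
        feats, feats_lengths = normalize(feats, feats_lengths)
    if preencoder is not None:
        feats, feats_lengths = preencoder(feats, feats_lengths)
    return encoder(feats, feats_lengths)

# With all optional stages absent, pre-computed features pass straight
# through to the encoder.
linear = nn.Linear(80, 128)
encoder = lambda x, lens: (linear(x), lens)
feats = torch.randn(8, 100, 80)
out, out_lens = encode_sketch(feats, torch.full((8,), 100), encoder=encoder)
print(out.shape)  # torch.Size([8, 100, 128])
```

Note that specaug is conditioned on training mode as well as on being present, which is why inference runs are unaffected by augmentation settings.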
forward(speech: Tensor, speech_lengths: Tensor | None = None, label: Tensor | None = None, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Processes input speech through the model’s components and computes the loss.
This method combines the frontend, encoder, and decoder components of the ASV Spoofing model to produce predictions and calculate the associated loss. The output includes the computed loss, statistics, and batch size weight.
- Parameters:
- speech (torch.Tensor) – A tensor containing the speech data with shape (Batch, Samples).
- speech_lengths (torch.Tensor, optional) – A tensor representing the lengths of each speech sample in the batch. If not provided, defaults to None.
- label (torch.Tensor, optional) – A tensor containing the target labels for the speech data with shape (Batch,). This is used for loss computation.
- **kwargs – Additional keyword arguments; "utt_id" may be among the inputs.
- Returns: A tuple containing:
- loss (torch.Tensor): The computed loss for the current batch.
- stats (Dict[str, torch.Tensor]): A dictionary with statistical metrics, including loss and accuracy.
- weight (torch.Tensor): The weight of the current batch.
- Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
- Raises:
  - AssertionError – If the batch size of speech does not match the batch size of the label.
#### Examples
>>> model = ESPnetASVSpoofModel(...)
>>> speech_data = torch.randn(32, 16000) # Example speech data
>>> labels = torch.randint(0, 2, (32,)) # Example binary labels
>>> loss, stats, weight = model.forward(speech_data, label=labels)
#### NOTE
Ensure that the label tensor is provided during training to compute the loss.
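How decoder logits and labels might be combined into the documented `(loss, stats, weight)` triple can be sketched as follows. This is a hypothetical illustration using a plain cross-entropy loss; the real model uses the `AbsASVSpoofLoss` modules passed in the `losses` dictionary:

```python
import torch
import torch.nn.functional as F

def loss_stats_weight(logits: torch.Tensor, label: torch.Tensor):
    """Hypothetical sketch of forward()'s (loss, stats, weight) contract."""
    # Batch sizes of predictions and labels must match, mirroring the
    # documented AssertionError.
    assert logits.size(0) == label.size(0)
    loss = F.cross_entropy(logits, label)
    acc = (logits.argmax(dim=-1) == label).float().mean()
    stats = {"loss": loss.detach(), "acc": acc}
    weight = torch.tensor(float(logits.size(0)))    # batch size as weight
    return loss, stats, weight

logits = torch.randn(32, 2)                 # decoder output: 2-class logits
labels = torch.randint(0, 2, (32,))         # binary bonafide/spoof labels
loss, stats, weight = loss_stats_weight(logits, labels)
print(weight)  # tensor(32.)
```

The `weight` (batch size) lets a trainer compute a correctly weighted average of `stats` across batches of different sizes.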