espnet2.spk.espnet_model.ESPnetSpeakerModel
class espnet2.spk.espnet_model.ESPnetSpeakerModel(frontend: AbsFrontend | None, specaug: AbsSpecAug | None, normalize: AbsNormalize | None, encoder: AbsEncoder | None, pooling: AbsPooling | None, projector: AbsProjector | None, loss: AbsLoss | None)
Bases: AbsESPnetModel
Speaker embedding extraction model.
Core model for diverse speaker-related tasks (e.g., verification, open-set identification, diarization).
The model architecture mainly comprises an ‘encoder’, a ‘pooling’ module, and a ‘projector’. In the speaker recognition literature, the combination of the three is usually called a ‘speaker_encoder’ (or speaker embedding extractor). We split it into three modules for flexibility in future extensions:
- ‘encoder’ : Extracts frame-level speaker embeddings.
- ‘pooling’ : Aggregates frame-level embeddings into a single utterance-level embedding.
- ‘projector’ : Applies (optional) additional processing (e.g., one fully-connected layer) to derive the final speaker embedding.
In the future, ‘pooling’ and/or ‘projector’ may be integrated into a ‘decoder’, depending on extensions for joint use with other tasks (e.g., ASR, SE, target speaker extraction).
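These stages can also be invoked step by step through the public methods documented below; the following sketch is roughly what forward(..., extract_embd=True) performs internally (variable names are illustrative):
>>> feats, feat_lengths = model.extract_feats(speech, speech_lengths)
>>> frame_level_feats = model.encode_frame(feats)
>>> utt_level_feat = model.aggregate(frame_level_feats)
>>> spk_embd = model.project_spk_embd(utt_level_feat)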
frontend
The frontend processing module.
- Type: Optional[AbsFrontend]
specaug
The spec augmentation module.
- Type: Optional[AbsSpecAug]
normalize
The normalization module.
- Type: Optional[AbsNormalize]
encoder
The encoder module.
- Type: Optional[AbsEncoder]
pooling
The pooling module.
- Type: Optional[AbsPooling]
projector
The projector module.
- Type: Optional[AbsProjector]
loss
The loss function used during training.
- Type: Optional[AbsLoss]
Parameters:
- frontend – Frontend processing module.
- specaug – Spec augmentation module.
- normalize – Normalization module.
- encoder – Encoder module.
- pooling – Pooling module.
- projector – Projector module.
- loss – Loss function.
Examples:
>>> model = ESPnetSpeakerModel(frontend=None, specaug=None,
... normalize=None, encoder=my_encoder,
... pooling=my_pooling, projector=my_projector,
... loss=my_loss)
>>> speech = torch.randn(10, 16000) # Batch of 10 audio samples
>>> spk_labels = torch.randint(0, 10, (10,)) # Random speaker labels
>>> loss, stats, weight = model.forward(speech, spk_labels=spk_labels)
NOTE: Ensure that the appropriate modules are provided to the constructor for correct functioning.
- Raises: AssertionError – If the dimensions of the input tensors do not match the expected shapes.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
aggregate(frame_level_feats: Tensor) → Tensor
Aggregate frame-level features into utterance-level features.
This method processes a batch of frame-level features, aggregating them to produce a single utterance-level feature representation. It uses the configured pooling module to perform this operation.
- Parameters: frame_level_feats – A tensor of shape (Batch, Frame, Features) containing the frame-level features to be aggregated.
- Returns: A tensor of shape (Batch, Features) representing the aggregated utterance-level features.
Examples:
>>> model = ESPnetSpeakerModel(...)
>>> frame_level_feats = torch.randn(32, 100, 64) # Batch of 32
>>> utt_level_feat = model.aggregate(frame_level_feats)
>>> print(utt_level_feat.shape) # (Batch, output_dim); output_dim depends on the pooling module
collect_feats(speech: Tensor, speech_lengths: Tensor, spk_labels: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Collects features from the input speech tensor.
This method extracts the features from the input speech signal and returns them in a dictionary format. It leverages the extract_feats method to process the speech input, applying any necessary transformations such as augmentation and normalization.
- Parameters:
- speech – A tensor containing the speech data of shape (Batch, samples).
- speech_lengths – A tensor indicating the lengths of each speech sample in the batch of shape (Batch,).
- spk_labels – (Optional) A tensor of integer speaker labels (class indices) used during training, of shape (Batch,).
- Returns: A dictionary with the key “feats” mapping to the processed feature tensor.
- Return type: Dict[str, Tensor]
Examples:
>>> model = ESPnetSpeakerModel(...)
>>> speech_tensor = torch.randn(2, 16000) # Example batch of speech
>>> lengths = torch.tensor([16000, 16000]) # Example lengths
>>> features = model.collect_feats(speech_tensor, lengths)
>>> print(features["feats"].shape) # Check the shape of extracted features
encode_frame(feats: Tensor) → Tensor
Encode frame-level features from the input features using the encoder.
This method processes the input features to extract frame-level speaker embeddings. It utilizes the encoder component of the model to achieve this.
- Parameters: feats – A tensor of shape (Batch, Features, Time) representing the input features from which frame-level embeddings are to be extracted.
- Returns: A tensor of shape (Batch, Frame_Features, Time) containing the frame-level speaker embeddings extracted from the input features.
Examples:
>>> model = ESPnetSpeakerModel(...)
>>> input_feats = torch.randn(32, 40, 100) # Batch of 32, 40 features, 100 time steps
>>> frame_level_feats = model.encode_frame(input_feats)
>>> print(frame_level_feats.shape) # Output shape: (32, Frame_Features, 100)
NOTE: Ensure that the input tensor is properly shaped according to the model’s requirements for the encoder.
extract_feats(speech: Tensor, speech_lengths: Tensor) → Tuple[Tensor, Tensor]
Extract features from input speech tensor.
This method processes the input speech signal to extract features using the defined frontend, applies any specified augmentations, and normalizes the resulting features. It is a critical step in the speaker embedding extraction process.
- Parameters:
- speech – A tensor of shape (Batch, Samples) representing the input speech waveforms.
- speech_lengths – A tensor of shape (Batch,) indicating the lengths of each speech signal in the batch. If None, it assumes that all signals are of equal length.
- Returns: A tuple (feats, feat_lengths):
  - feats: A tensor of extracted features.
  - feat_lengths: A tensor indicating the lengths of the extracted features for each sample in the batch; None if the frontend is not defined.
- Return type: Tuple[Tensor, Tensor]
Examples:
>>> model = ESPnetSpeakerModel(...)
>>> speech_tensor = torch.randn(8, 16000) # 8 samples of 1 second audio
>>> lengths = torch.tensor([16000] * 8) # All samples are 1 second long
>>> feats, feat_lengths = model.extract_feats(speech_tensor, lengths)
NOTE: The method first checks whether a frontend is defined. If so, it uses it to extract features; otherwise, the raw speech signal is returned as the features. Augmentation (e.g., SpecAug) is applied only when the module is defined and the model is in training mode; normalization is applied whenever its module is defined.
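A minimal sketch of that control flow, assuming the module attributes mirror the constructor arguments (simplified illustration, not the exact implementation):
>>> def extract_feats_sketch(model, speech, speech_lengths):
...     # 1. Low-level feature extraction (no-op for raw-waveform models)
...     if model.frontend is not None:
...         feats, feat_lengths = model.frontend(speech, speech_lengths)
...     else:
...         feats, feat_lengths = speech, None
...     # 2. Augmentation (e.g., SpecAug), training mode only
...     if model.specaug is not None and model.training:
...         feats, _ = model.specaug(feats)
...     # 3. Normalization, applied whenever the module is defined
...     if model.normalize is not None:
...         feats, _ = model.normalize(feats, feat_lengths)
...     return feats, feat_lengths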
forward(speech: Tensor, spk_labels: Tensor | None = None, task_tokens: Tensor | None = None, extract_embd: bool = False, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor] | Tensor
Feed-forward through encoder layers and aggregate into utterance-level feature.
This method processes the input speech tensor through the model’s components, extracting frame-level features, aggregating them into a single utterance-level feature, and optionally computing the loss based on provided speaker labels. If extract_embd is set to True, it returns the speaker embedding directly without computing the loss.
- Parameters:
- speech – A tensor of shape (Batch, samples) representing the input speech signals.
- spk_labels – A tensor of shape (Batch,) containing integer speaker labels (class indices) used during training. If provided, the loss will be calculated.
- task_tokens – A tensor of shape (Batch,) used for token-based training, indicating the task for each input.
- extract_embd – A boolean flag indicating whether to return the speaker embedding directly without going through the classification head. Defaults to False.
- **kwargs – Additional keyword arguments for future extensions.
- Returns: If extract_embd is True, returns the speaker embedding tensor. Otherwise, returns a tuple containing:
- loss: A tensor representing the computed loss.
- stats: A dictionary containing statistics (e.g., loss).
- weight: The batch size.
- Raises:
  - AssertionError – If spk_labels is provided but its shape does not match the batch size of speech.
  - AssertionError – If task_tokens is provided but its shape does not match the batch size of speech.
  - AssertionError – If spk_labels is None when the loss is to be calculated.
Examples:
>>> model = ESPnetSpeakerModel(...)
>>> speech_input = torch.randn(32, 16000) # 32 samples of 1 second
>>> speaker_labels = torch.randint(0, 10, (32,)) # Random labels
>>> loss, stats, weight = model.forward(speech_input, speaker_labels)
>>> spk_embd = model.forward(speech_input, extract_embd=True)
NOTE: This method is designed to be called during both training and inference, with behavior changing based on the provided arguments.
project_spk_embd(utt_level_feat: Tensor) → Tensor
Project the utterance-level feature into the final speaker embedding.
If a projector module is configured, it is applied to the utterance-level feature (e.g., one fully-connected layer); otherwise, the input is returned unchanged.
- Parameters: utt_level_feat – A tensor of shape (Batch, Features) containing the aggregated utterance-level feature.
- Returns: A tensor of shape (Batch, Embedding_Dim) containing the final speaker embedding.
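Examples:
A brief usage sketch (the input feature dimension and the resulting embedding dimension are illustrative and depend on the configured pooling and projector modules):
>>> model = ESPnetSpeakerModel(...)
>>> utt_level_feat = torch.randn(32, 64) # Batch of 32 utterance-level features
>>> spk_embd = model.project_spk_embd(utt_level_feat)
>>> print(spk_embd.shape) # (32, Embedding_Dim)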