espnet2.hubert.espnet_model.TorchAudioHubertPretrainModel
class espnet2.hubert.espnet_model.TorchAudioHubertPretrainModel(vocab_size: int, token_list: Tuple[str, ...] | List[str], frontend: AbsFrontend | None, specaug: AbsSpecAug | None, normalize: AbsNormalize | None, preencoder: AbsPreEncoder | None, encoder: AbsEncoder, ignore_id: int = -1, **kwargs)
Bases: AbsESPnetModel
TorchAudio Hubert Pretrain model.
This model implements HuBERT pretraining for audio representations using TorchAudio, combining frontend processing, data augmentation, and normalization. It inherits from the AbsESPnetModel class.
vocab_size
Size of the vocabulary.
- Type: int
ignore_id
ID to ignore in the loss calculation.
- Type: int
token_list
List of tokens used in the model.
- Type: List[str]
frontend
Frontend for audio feature extraction.
- Type: AbsFrontend
specaug
SpecAugment for data augmentation.
- Type: AbsSpecAug
normalize
Normalization layer.
- Type: AbsNormalize
preencoder
Pre-encoder for raw input data.
- Type: AbsPreEncoder
encoder
Main encoder for processing features.
- Type: AbsEncoder
error_calculator
Error calculation utility.
- Type: Optional[ErrorCalculator]
nan_loss_count
Counter for NaN losses encountered.
- Type: float
Parameters:
- vocab_size (int) – Size of the vocabulary.
- token_list (Union[Tuple[str, ...], List[str]]) – List of tokens.
- frontend (Optional[AbsFrontend]) – Frontend module.
- specaug (Optional[AbsSpecAug]) – SpecAugment module.
- normalize (Optional[AbsNormalize]) – Normalization module.
- preencoder (Optional[AbsPreEncoder]) – Pre-encoder module.
- encoder (AbsEncoder) – Encoder module.
- ignore_id (int, optional) – ID to ignore in the loss calculation (default: -1).
- lsm_weight (float, optional) – Label smoothing weight (default: 0.0).
- length_normalized_loss (bool, optional) – Whether to use length-normalized loss (default: False).
- report_cer (bool, optional) – Whether to report Character Error Rate (default: False).
- report_wer (bool, optional) – Whether to report Word Error Rate (default: False).
- sym_space (str, optional) – Symbol for space (default: "<space>").
- sym_blank (str, optional) – Symbol for blank (default: "<blank>").
- pred_masked_weight (float, optional) – Weight for the loss on masked frames (default: 1.0).
- pred_nomask_weight (float, optional) – Weight for the loss on unmasked frames (default: 0.0).
- loss_weights (float, optional) – Weight for additional loss terms (default: 0.0).
- **kwargs – Additional keyword arguments.
Returns: The forward() call returns a tuple containing the loss tensor, statistics dictionary, and weight tensor.
Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
######### Examples
>>> model = TorchAudioHubertPretrainModel(
...     vocab_size=100,
...     token_list=["<pad>", "<sos>", "<eos>"],
...     frontend=my_frontend,
...     specaug=None,
...     normalize=None,
...     preencoder=None,
...     encoder=my_encoder,
... )
>>> loss, stats, weight = model(speech_tensor, speech_lengths_tensor,
...                             text_tensor, text_lengths_tensor)
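The pred_masked_weight, pred_nomask_weight, and loss_weights parameters control how the per-term losses are combined into the final loss. A minimal sketch of that weighting, using hypothetical tensors masked_loss, unmasked_loss, and feature_penalty (illustrative names, not the model's internal variables):

>>> import torch
>>> # Hypothetical per-term values; the real ones come from the encoder output.
>>> masked_loss = torch.tensor(2.5)      # loss over masked frames
>>> unmasked_loss = torch.tensor(1.0)    # loss over unmasked frames
>>> feature_penalty = torch.tensor(0.1)  # auxiliary regularization term
>>> pred_masked_weight, pred_nomask_weight, loss_weights = 1.0, 0.0, 0.0
>>> loss = (pred_masked_weight * masked_loss
...         + pred_nomask_weight * unmasked_loss
...         + loss_weights * feature_penalty)
>>> float(loss)
2.5

With the defaults, only the masked-frame loss contributes; raising pred_nomask_weight or loss_weights mixes in the other terms.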
NOTE
This model is based on HuBERT (Wei-Ning Hsu, Abdelrahman Mohamed, et al.), described in the paper: https://arxiv.org/pdf/2106.07447.pdf.
- Raises: AssertionError – If input tensor dimensions do not match expected shapes.
collect_feats(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, **kwargs) → Dict[str, Tensor]
Collect features from the input speech tensor and its lengths.
This method extracts features from the provided speech data using the frontend defined in the model. It returns a dictionary containing the extracted features and their corresponding lengths.
- Parameters:
- speech (torch.Tensor) – The input speech tensor of shape (Batch, Length, …).
- speech_lengths (torch.Tensor) – A tensor containing the lengths of each speech input in the batch of shape (Batch,).
- text (torch.Tensor) – A tensor containing the text data of shape (Batch, Length).
- text_lengths (torch.Tensor) – A tensor containing the lengths of each text input in the batch of shape (Batch,).
- kwargs – Additional keyword arguments.
- Returns: A dictionary containing:
  - 'feats': Extracted features of shape (Batch, NFrames, Dim).
  - 'feats_lengths': Lengths of the extracted features of shape (Batch,).
- Return type: Dict[str, torch.Tensor]
######### Examples
>>> model = TorchAudioHubertPretrainModel(...)
>>> speech = torch.randn(10, 16000) # 10 samples of 1 second audio
>>> speech_lengths = torch.tensor([16000] * 10) # lengths for each sample
>>> text = torch.randint(0, 100, (10, 20)) # random text tensor
>>> text_lengths = torch.tensor([20] * 10) # lengths for each text
>>> features = model.collect_feats(speech, speech_lengths, text, text_lengths)
>>> print(features['feats'].shape) # Output: torch.Size([10, NFrames, Dim])
>>> print(features['feats_lengths']) # Output: lengths of features
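Internally, collect_feats is a thin wrapper around the frontend. A minimal sketch of the equivalent logic, assuming a generic AbsFrontend with a (speech, speech_lengths) -> (feats, feats_lengths) call signature (the actual implementation lives in espnet2/hubert/espnet_model.py and may differ):

>>> def collect_feats_sketch(model, speech, speech_lengths):
...     if model.frontend is not None:
...         feats, feats_lengths = model.frontend(speech, speech_lengths)
...     else:
...         # Without a frontend, the raw waveform passes through unchanged.
...         feats, feats_lengths = speech, speech_lengths
...     return {"feats": feats, "feats_lengths": feats_lengths}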
encode(speech: Tensor, speech_lengths: Tensor, y_pad: Tensor, y_pad_length: Tensor) → Tuple[Tensor, Tensor]
Frontend + Encoder. Note that this method is used by asr_inference.py
This method processes the input speech data through the frontend and encoder to produce encoded features.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech signals.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of each speech signal in the batch.
- y_pad – A tensor of shape (Batch, Length, …) representing the padded target sequences.
- y_pad_length – A tensor of shape (Batch,) containing the lengths of each padded target sequence.
- Returns: A tuple containing:
  - encoder_out: The output from the encoder, a tensor of shape (Batch, Length2, Dim2).
  - feats: The extracted features after passing through the frontend.
- Return type: Tuple[torch.Tensor, torch.Tensor]
NOTE
This method is typically called during the forward pass of the model to obtain encoded representations of the input speech data.
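######### Examples

A usage sketch; the tensor shapes and variable names here are illustrative assumptions, not prescribed by the API:

>>> import torch
>>> speech = torch.randn(2, 16000)          # batch of 2 one-second waveforms
>>> speech_lengths = torch.tensor([16000, 16000])
>>> y_pad = torch.randint(0, 100, (2, 50))  # padded pseudo-label targets
>>> y_pad_length = torch.tensor([50, 50])
>>> encoder_out, feats = model.encode(speech, speech_lengths, y_pad, y_pad_length)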
forward(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Frontend + Encoder + Calc loss
This method processes input speech and text data through the model’s frontend and encoder components, computes the loss, and returns the results along with accuracy statistics. It ensures that the input dimensions are consistent and handles data-parallelism for batch processing.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech data.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of each speech sample in the batch.
- text – A tensor of shape (Batch, Length) representing the input text data.
- text_lengths – A tensor of shape (Batch,) containing the lengths of each text sample in the batch.
- kwargs – Additional keyword arguments, which may include "utt_id".
- Returns: A tuple containing:
  - loss (torch.Tensor): The computed loss value.
  - stats (Dict[str, torch.Tensor]): A dictionary with statistics including accuracy metrics.
  - weight (torch.Tensor): A tensor representing the batch size for DataParallel handling.
- Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
- Raises: AssertionError – If the dimensions of input tensors do not match.
######### Examples
>>> model = TorchAudioHubertPretrainModel(...)
>>> speech_tensor = torch.randn(4, 16000) # Batch of 4, 1 second audio
>>> speech_lengths = torch.tensor([16000, 16000, 16000, 16000])
>>> text_tensor = torch.randint(0, 100, (4, 20)) # Batch of 4, text
>>> text_lengths = torch.tensor([20, 20, 20, 20])
>>> loss, stats, weight = model.forward(speech_tensor,
...                                     speech_lengths,
...                                     text_tensor,
...                                     text_lengths)
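During pretraining, the returned loss drives the optimizer step while stats is reported for monitoring. A minimal sketch of one training step under those assumptions (the optimizer choice and hyperparameters here are hypothetical):

>>> import torch
>>> optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
>>> loss, stats, weight = model(speech_tensor, speech_lengths,
...                             text_tensor, text_lengths)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()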