espnet2.hubert.espnet_model.TorchAudioHubertPretrainModel
class espnet2.hubert.espnet_model.TorchAudioHubertPretrainModel(vocab_size: int, token_list: Tuple[str, ...] | List[str], frontend: AbsFrontend | None, specaug: AbsSpecAug | None, normalize: AbsNormalize | None, preencoder: AbsPreEncoder | None, encoder: AbsEncoder, ignore_id: int = -1, **kwargs)
Bases: AbsESPnetModel
TorchAudio Hubert Pretrain model.
This model implements HuBERT pretraining for audio representations using TorchAudio, combining frontend processing, data augmentation, and normalization. It inherits from the AbsESPnetModel class.
vocab_size
Size of the vocabulary.
- Type: int
ignore_id
ID to ignore in the loss calculation.
- Type: int
token_list
List of tokens used in the model.
- Type: List[str]
frontend
Frontend for audio feature extraction.
- Type: AbsFrontend
specaug
SpecAugment for data augmentation.
- Type: AbsSpecAug
normalize
Normalization layer.
- Type: AbsNormalize
preencoder
Pre-encoder for raw input data.
- Type: AbsPreEncoder
encoder
Main encoder for processing features.
- Type: AbsEncoder
error_calculator
Error calculation utility.
- Type: Optional[ErrorCalculator]
nan_loss_count
Counter for NaN losses encountered.
- Type: float
Parameters:
- vocab_size (int) – Size of the vocabulary.
- token_list (Union[Tuple[str, ...], List[str]]) – List of tokens.
- frontend (Optional[AbsFrontend]) – Frontend module.
- specaug (Optional[AbsSpecAug]) – SpecAugment module.
- normalize (Optional[AbsNormalize]) – Normalization module.
- preencoder (Optional[AbsPreEncoder]) – Pre-encoder module.
- encoder (AbsEncoder) – Encoder module.
- ignore_id (int, optional) – ID to ignore in the loss calculation (default: -1).
- lsm_weight (float, optional) – Label smoothing weight (default: 0.0).
- length_normalized_loss (bool, optional) – Whether to use length-normalized loss (default: False).
- report_cer (bool, optional) – Whether to report Character Error Rate (default: False).
- report_wer (bool, optional) – Whether to report Word Error Rate (default: False).
- sym_space (str, optional) – Symbol for space (default: "<space>").
- sym_blank (str, optional) – Symbol for blank (default: "<blank>").
- pred_masked_weight (float, optional) – Weight for the loss on masked frames (default: 1.0).
- pred_nomask_weight (float, optional) – Weight for the loss on unmasked frames (default: 0.0).
- loss_weights (float, optional) – Weight for additional loss terms (default: 0.0).
- **kwargs – Additional keyword arguments.
Returns: The forward() call returns a tuple containing the loss tensor, statistics dictionary, and weight tensor.
Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
######### Examples
>>> model = TorchAudioHubertPretrainModel(
...     vocab_size=100,
...     token_list=["<pad>", "<sos>", "<eos>"],
...     frontend=my_frontend,
...     specaug=None,
...     normalize=None,
...     preencoder=None,
...     encoder=my_encoder,
... )
>>> loss, stats, weight = model(speech_tensor, speech_lengths_tensor,
...                             text_tensor, text_lengths_tensor)
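The pred_masked_weight, pred_nomask_weight, and loss_weights parameters control how the per-term losses are combined into the final loss. A minimal sketch of that weighting, using hypothetical tensors masked_loss, unmasked_loss, and feature_penalty (illustrative names, not the model's internal variables):

>>> import torch
>>> # Hypothetical per-term values; the real ones come from the encoder output.
>>> masked_loss = torch.tensor(2.5)      # loss over masked frames
>>> unmasked_loss = torch.tensor(1.0)    # loss over unmasked frames
>>> feature_penalty = torch.tensor(0.1)  # auxiliary regularization term
>>> pred_masked_weight, pred_nomask_weight, loss_weights = 1.0, 0.0, 0.0
>>> loss = (pred_masked_weight * masked_loss
...         + pred_nomask_weight * unmasked_loss
...         + loss_weights * feature_penalty)
>>> float(loss)
2.5

With the defaults, only the masked-frame loss contributes; raising pred_nomask_weight or loss_weights mixes in the other terms.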
NOTE
This model is based on HuBERT (Wei-Ning Hsu, Abdelrahman Mohamed, et al.), described in the paper: https://arxiv.org/pdf/2106.07447.pdf.
- Raises: AssertionError – If input tensor dimensions do not match expected shapes.
collect_feats(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, **kwargs) → Dict[str, Tensor]
Collect features from the input speech tensor and its lengths.
This method extracts features from the provided speech data using the frontend defined in the model. It returns a dictionary containing the extracted features and their corresponding lengths.
- Parameters:
- speech (torch.Tensor) – The input speech tensor of shape (Batch, Length, …).
- speech_lengths (torch.Tensor) – A tensor containing the lengths of each speech input in the batch of shape (Batch,).
- text (torch.Tensor) – A tensor containing the text data of shape (Batch, Length).
- text_lengths (torch.Tensor) – A tensor containing the lengths of each text input in the batch of shape (Batch,).
- kwargs – Additional keyword arguments.
- Returns: A dictionary containing:
  - 'feats': Extracted features of shape (Batch, NFrames, Dim).
  - 'feats_lengths': Lengths of the extracted features of shape (Batch,).
- Return type: Dict[str, torch.Tensor]
######### Examples
>>> model = TorchAudioHubertPretrainModel(...)
>>> speech = torch.randn(10, 16000) # 10 samples of 1 second audio
>>> speech_lengths = torch.tensor([16000] * 10) # lengths for each sample
>>> text = torch.randint(0, 100, (10, 20)) # random text tensor
>>> text_lengths = torch.tensor([20] * 10) # lengths for each text
>>> features = model.collect_feats(speech, speech_lengths, text, text_lengths)
>>> print(features['feats'].shape) # Output: torch.Size([10, NFrames, Dim])
>>> print(features['feats_lengths']) # Output: lengths of features
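Internally, collect_feats is a thin wrapper around the frontend. A minimal sketch of the equivalent logic, assuming a generic AbsFrontend with a (speech, speech_lengths) -> (feats, feats_lengths) call signature (the actual implementation lives in espnet2/hubert/espnet_model.py and may differ):

>>> def collect_feats_sketch(model, speech, speech_lengths):
...     if model.frontend is not None:
...         feats, feats_lengths = model.frontend(speech, speech_lengths)
...     else:
...         # Without a frontend, the raw waveform passes through unchanged.
...         feats, feats_lengths = speech, speech_lengths
...     return {"feats": feats, "feats_lengths": feats_lengths}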
encode(speech: Tensor, speech_lengths: Tensor, y_pad: Tensor, y_pad_length: Tensor) → Tuple[Tensor, Tensor]
Frontend + Encoder. Note that this method is used by asr_inference.py
This method processes the input speech data through the frontend and encoder to produce encoded features.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech signals.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of each speech signal in the batch.
- y_pad – A tensor of shape (Batch, Length, …) representing the padded target sequences.
- y_pad_length – A tensor of shape (Batch,) containing the lengths of each padded target sequence.
- Returns: A tuple containing:
  - encoder_out: The output from the encoder, a tensor of shape (Batch, Length2, Dim2).
  - feats: The extracted features after passing through the frontend.
- Return type: Tuple[torch.Tensor, torch.Tensor]
NOTE
This method is typically called during the forward pass of the model to obtain encoded representations of the input speech data.
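######### Examples

A usage sketch; the tensor shapes and variable names here are illustrative assumptions, not prescribed by the API:

>>> import torch
>>> speech = torch.randn(2, 16000)          # batch of 2 one-second waveforms
>>> speech_lengths = torch.tensor([16000, 16000])
>>> y_pad = torch.randint(0, 100, (2, 50))  # padded pseudo-label targets
>>> y_pad_length = torch.tensor([50, 50])
>>> encoder_out, feats = model.encode(speech, speech_lengths, y_pad, y_pad_length)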
forward(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Frontend + Encoder + Calc loss
This method processes input speech and text data through the model’s frontend and encoder components, computes the loss, and returns the results along with accuracy statistics. It ensures that the input dimensions are consistent and handles data-parallelism for batch processing.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech data.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of each speech sample in the batch.
- text – A tensor of shape (Batch, Length) representing the input text data.
- text_lengths – A tensor of shape (Batch,) containing the lengths of each text sample in the batch.
- kwargs – Additional keyword arguments, which may include "utt_id".
- Returns: A tuple containing:
  - loss (torch.Tensor): The computed loss value.
  - stats (Dict[str, torch.Tensor]): A dictionary with statistics including accuracy metrics.
  - weight (torch.Tensor): A tensor representing the batch size for DataParallel handling.
- Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
- Raises: AssertionError – If the dimensions of input tensors do not match.
######### Examples
>>> model = TorchAudioHubertPretrainModel(...)
>>> speech_tensor = torch.randn(4, 16000) # Batch of 4, 1 second audio
>>> speech_lengths = torch.tensor([16000, 16000, 16000, 16000])
>>> text_tensor = torch.randint(0, 100, (4, 20)) # Batch of 4, text
>>> text_lengths = torch.tensor([20, 20, 20, 20])
>>> loss, stats, weight = model.forward(speech_tensor,
...                                     speech_lengths,
...                                     text_tensor,
...                                     text_lengths)
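During pretraining, the returned loss drives the optimizer step while stats is reported for monitoring. A minimal sketch of one training step under those assumptions (the optimizer choice and hyperparameters here are hypothetical):

>>> import torch
>>> optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
>>> loss, stats, weight = model(speech_tensor, speech_lengths,
...                             text_tensor, text_lengths)
>>> optimizer.zero_grad()
>>> loss.backward()
>>> optimizer.step()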