espnet2.slu.espnet_model.ESPnetSLUModel
class espnet2.slu.espnet_model.ESPnetSLUModel(vocab_size: int, token_list: Tuple[str, ...] | List[str], frontend: AbsFrontend | None, specaug: AbsSpecAug | None, normalize: AbsNormalize | None, preencoder: AbsPreEncoder | None, encoder: AbsEncoder, postencoder: AbsPostEncoder | None, decoder: AbsDecoder, ctc: CTC, joint_network: Module | None, postdecoder: AbsPostDecoder | None = None, deliberationencoder: AbsPostEncoder | None = None, transcript_token_list: Tuple[str, ...] | List[str] | None = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True, two_pass: bool = False, pre_postencoder_norm: bool = False)
Bases: ESPnetASRModel
CTC-attention hybrid Encoder-Decoder model for spoken language understanding.
This model combines the CTC (Connectionist Temporal Classification) and attention mechanisms to process and understand spoken language inputs. It can handle both speech and text inputs and is suitable for tasks such as speech recognition and natural language understanding.
blank_id
ID for the blank token in CTC.
- Type: int
sos
Start of sequence token ID.
- Type: int
eos
End of sequence token ID.
- Type: int
vocab_size
Size of the vocabulary.
- Type: int
ignore_id
ID of the token to ignore in loss calculations.
- Type: int
ctc_weight
Weight for the CTC loss.
- Type: float
interctc_weight
Weight for the intermediate CTC loss.
- Type: float
token_list
List of tokens.
- Type: List[str]
transcript_token_list
List of transcript tokens.
- Type: Optional[List[str]]
two_pass
Flag for using two-pass decoding.
- Type: bool
pre_postencoder_norm
Flag for normalization in pre/post-encoder.
- Type: bool
frontend
Frontend feature extractor.
- Type: Optional[AbsFrontend]
specaug
SpecAugment module for data augmentation.
- Type: Optional[AbsSpecAug]
normalize
Normalization module.
- Type: Optional[AbsNormalize]
preencoder
Pre-encoder module.
- Type: Optional[AbsPreEncoder]
postencoder
Post-encoder module.
- Type: Optional[AbsPostEncoder]
postdecoder
Post-decoder module.
- Type: Optional[AbsPostDecoder]
encoder
Main encoder module.
- Type: AbsEncoder
decoder
Decoder module.
- Type: AbsDecoder
ctc
CTC module.
- Type: CTC
joint_network
Joint network for transducer.
- Type: Optional[torch.nn.Module]
deliberationencoder
Deliberation encoder.
- Type: Optional[AbsPostEncoder]
error_calculator
Error calculator for metrics.
- Type: Optional[ErrorCalculator]
Parameters:
- vocab_size (int) – Size of the vocabulary.
- token_list (Union[Tuple[str, ...], List[str]]) – List of tokens.
- frontend (Optional[AbsFrontend]) – Frontend feature extractor.
- specaug (Optional[AbsSpecAug]) – SpecAugment module.
- normalize (Optional[AbsNormalize]) – Normalization module.
- preencoder (Optional[AbsPreEncoder]) – Pre-encoder module.
- encoder (AbsEncoder) – Encoder module.
- postencoder (Optional[AbsPostEncoder]) – Post-encoder module.
- decoder (AbsDecoder) – Decoder module.
- ctc (CTC) – CTC module.
- joint_network (Optional[torch.nn.Module]) – Joint network for transducer decoding.
- postdecoder (Optional[AbsPostDecoder]) – Post-decoder module.
- deliberationencoder (Optional[AbsPostEncoder]) – Deliberation encoder.
- transcript_token_list (Union[Tuple[str, ...], List[str], None], optional) – List of transcript tokens.
- ctc_weight (float, optional) – Weight for CTC loss (default: 0.5).
- interctc_weight (float, optional) – Weight for intermediate CTC loss (default: 0.0).
- ignore_id (int, optional) – ID to ignore in loss calculations (default: -1).
- lsm_weight (float, optional) – Label smoothing weight (default: 0.0).
- length_normalized_loss (bool, optional) – Flag for length normalization (default: False).
- report_cer (bool, optional) – Flag to report CER (default: True).
- report_wer (bool, optional) – Flag to report WER (default: True).
- sym_space (str, optional) – Symbol for space (default: "<space>").
- sym_blank (str, optional) – Symbol for blank (default: "<blank>").
- extract_feats_in_collect_stats (bool, optional) – Flag to extract features during statistics collection (default: True).
- two_pass (bool, optional) – Flag for two-pass decoding (default: False).
- pre_postencoder_norm (bool, optional) – Flag for pre-/post-encoder normalization (default: False).
Return type: None
Examples
>>> model = ESPnetSLUModel(
... vocab_size=100,
... token_list=['<blank>', '<space>', "hello", "world"],
... frontend=None,
... specaug=None,
... normalize=None,
... preencoder=None,
... encoder=my_encoder,
... postencoder=None,
... decoder=my_decoder,
... ctc=my_ctc,
... joint_network=None
... )
>>> output = model.forward(speech_tensor, speech_lengths, text_tensor, text_lengths)
NOTE: The model can be used in both training and inference modes. Set the training mode appropriately (e.g., model.train() or model.eval()) when using the model.
- Raises: AssertionError – If ctc_weight or interctc_weight is outside the valid range [0.0, 1.0].
collect_feats(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, transcript: Tensor | None = None, transcript_lengths: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Extract features from the input speech tensor.
This method processes the input speech tensor to extract relevant features and their corresponding lengths, returning them in a dictionary format. It is typically used in the context of speech recognition tasks to prepare input data for the model.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the speech data.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of the speech sequences.
- text – A tensor of shape (Batch, Length) representing the corresponding text data (not used in feature extraction).
- text_lengths – A tensor of shape (Batch,) containing the lengths of the text sequences (not used in feature extraction).
- transcript – An optional tensor representing the transcript (default: None).
- transcript_lengths – An optional tensor representing the lengths of the transcripts (default: None).
- kwargs – Additional keyword arguments for future extension.
- Returns: A dictionary containing:
  - "feats": The extracted features tensor.
  - "feats_lengths": The lengths of the extracted features.
- Return type: Dict[str, Tensor]
Examples
>>> model = ESPnetSLUModel(...)
>>> speech_tensor = torch.randn(4, 16000) # 4 samples, 1 second each
>>> speech_lengths = torch.tensor([16000, 15000, 14000, 13000])
>>> text_tensor = torch.tensor([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
>>> text_lengths = torch.tensor([3, 3, 3, 3])
>>> result = model.collect_feats(speech_tensor, speech_lengths, text_tensor, text_lengths)
>>> print(result['feats'].shape)
torch.Size([4, ...]) # Shape depends on the feature extraction method
>>> print(result['feats_lengths'])
tensor([...]) # Lengths of the extracted features
NOTE: This method delegates the actual feature extraction from the input speech data to the internal _extract_feats function.
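As a rough sketch (a simplified illustration, not the actual ESPnet implementation), collect_feats essentially packs the frontend output into a dictionary; here extract_feats stands in for the model's internal _extract_feats:

```python
def collect_feats_sketch(speech, speech_lengths, extract_feats):
    # extract_feats is a placeholder for the model's internal _extract_feats,
    # which applies the frontend (or passes raw input through when no
    # frontend is configured).
    feats, feats_lengths = extract_feats(speech, speech_lengths)
    return {"feats": feats, "feats_lengths": feats_lengths}
```

The returned dictionary matches the "feats" / "feats_lengths" keys documented above.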
encode(speech: Tensor, speech_lengths: Tensor, transcript_pad: Tensor | None = None, transcript_pad_lens: Tensor | None = None) → Tuple[Tensor, Tensor]
Processes the input speech through the frontend and encoder.
This method performs the following steps:
- Extracts features from the input speech.
- Applies data augmentation if specified and in training mode.
- Normalizes the features.
- Passes the features through the pre-encoder (if applicable).
- Feeds the processed features into the encoder.
- Optionally applies a post-encoder for further processing.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech data.
- speech_lengths – A tensor of shape (Batch,) representing the lengths of the input speech sequences.
- transcript_pad – (Optional) A tensor for padded transcripts, if provided.
- transcript_pad_lens – (Optional) A tensor representing the lengths of the padded transcripts.
- Returns: A tuple containing:
  - encoder_out: A tensor of shape (Batch, Length2, Dim2) representing the output from the encoder.
  - encoder_out_lens: A tensor representing the lengths of the encoder outputs.
- Return type: Tuple[Tensor, Tensor]
Examples
>>> model = ESPnetSLUModel(vocab_size=5000, token_list=['<blank>', '<sos>', '<eos>'], ...)
>>> speech_data = torch.randn(2, 16000) # Example speech data
>>> speech_lengths = torch.tensor([16000, 15000]) # Lengths of the examples
>>> encoder_out, encoder_out_lens = model.encode(speech_data, speech_lengths)
NOTE: This method is called by forward() during training and can be used directly during inference.
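The processing steps listed above can be sketched as a pipeline of optional stages (a simplified illustration, not the actual ESPnet code; each stage is assumed to map a (features, lengths) pair to a new (features, lengths) pair):

```python
def encode_pipeline(speech, speech_lengths, encoder, frontend=None,
                    specaug=None, normalize=None, preencoder=None,
                    postencoder=None, training=False):
    """Sketch of ESPnetSLUModel.encode: each optional stage transforms
    (features, lengths) and is skipped when not configured."""
    feats, lens = speech, speech_lengths
    if frontend is not None:               # raw waveform -> acoustic features
        feats, lens = frontend(feats, lens)
    if specaug is not None and training:   # augmentation only in training mode
        feats, lens = specaug(feats, lens)
    if normalize is not None:              # e.g. mean/variance normalization
        feats, lens = normalize(feats, lens)
    if preencoder is not None:             # optional dimensionality adapter
        feats, lens = preencoder(feats, lens)
    encoder_out, encoder_out_lens = encoder(feats, lens)
    if postencoder is not None:            # optional post-processing
        encoder_out, encoder_out_lens = postencoder(encoder_out, encoder_out_lens)
    return encoder_out, encoder_out_lens
```

Note how SpecAugment is applied only when training is True, matching the "if specified and in training mode" condition above.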
forward(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, transcript: Tensor | None = None, transcript_lengths: Tensor | None = None, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Perform a forward pass through the model, including the encoder, decoder, and loss calculation.
This method processes the input speech and text data, passing them through the model’s encoder and decoder components, and calculates the corresponding loss values based on the specified weights for CTC and attention losses.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech data.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of each speech input.
- text – A tensor of shape (Batch, Length) representing the target text sequences.
- text_lengths – A tensor of shape (Batch,) containing the lengths of each text input.
- transcript – (Optional) A tensor representing additional transcript information. Defaults to None.
- transcript_lengths – (Optional) A tensor of lengths for the transcripts. Defaults to None.
- kwargs – Additional keyword arguments; "utt_id" is among the inputs.
- Returns: A tuple containing:
  - loss: A tensor representing the total loss computed.
  - stats: A dictionary containing various statistics such as loss values and error rates.
  - weight: A tensor representing the batch size, used as the weight for loss computation.
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
- Raises: AssertionError – If the input dimensions do not match or if any of the assertions regarding tensor shapes fail.
Examples
>>> model = ESPnetSLUModel(...)
>>> speech_data = torch.randn(32, 16000) # Batch of 32 samples
>>> speech_lengths = torch.tensor([16000] * 32) # All lengths 16000
>>> text_data = torch.randint(0, 100, (32, 20)) # Batch of texts
>>> text_lengths = torch.tensor([20] * 32) # All lengths 20
>>> loss, stats, weight = model.forward(
... speech_data, speech_lengths, text_data, text_lengths
... )
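The loss weighting described above can be sketched as follows (a simplified illustration of the hybrid CTC/attention objective; the real model additionally folds in intermediate CTC losses when interctc_weight > 0):

```python
def hybrid_loss(loss_ctc, loss_att, ctc_weight):
    """Weighted combination of CTC and attention losses,
    as used in hybrid CTC/attention training."""
    if ctc_weight == 0.0:
        return loss_att   # pure attention training
    if ctc_weight == 1.0:
        return loss_ctc   # pure CTC training
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```

With the default ctc_weight of 0.5, the two losses contribute equally to the total.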