espnet2.slu.espnet_model.ESPnetSLUModel
class espnet2.slu.espnet_model.ESPnetSLUModel(vocab_size: int, token_list: Tuple[str, ...] | List[str], frontend: AbsFrontend | None, specaug: AbsSpecAug | None, normalize: AbsNormalize | None, preencoder: AbsPreEncoder | None, encoder: AbsEncoder, postencoder: AbsPostEncoder | None, decoder: AbsDecoder, ctc: CTC, joint_network: Module | None, postdecoder: AbsPostDecoder | None = None, deliberationencoder: AbsPostEncoder | None = None, transcript_token_list: Tuple[str, ...] | List[str] | None = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True, two_pass: bool = False, pre_postencoder_norm: bool = False)
Bases: ESPnetASRModel
CTC-attention hybrid Encoder-Decoder model for spoken language understanding.
This model combines the CTC (Connectionist Temporal Classification) and attention mechanisms to process and understand spoken language inputs. It can handle both speech and text inputs and is suitable for tasks such as speech recognition and natural language understanding.
blank_id
ID for the blank token in CTC.
- Type: int
sos
Start of sequence token ID.
- Type: int
eos
End of sequence token ID.
- Type: int
vocab_size
Size of the vocabulary.
- Type: int
ignore_id
ID of the token to ignore in loss calculations.
- Type: int
ctc_weight
Weight for the CTC loss.
- Type: float
interctc_weight
Weight for the intermediate CTC loss.
- Type: float
token_list
List of tokens.
- Type: List[str]
transcript_token_list
List of transcript tokens.
- Type: Optional[List[str]]
two_pass
Flag for using two-pass decoding.
- Type: bool
pre_postencoder_norm
Flag for normalization in pre/post-encoder.
- Type: bool
frontend
Frontend feature extractor.
- Type: Optional[AbsFrontend]
specaug
SpecAugment module for data augmentation.
- Type: Optional[AbsSpecAug]
normalize
Normalization module.
- Type: Optional[AbsNormalize]
preencoder
Pre-encoder module.
- Type: Optional[AbsPreEncoder]
postencoder
Post-encoder module.
- Type: Optional[AbsPostEncoder]
postdecoder
Post-decoder module.
- Type: Optional[AbsPostDecoder]
encoder
Main encoder module.
- Type: AbsEncoder
decoder
Decoder module.
- Type: AbsDecoder
ctc
CTC module.
- Type: CTC
joint_network
Joint network for transducer.
- Type: Optional[torch.nn.Module]
deliberationencoder
Deliberation encoder.
- Type: Optional[AbsPostEncoder]
error_calculator
Error calculator for metrics.
- Type: Optional[ErrorCalculator]
Parameters:
- vocab_size (int) – Size of the vocabulary.
- token_list (Union[Tuple[str, ...], List[str]]) – List of tokens.
- frontend (Optional[AbsFrontend]) – Frontend feature extractor.
- specaug (Optional[AbsSpecAug]) – SpecAugment module.
- normalize (Optional[AbsNormalize]) – Normalization module.
- preencoder (Optional[AbsPreEncoder]) – Pre-encoder module.
- encoder (AbsEncoder) – Encoder module.
- postencoder (Optional[AbsPostEncoder]) – Post-encoder module.
- decoder (AbsDecoder) – Decoder module.
- ctc (CTC) – CTC module.
- joint_network (Optional[torch.nn.Module]) – Joint network for transducer decoding.
- postdecoder (Optional[AbsPostDecoder]) – Post-decoder module.
- deliberationencoder (Optional[AbsPostEncoder]) – Deliberation encoder.
- transcript_token_list (Union[Tuple[str, ...], List[str], None], optional) – List of transcript tokens.
- ctc_weight (float, optional) – Weight for CTC loss (default: 0.5).
- interctc_weight (float, optional) – Weight for intermediate CTC loss (default: 0.0).
- ignore_id (int, optional) – ID to ignore in loss calculations (default: -1).
- lsm_weight (float, optional) – Label smoothing weight (default: 0.0).
- length_normalized_loss (bool, optional) – Flag for length normalization (default: False).
- report_cer (bool, optional) – Flag to report CER (default: True).
- report_wer (bool, optional) – Flag to report WER (default: True).
- sym_space (str, optional) – Symbol for space (default: "<space>").
- sym_blank (str, optional) – Symbol for blank (default: "<blank>").
- extract_feats_in_collect_stats (bool, optional) – Flag to extract features during statistics collection (default: True).
- two_pass (bool, optional) – Flag for two-pass decoding (default: False).
- pre_postencoder_norm (bool, optional) – Flag for pre-/post-encoder normalization (default: False).
Return type: None
Examples
>>> model = ESPnetSLUModel(
... vocab_size=100,
... token_list=['<blank>', '<space>', "hello", "world"],
... frontend=None,
... specaug=None,
... normalize=None,
... preencoder=None,
... encoder=my_encoder,
... postencoder=None,
... decoder=my_decoder,
... ctc=my_ctc,
... joint_network=None
... )
>>> output = model.forward(speech_tensor, speech_lengths, text_tensor, text_lengths)
NOTE: The model can be used in both training and inference modes. Set the training mode appropriately (e.g., model.train() or model.eval()) when using the model.
- Raises: AssertionError – If ctc_weight or interctc_weight is outside the valid range [0.0, 1.0].
collect_feats(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, transcript: Tensor | None = None, transcript_lengths: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Extract features from the input speech tensor.
This method processes the input speech tensor to extract relevant features and their corresponding lengths, returning them in a dictionary format. It is typically used in the context of speech recognition tasks to prepare input data for the model.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the speech data.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of the speech sequences.
- text – A tensor of shape (Batch, Length) representing the corresponding text data (not used in feature extraction).
- text_lengths – A tensor of shape (Batch,) containing the lengths of the text sequences (not used in feature extraction).
- transcript – An optional tensor representing the transcript (default: None).
- transcript_lengths – An optional tensor representing the lengths of the transcripts (default: None).
- kwargs – Additional keyword arguments for future extension.
- Returns: A dictionary containing:
  - "feats": The extracted features tensor.
  - "feats_lengths": The lengths of the extracted features.
- Return type: Dict[str, Tensor]
Examples
>>> model = ESPnetSLUModel(...)
>>> speech_tensor = torch.randn(4, 16000) # 4 samples, 1 second each
>>> speech_lengths = torch.tensor([16000, 15000, 14000, 13000])
>>> text_tensor = torch.tensor([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
>>> text_lengths = torch.tensor([3, 3, 3, 3])
>>> result = model.collect_feats(speech_tensor, speech_lengths, text_tensor, text_lengths)
>>> print(result['feats'].shape)
torch.Size([4, ...]) # Shape depends on the feature extraction method
>>> print(result['feats_lengths'])
tensor([...]) # Lengths of the extracted features
NOTE: This method delegates the actual feature extraction from the input speech data to the internal _extract_feats function.
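As a rough sketch (a simplified illustration, not the actual ESPnet implementation), collect_feats essentially packs the frontend output into a dictionary; here extract_feats stands in for the model's internal _extract_feats:

```python
def collect_feats_sketch(speech, speech_lengths, extract_feats):
    # extract_feats is a placeholder for the model's internal _extract_feats,
    # which applies the frontend (or passes raw input through when no
    # frontend is configured).
    feats, feats_lengths = extract_feats(speech, speech_lengths)
    return {"feats": feats, "feats_lengths": feats_lengths}
```

The returned dictionary matches the "feats" / "feats_lengths" keys documented above.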
encode(speech: Tensor, speech_lengths: Tensor, transcript_pad: Tensor | None = None, transcript_pad_lens: Tensor | None = None) → Tuple[Tensor, Tensor]
Processes the input speech through the frontend and encoder.
This method performs the following steps:
- Extracts features from the input speech.
- Applies data augmentation if specified and in training mode.
- Normalizes the features.
- Passes the features through the pre-encoder (if applicable).
- Feeds the processed features into the encoder.
- Optionally applies a post-encoder for further processing.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech data.
- speech_lengths – A tensor of shape (Batch,) representing the lengths of the input speech sequences.
- transcript_pad – (Optional) A tensor for padded transcripts, if provided.
- transcript_pad_lens – (Optional) A tensor representing the lengths of the padded transcripts.
- Returns: A tuple containing:
  - encoder_out: A tensor of shape (Batch, Length2, Dim2) representing the output from the encoder.
  - encoder_out_lens: A tensor representing the lengths of the encoder outputs.
- Return type: Tuple[Tensor, Tensor]
Examples
>>> model = ESPnetSLUModel(vocab_size=5000, token_list=['<blank>', '<sos>', '<eos>'], ...)
>>> speech_data = torch.randn(2, 16000) # Example speech data
>>> speech_lengths = torch.tensor([16000, 15000]) # Lengths of the examples
>>> encoder_out, encoder_out_lens = model.encode(speech_data, speech_lengths)
NOTE: This method is called by forward() during training and can be used directly during inference.
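The processing steps listed above can be sketched as a pipeline of optional stages (a simplified illustration, not the actual ESPnet code; each stage is assumed to map a (features, lengths) pair to a new (features, lengths) pair):

```python
def encode_pipeline(speech, speech_lengths, encoder, frontend=None,
                    specaug=None, normalize=None, preencoder=None,
                    postencoder=None, training=False):
    """Sketch of ESPnetSLUModel.encode: each optional stage transforms
    (features, lengths) and is skipped when not configured."""
    feats, lens = speech, speech_lengths
    if frontend is not None:               # raw waveform -> acoustic features
        feats, lens = frontend(feats, lens)
    if specaug is not None and training:   # augmentation only in training mode
        feats, lens = specaug(feats, lens)
    if normalize is not None:              # e.g. mean/variance normalization
        feats, lens = normalize(feats, lens)
    if preencoder is not None:             # optional dimensionality adapter
        feats, lens = preencoder(feats, lens)
    encoder_out, encoder_out_lens = encoder(feats, lens)
    if postencoder is not None:            # optional post-processing
        encoder_out, encoder_out_lens = postencoder(encoder_out, encoder_out_lens)
    return encoder_out, encoder_out_lens
```

Note how SpecAugment is applied only when training is True, matching the "if specified and in training mode" condition above.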
forward(speech: Tensor, speech_lengths: Tensor, text: Tensor, text_lengths: Tensor, transcript: Tensor | None = None, transcript_lengths: Tensor | None = None, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Perform a forward pass through the model, including the encoder, decoder, and loss calculation.
This method processes the input speech and text data, passing them through the model’s encoder and decoder components, and calculates the corresponding loss values based on the specified weights for CTC and attention losses.
- Parameters:
- speech – A tensor of shape (Batch, Length, …) representing the input speech data.
- speech_lengths – A tensor of shape (Batch,) containing the lengths of each speech input.
- text – A tensor of shape (Batch, Length) representing the target text sequences.
- text_lengths – A tensor of shape (Batch,) containing the lengths of each text input.
- transcript – (Optional) A tensor representing additional transcript information. Defaults to None.
- transcript_lengths – (Optional) A tensor of lengths for the transcripts. Defaults to None.
- kwargs – Additional keyword arguments; "utt_id" is among the inputs.
- Returns: A tuple containing:
  - loss: A tensor representing the total loss computed.
  - stats: A dictionary containing various statistics such as loss values and error rates.
  - weight: A tensor representing the batch size, used as the weight for loss computation.
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
- Raises: AssertionError – If the input dimensions do not match or if any of the assertions regarding tensor shapes fail.
Examples
>>> model = ESPnetSLUModel(...)
>>> speech_data = torch.randn(32, 16000) # Batch of 32 samples
>>> speech_lengths = torch.tensor([16000] * 32) # All lengths 16000
>>> text_data = torch.randint(0, 100, (32, 20)) # Batch of texts
>>> text_lengths = torch.tensor([20] * 32) # All lengths 20
>>> loss, stats, weight = model.forward(
... speech_data, speech_lengths, text_data, text_lengths
... )
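The loss weighting described above can be sketched as follows (a simplified illustration of the hybrid CTC/attention objective; the real model additionally folds in intermediate CTC losses when interctc_weight > 0):

```python
def hybrid_loss(loss_ctc, loss_att, ctc_weight):
    """Weighted combination of CTC and attention losses,
    as used in hybrid CTC/attention training."""
    if ctc_weight == 0.0:
        return loss_att   # pure attention training
    if ctc_weight == 1.0:
        return loss_ctc   # pure CTC training
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```

With the default ctc_weight of 0.5, the two losses contribute equally to the total.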