espnet2.asr.decoder.hugging_face_transformers_decoder.HuggingFaceTransformersDecoder
class espnet2.asr.decoder.hugging_face_transformers_decoder.HuggingFaceTransformersDecoder(vocab_size: int, encoder_output_size: int, model_name_or_path: str, causal_lm: bool = False, prefix: str = '', postfix: str = '', overriding_architecture_config: str | dict | None = {}, load_pretrained_weights: bool = True, separate_lm_head: bool = False)
Bases: AbsDecoder, BatchScorerInterface
Hugging Face Transformers Decoder.
This class implements a decoder that utilizes Hugging Face’s Transformers models for automatic speech recognition (ASR). It supports both causal language models and sequence-to-sequence models.
- Parameters:
- vocab_size (int) – The size of the vocabulary.
- encoder_output_size (int) – The size of the encoder output.
- model_name_or_path (str) – The name or path of the pre-trained Transformers model.
- causal_lm (bool, optional) – Whether to use a causal language model; if True, the model at model_name_or_path is loaded as a causal language model rather than a sequence-to-sequence model. Defaults to False.
- prefix (str, optional) – Prefix to be added to the input tokens. Defaults to "".
- postfix (str, optional) – Postfix to be added to the input tokens. Defaults to "".
- overriding_architecture_config (str or dict, optional) – Path to a configuration JSON file, or the configuration dictionary itself, used to override the default decoder configuration. Defaults to an empty dict.
- load_pretrained_weights (bool) – Whether to load the pre-trained weights. Defaults to True.
- separate_lm_head (bool) – If True, the language model head is not shared with the input token embeddings. If False, the original structure is kept; i.e., if the original Transformers implementation ties these weights, the tying is retained. Defaults to False.
- Raises:
- ImportError – If the transformers library is not available.
- Exception – If the word embeddings attribute cannot be found in the model.
Examples
>>> decoder = HuggingFaceTransformersDecoder(
... vocab_size=5000,
... encoder_output_size=256,
... model_name_or_path="gpt2",
... causal_lm=True
... )
>>> hs_pad = torch.rand(32, 10, 256) # Example encoder output
>>> hlens = torch.tensor([10] * 32) # Example lengths
>>> ys_in_pad = torch.randint(0, 5000, (32, 15)) # Example input
>>> ys_in_lens = torch.tensor([15] * 32) # Example lengths
>>> output, output_lengths = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
NOTE: Ensure that the transformers library is installed to use this class.
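For comparison, a sequence-to-sequence model is used when causal_lm is left at its default. A minimal sketch, reusing the tensors from the example above; "facebook/bart-base" is an illustrative checkpoint, not a requirement of this class:
>>> decoder = HuggingFaceTransformersDecoder(
...     vocab_size=5000,
...     encoder_output_size=256,
...     model_name_or_path="facebook/bart-base",
... )
>>> output, output_lengths = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)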
add_prefix_postfix(enc_out, hlens, ys_in_pad, ys_in_lens)
Adds a prefix and a postfix around the encoder output when constructing the decoder input.
This method constructs the input for the decoder by concatenating a prefix, the encoder output, a postfix, and the input token embeddings. It also generates the appropriate attention mask for the decoder.
- Parameters:
- enc_out (torch.Tensor) – The encoded output from the encoder, of shape (batch_size, max_length, hidden_size).
- hlens (torch.Tensor) – Lengths of the encoder outputs for each sample in the batch, of shape (batch_size,).
- ys_in_pad (torch.Tensor) – Input tensor representing the target sequence, of shape (batch_size, max_length_out).
- ys_in_lens (torch.Tensor) – Lengths of the input target sequences, of shape (batch_size,).
- Returns: A tuple containing:
  - args (dict): A dictionary of inputs prepared for the decoder, including ‘inputs_embeds’ and ‘attention_mask’.
  - no_loss_lengths (torch.Tensor): Lengths of the input sequences that will not contribute to the loss calculation.
- Return type: Tuple[dict, torch.Tensor]
Examples
>>> # Assume the decoder was constructed with prefix="Hello", postfix="World"
>>> enc_out = torch.rand(2, 10, 768) # Example encoder output
>>> hlens = torch.tensor([10, 8])
>>> ys_in_pad = torch.tensor([[1, 2, 3], [1, 2, 0]])
>>> ys_in_lens = torch.tensor([3, 2])
>>> args, no_loss_lengths = decoder.add_prefix_postfix(enc_out, hlens,
... ys_in_pad, ys_in_lens)
NOTE: The method handles padding on either the left or right side based on the tokenizer’s padding configuration. Ensure that the tokenizer is correctly initialized before calling this method.
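Conceptually, the construction can be sketched as a concatenation along the time axis. The shapes below are illustrative; the real method embeds the prefix and postfix strings with the model’s tokenizer and builds the attention mask from the valid lengths:
>>> import torch
>>> B, D = 2, 768
>>> prefix_embeds = torch.rand(B, 3, D)   # embedded prefix tokens (illustrative)
>>> enc_out = torch.rand(B, 10, D)        # encoder output, projected to D
>>> postfix_embeds = torch.rand(B, 2, D)  # embedded postfix tokens (illustrative)
>>> token_embeds = torch.rand(B, 5, D)    # embedded target tokens
>>> inputs_embeds = torch.cat(
...     [prefix_embeds, enc_out, postfix_embeds, token_embeds], dim=1
... )
>>> inputs_embeds.shape
torch.Size([2, 20, 768])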
batch_score(ys: Tensor, states: List[Any], xs: Tensor, speech: Tensor | None = None) → Tuple[Tensor, List[Any]]
Computes the batch scores for a sequence of input tokens.
This method processes the input sequences and calculates the scores for the next token predictions based on the encoder outputs.
- Parameters:
- ys (torch.Tensor) – Tensor of shape (batch_size, sequence_length) containing the input sequences for which scores are to be computed.
- states (List[Any]) – A list of states used to maintain information across decoding steps.
- xs (torch.Tensor) – Tensor of shape (batch_size, feature_size) representing the encoder outputs for the corresponding sequences.
- speech (torch.Tensor, optional) – Optional tensor representing speech inputs. Defaults to None.
- Returns: A tuple containing:
  - next_token_scores (torch.Tensor): Tensor of shape (batch_size, vocab_size) containing the log probabilities of the next tokens.
  - states (List[Any]): The updated list of states after processing the input sequences.
- Return type: Tuple[torch.Tensor, List[Any]]
Examples
>>> decoder = HuggingFaceTransformersDecoder(...)
>>> ys = torch.tensor([[1, 2, 3], [4, 5, 6]])
>>> states = [None, None]
>>> xs = torch.randn(2, 256) # Example encoder outputs
>>> scores, new_states = decoder.batch_score(ys, states, xs)
>>> print(scores.shape) # Should print: torch.Size([2, vocab_size])
NOTE: Ensure that the input tensors are properly padded and formatted before passing them to this method.
- Raises: ValueError – If the input tensors have mismatched dimensions or are not compatible with the model.
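A minimal sketch of driving batch_score step by step, assuming decoder was constructed as in the class example above; the greedy selection is illustrative, as a real beam search would expand multiple hypotheses per step:
>>> ys = torch.tensor([[1], [1]])  # running hypotheses (batch, length)
>>> states = [None, None]          # one state slot per hypothesis
>>> xs = torch.randn(2, 256)       # encoder outputs per hypothesis
>>> for _ in range(3):             # three illustrative decoding steps
...     scores, states = decoder.batch_score(ys, states, xs)
...     next_tokens = scores.argmax(dim=-1, keepdim=True)  # greedy pick
...     ys = torch.cat([ys, next_tokens], dim=1)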
forward(hs_pad: Tensor, hlens: Tensor, ys_in_pad: Tensor, ys_in_lens: Tensor) → Tuple[Tensor, Tensor]
Forward pass of the decoder.
This method processes the encoded memory from the encoder and the input tensor to generate token scores before softmax. It can handle both causal language models and sequence-to-sequence models based on the initialization parameters.
- Parameters:
- hs_pad (torch.Tensor) – Encoded memory from the encoder, of shape (batch, maxlen_in, feat).
- hlens (torch.Tensor) – Lengths of the encoded sequences, of shape (batch,).
- ys_in_pad (torch.Tensor) – Input token ids for the decoder, of shape (batch, maxlen_out).
- ys_in_lens (torch.Tensor) – Lengths of the input sequences, of shape (batch,).
- Returns: A tuple containing:
  - x (torch.Tensor): Decoded token scores before softmax, of shape (batch, maxlen_out, vocab_size).
  - olens (torch.Tensor): Lengths of the output sequences, of shape (batch,).
- Return type: Tuple[torch.Tensor, torch.Tensor]
Examples
>>> decoder = HuggingFaceTransformersDecoder(...)
>>> hs_pad = torch.rand(2, 10, 512) # Example encoded memory
>>> hlens = torch.tensor([10, 8]) # Example lengths
>>> ys_in_pad = torch.randint(0, 1000, (2, 5))  # Example input token ids
>>> ys_in_lens = torch.tensor([5, 4]) # Example lengths
>>> scores, lengths = decoder.forward(hs_pad, hlens, ys_in_pad,
... ys_in_lens)
NOTE: This method assumes that the model has been initialized with appropriate parameters, including whether it is a causal language model or a sequence-to-sequence model.
- Raises: ValueError – If the shapes of the input tensors do not match the expected dimensions.
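Since the returned scores are pre-softmax logits, a typical training step applies a cross-entropy criterion to them. A minimal sketch, reusing the tensors from the example above; the shifted target tensor and the ignore index are illustrative assumptions, not part of this class:
>>> import torch.nn.functional as F
>>> scores, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
>>> ys_out_pad = torch.randint(0, 1000, scores.shape[:2])  # shifted targets (illustrative)
>>> loss = F.cross_entropy(
...     scores.reshape(-1, scores.size(-1)),  # (batch * maxlen_out, vocab)
...     ys_out_pad.reshape(-1),               # (batch * maxlen_out,)
...     ignore_index=-1,                      # assumed padding id
... )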
reload_pretrained_parameters()
Reloads the pretrained parameters for the decoder and language model head.
This method is designed to load the previously saved pretrained parameters for the decoder and its language model head if the load_pretrained_weights attribute is set to True. If loading is skipped, a corresponding log message is generated.
load_pretrained_weights
Indicates whether to load pretrained weights or not.
Type: bool
Raises: Exception – If there are issues loading the pretrained parameters.
Examples
>>> decoder = HuggingFaceTransformersDecoder(
... vocab_size=1000,
... encoder_output_size=512,
... model_name_or_path='gpt2',
... )
>>> decoder.reload_pretrained_parameters()
Loaded pretrained Transformers decoder parameters!
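A typical use is restoring the Hugging Face weights after some downstream (re)initialization has modified the decoder’s parameters; the external init below is purely illustrative:
>>> for p in decoder.parameters():          # some external init clobbers the weights
...     torch.nn.init.normal_(p)
>>> decoder.reload_pretrained_parameters()  # restores the cached pretrained weights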
score(ys, state, x, speech=None)
Scores the next token in a sequence given the current input.
This method computes the score for the next token based on the current state of the decoder and the input sequence. It utilizes the Hugging Face Transformers framework to perform the necessary computations.
- Parameters:
- ys (torch.Tensor) – Input tensor representing the sequence of tokens, of shape (batch_size, sequence_length).
- state (Any) – The current state of the decoder, which may contain context needed for scoring.
- x (torch.Tensor) – The encoder outputs from the previous step, of shape (batch_size, encoder_output_size).
- speech (torch.Tensor, optional) – Optional tensor representing the speech features, if applicable. Defaults to None.
- Returns: A tuple containing:
  - next_token_scores (torch.Tensor): Log probabilities of the next token, of shape (batch_size * num_beams, vocab_size).
  - None: Placeholder for future extension (currently unused).
- Return type: Tuple[torch.Tensor, None]
Examples
>>> decoder = HuggingFaceTransformersDecoder(...)
>>> ys = torch.tensor([[1, 2, 3]]) # Example input tensor
>>> state = None  # Initial decoder state
>>> x = torch.rand(1, 256)  # Encoder output (encoder_output_size=256)
>>> scores, _ = decoder.score(ys, state, x)
>>> print(scores.shape) # (1, vocab_size)
NOTE: This method currently does not implement caching, which could improve performance for successive calls.
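Because no caching is done, every call re-processes the full prefix. A greedy decoding loop sketch, with shapes following the documentation above and the decoder construction elided:
>>> ys = torch.tensor([[1]])  # running hypothesis (batch of 1)
>>> x = torch.rand(1, 256)    # encoder output (encoder_output_size=256)
>>> state = None
>>> for _ in range(3):        # each call recomputes the whole prefix
...     scores, state = decoder.score(ys, state, x)
...     next_token = scores.argmax(dim=-1, keepdim=True)
...     ys = torch.cat([ys, next_token], dim=1)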