espnet2.asr.decoder.mlm_decoder.MLMDecoder


class espnet2.asr.decoder.mlm_decoder.MLMDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)

Bases: AbsDecoder

Masked LM Decoder definition for sequence-to-sequence models.

This class implements a masked language model decoder built from a stack of decoder layers, each combining multi-head attention with a position-wise feed-forward network. Inputs are embedded together with a positional encoding, and layer normalization is applied according to normalize_before.

embed

The embedding layer that converts input token IDs to embeddings. Can be an embedding layer or a linear layer.

Type: torch.nn.Sequential

normalize_before

Indicates whether to apply normalization before the decoder layers.

Type: bool

after_norm

Layer normalization applied after the decoder layers if normalize_before is True.

Type: LayerNorm, optional

output_layer

Linear layer for output if use_output_layer is True.

Type: torch.nn.Linear, optional

decoders

A list of decoder layers that process the input embeddings and produce output scores.

Type: torch.nn.ModuleList
Parameters:
- vocab_size (int) – Size of the vocabulary, including a mask token.
- encoder_output_size (int) – Size of the encoder output features.
- attention_heads (int , optional) – Number of attention heads. Defaults to 4.
- linear_units (int , optional) – Number of units in the feed-forward layers. Defaults to 2048.
- num_blocks (int , optional) – Number of decoder layers. Defaults to 6.
- dropout_rate (float , optional) – Dropout rate for regularization. Defaults to 0.1.
- positional_dropout_rate (float , optional) – Dropout rate for positional encodings. Defaults to 0.1.
- self_attention_dropout_rate (float , optional) – Dropout rate for self attention. Defaults to 0.0.
- src_attention_dropout_rate (float , optional) – Dropout rate for source attention. Defaults to 0.0.
- input_layer (str , optional) – Type of input layer, either “embed” or “linear”. Defaults to “embed”.
- use_output_layer (bool , optional) – Whether to use an output layer. Defaults to True.
- pos_enc_class (type , optional) – Class for positional encoding. Defaults to PositionalEncoding.
- normalize_before (bool , optional) – Whether to normalize inputs before passing them to decoder layers. Defaults to True.
- concat_after (bool , optional) – Whether to concatenate inputs after attention. Defaults to False.
Returns: A tuple containing:
- x (torch.Tensor): Decoded token scores before softmax, shape (batch, maxlen_out, vocab_size), if use_output_layer is True.
- olens (torch.Tensor): Lengths of the output sequences, shape (batch,).
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: ValueError – If the input_layer argument is not “embed” or “linear”.
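
The two input_layer options imply different input ranks for the decoder input. The sketch below restates that rule in plain Python; `expected_input_ndim` is a hypothetical helper written for illustration and is not part of ESPnet.

```python
def expected_input_ndim(input_layer):
    """Return the number of dimensions ys_in_pad is expected to have.

    Hypothetical helper: it only mirrors the documented rule that
    input_layer must be "embed" or "linear".
    """
    if input_layer == "embed":
        return 2  # (batch, maxlen_out) integer token ids
    if input_layer == "linear":
        return 3  # (batch, maxlen_out, feat) float features
    raise ValueError(f"input_layer must be 'embed' or 'linear', got {input_layer!r}")


print(expected_input_ndim("embed"))
print(expected_input_ndim("linear"))
```

Any other value, such as "conv2d", raises ValueError in this sketch, matching the documented behavior of the constructor.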

####### Examples

>>> import torch
>>> from espnet2.asr.decoder.mlm_decoder import MLMDecoder
>>> decoder = MLMDecoder(vocab_size=100, encoder_output_size=512)
>>> hs_pad = torch.randn(32, 10, 512)  # (batch, maxlen_in, feat)
>>> hlens = torch.tensor([10] * 32)     # (batch)
>>> ys_in_pad = torch.randint(0, 100, (32, 15))  # (batch, maxlen_out)
>>> ys_in_lens = torch.tensor([15] * 32)  # (batch)
>>> output, output_lengths = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)

NOTE

This decoder is typically used in conjunction with an encoder in sequence-to-sequence models for tasks such as automatic speech recognition (ASR) and machine translation.
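
A masked LM decoder is normally fed target sequences in which some tokens have been replaced by the mask token. The following is a minimal pure-Python sketch of that preparation step. The helper `mask_tokens`, the choice of the last vocabulary index as the mask id, and the number of masked positions are all assumptions made for illustration, not ESPnet's own masking routine.

```python
import random


def mask_tokens(tokens, mask_id, num_mask, rng):
    """Return a copy of tokens with num_mask random positions set to mask_id.

    Hypothetical helper for illustration only.
    """
    if not 0 <= num_mask <= len(tokens):
        raise ValueError("num_mask must be between 0 and len(tokens)")
    positions = set(rng.sample(range(len(tokens)), num_mask))
    masked = [mask_id if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


vocab_size = 100          # assumed to already include the mask token
mask_id = vocab_size - 1  # assumption: the mask token takes the last index
tokens = [5, 17, 42, 8, 23, 61]

masked, positions = mask_tokens(tokens, mask_id, 2, random.Random(0))
print(masked, positions)
```

The masked sequence would then play the role of ys_in_pad, and the decoder's scores at the masked positions are the ones of interest.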


forward(hs_pad: Tensor, hlens: Tensor, ys_in_pad: Tensor, ys_in_lens: Tensor) → Tuple[Tensor, Tensor]

Forward decoder.

This method performs the forward pass of the masked language model decoder. It takes the encoded memory from the encoder, input token ids, and their respective lengths to produce decoded token scores before softmax and the output lengths.

Parameters:
- hs_pad (torch.Tensor) – Encoded memory, shape (batch, maxlen_in, feat) with dtype float32.
- hlens (torch.Tensor) – Lengths of the encoded memory, shape (batch).
- ys_in_pad (torch.Tensor) – Decoder input. If input_layer is “embed”, a tensor of token ids with shape (batch, maxlen_out) and dtype int64; otherwise, a tensor of features with shape (batch, maxlen_out, #mels).
- ys_in_lens (torch.Tensor) – Lengths of the input sequences, shape (batch).
Returns: A tuple containing:
- x (torch.Tensor): Decoded token scores before softmax, shape (batch, maxlen_out, token), only if use_output_layer is True.
- olens (torch.Tensor): Output lengths, shape (batch,).
Return type: Tuple[torch.Tensor, torch.Tensor]
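
The length arguments hlens and ys_in_lens tell the decoder which positions of a padded batch are real and which are padding. The sketch below shows that relationship with plain Python lists. `pad_mask` is a hypothetical stand-in written for this page; ESPnet has its own mask utilities, and this is not their implementation.

```python
def pad_mask(lengths, maxlen=None):
    """Build a padding mask: True marks padded positions, False marks real ones.

    Hypothetical helper for illustration only.
    """
    if maxlen is None:
        maxlen = max(lengths)
    if max(lengths) > maxlen:
        raise ValueError("a length exceeds the padded dimension")
    return [[pos >= n for pos in range(maxlen)] for n in lengths]


lengths = [3, 1, 2]  # per-sequence lengths, analogous to ys_in_lens
mask = pad_mask(lengths, maxlen=3)
for row in mask:
    print(row)

# Counting the unpadded positions recovers the original lengths.
valid = [row.count(False) for row in mask]
print(valid)
```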

####### Examples

>>> import torch
>>> from espnet2.asr.decoder.mlm_decoder import MLMDecoder
>>> decoder = MLMDecoder(vocab_size=100, encoder_output_size=256)
>>> hs_pad = torch.rand(32, 10, 256)  # (batch, maxlen_in, feat)
>>> hlens = torch.randint(1, 11, (32,))  # (batch), values in [1, 10]
>>> ys_in_pad = torch.randint(0, 100, (32, 15))  # (batch, maxlen_out)
>>> ys_in_lens = torch.randint(1, 16, (32,))  # (batch), values in [1, 15]
>>> ys_in_lens[0] = 15  # a padded batch has at least one full-length sequence
>>> output, output_lengths = decoder.forward(hs_pad, hlens, ys_in_pad, ys_in_lens)

NOTE

The feature dimension of hs_pad must equal the encoder_output_size passed to the constructor, all four inputs must share the same batch size, and the values in hlens and ys_in_lens must not exceed maxlen_in and maxlen_out respectively.
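
A shape check of this kind can be done before calling forward. The following is a minimal sketch that works on plain tuples and lists; `check_forward_shapes` is a hypothetical helper, and the specific checks are assumptions derived from the parameter descriptions above rather than validation that ESPnet performs.

```python
def check_forward_shapes(hs_shape, hlens, ys_shape, ys_in_lens, encoder_output_size):
    """Return a list of human-readable problems; an empty list means consistent.

    Hypothetical helper for illustration only.
    """
    problems = []
    batch, maxlen_in, feat = hs_shape
    if feat != encoder_output_size:
        problems.append(f"feat {feat} != encoder_output_size {encoder_output_size}")
    if ys_shape[0] != batch:
        problems.append("hs_pad and ys_in_pad have different batch sizes")
    if len(hlens) != batch or len(ys_in_lens) != batch:
        problems.append("length vectors do not match the batch size")
    if hlens and max(hlens) > maxlen_in:
        problems.append("a value in hlens exceeds maxlen_in")
    if ys_in_lens and max(ys_in_lens) > ys_shape[1]:
        problems.append("a value in ys_in_lens exceeds maxlen_out")
    return problems


# Shapes taken from the example above: batch 32, maxlen_in 10, maxlen_out 15.
ok = check_forward_shapes((32, 10, 256), [10] * 32, (32, 15), [15] * 32, 256)
bad = check_forward_shapes((32, 10, 512), [10] * 32, (32, 15), [15] * 32, 256)
print(ok)
print(bad)
```

With real tensors, the shape tuples would come from `tuple(hs_pad.shape)` and the lengths from `hlens.tolist()`.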

Raises: ValueError – If the input_layer is neither “embed” nor “linear”.