espnet2.asr.decoder.mlm_decoder.MLMDecoder
class espnet2.asr.decoder.mlm_decoder.MLMDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)
Bases: AbsDecoder
Masked LM Decoder definition for sequence-to-sequence models.
This class implements a masked language model decoder built from multi-head attention and position-wise feed-forward layers. Inputs pass through an embedding or linear input layer with positional encoding, then through a stack of decoder layers, with optional layer normalization and an optional output projection.
embed
The embedding layer that converts input token IDs to embeddings. Can be an embedding layer or a linear layer.
- Type: torch.nn.Sequential
normalize_before
Indicates whether to apply normalization before the decoder layers.
- Type: bool
after_norm
Layer normalization applied after the decoder layers if normalize_before is True.
- Type: LayerNorm, optional
output_layer
Linear layer for output if use_output_layer is True.
- Type: torch.nn.Linear, optional
decoders
A list of decoder layers that process the input embeddings and produce output scores.
- Type: torch.nn.ModuleList
Parameters:
- vocab_size (int) – Size of the vocabulary, including a mask token.
- encoder_output_size (int) – Size of the encoder output features.
- attention_heads (int, optional) – Number of attention heads. Defaults to 4.
- linear_units (int, optional) – Number of units in the feed-forward layers. Defaults to 2048.
- num_blocks (int, optional) – Number of decoder layers. Defaults to 6.
- dropout_rate (float, optional) – Dropout rate for regularization. Defaults to 0.1.
- positional_dropout_rate (float, optional) – Dropout rate for positional encodings. Defaults to 0.1.
- self_attention_dropout_rate (float, optional) – Dropout rate for self attention. Defaults to 0.0.
- src_attention_dropout_rate (float, optional) – Dropout rate for source attention. Defaults to 0.0.
- input_layer (str, optional) – Type of input layer, either “embed” or “linear”. Defaults to “embed”.
- use_output_layer (bool, optional) – Whether to use an output layer. Defaults to True.
- pos_enc_class (type, optional) – Class for positional encoding. Defaults to PositionalEncoding.
- normalize_before (bool, optional) – Whether to normalize inputs before passing them to decoder layers. Defaults to True.
- concat_after (bool, optional) – Whether to concatenate inputs after attention. Defaults to False.
Returns: A tuple containing:
- x (torch.Tensor): Decoded token scores before softmax, shape (batch, maxlen_out, vocab_size), if use_output_layer is True.
- olens (torch.Tensor): Lengths of the output sequences, shape (batch,).
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: ValueError – If the input_layer argument is not “embed” or “linear”.
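The input_layer choice and its ValueError can be illustrated with a small stand-alone sketch. This is not the ESPnet implementation: `build_input_layer`, `input_dim`, and `attention_dim` are hypothetical names used only for illustration, and positional encoding, dropout, and normalization are left out.

```python
import torch


def build_input_layer(input_layer: str, input_dim: int, attention_dim: int) -> torch.nn.Module:
    """Hypothetical helper: pick an input layer by name, or fail loudly."""
    if input_layer == "embed":
        # Token ids (batch, maxlen_out) -> (batch, maxlen_out, attention_dim).
        # Here input_dim plays the role of the vocabulary size.
        return torch.nn.Embedding(input_dim, attention_dim)
    if input_layer == "linear":
        # Feature frames (batch, maxlen_out, input_dim) -> (batch, maxlen_out, attention_dim).
        return torch.nn.Linear(input_dim, attention_dim)
    raise ValueError(f"input_layer must be 'embed' or 'linear', got {input_layer!r}")


embed = build_input_layer("embed", input_dim=100, attention_dim=256)
tokens = torch.randint(0, 100, (2, 15))   # (batch, maxlen_out)
embedded = embed(tokens)                  # (batch, maxlen_out, attention_dim)

try:
    build_input_layer("conv2d", input_dim=100, attention_dim=256)
    raised = False
except ValueError:
    raised = True
```

Failing at construction time, rather than on the first forward call, surfaces a misconfigured input_layer before any data is processed.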
####### Examples
>>> import torch
>>> from espnet2.asr.decoder.mlm_decoder import MLMDecoder
>>> decoder = MLMDecoder(vocab_size=100, encoder_output_size=512)
>>> hs_pad = torch.randn(32, 10, 512) # (batch, maxlen_in, feat)
>>> hlens = torch.tensor([10] * 32) # (batch,)
>>> ys_in_pad = torch.randint(0, 100, (32, 15)) # (batch, maxlen_out)
>>> ys_in_lens = torch.tensor([15] * 32) # (batch,)
>>> output, output_lengths = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
NOTE
This decoder is typically used in conjunction with an encoder in sequence-to-sequence models for tasks such as automatic speech recognition (ASR) and machine translation.
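Because the vocabulary is documented as including a mask token, the decoder input is presumably a token sequence in which some positions have been replaced by that token. The sketch below shows one way such an input could be prepared. The mask id (`vocab_size - 1`) and the masking probability are assumptions made for illustration; check the training code that drives the decoder for the actual convention.

```python
import torch

torch.manual_seed(0)

vocab_size = 100
mask_id = vocab_size - 1   # ASSUMPTION: the last id is reserved for the mask token
mask_prob = 0.3            # ASSUMPTION: illustrative masking probability

# Ground-truth tokens drawn from the non-mask ids, shape (batch, maxlen_out).
ys = torch.randint(0, mask_id, (4, 15))

# Choose positions to hide, then overwrite them with the mask id.
masked_positions = torch.rand(ys.shape) < mask_prob
ys_in_pad = ys.masked_fill(masked_positions, mask_id)
```

The unmasked positions keep their original ids, so the decoder can condition on them while scoring the masked ones.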
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(hs_pad: Tensor, hlens: Tensor, ys_in_pad: Tensor, ys_in_lens: Tensor) → Tuple[Tensor, Tensor]
Forward decoder.
This method performs the forward pass of the masked language model decoder. It takes the encoded memory from the encoder, input token ids, and their respective lengths to produce decoded token scores before softmax and the output lengths.
- Parameters:
- hs_pad (torch.Tensor) – Encoded memory, shape (batch, maxlen_in, feat) with dtype float32.
- hlens (torch.Tensor) – Lengths of the encoded memory, shape (batch,).
- ys_in_pad (torch.Tensor) – Decoder input. If input_layer is “embed”, a tensor of token ids with shape (batch, maxlen_out) and dtype int64; otherwise, a feature tensor with shape (batch, maxlen_out, #mels).
- ys_in_lens (torch.Tensor) – Lengths of the input sequences, shape (batch,).
- Returns: A tuple containing:
- x (torch.Tensor): Decoded token scores before softmax, shape (batch, maxlen_out, token), if use_output_layer is True.
- olens (torch.Tensor): Output lengths, shape (batch,).
- Return type: Tuple[torch.Tensor, torch.Tensor]
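The length tensors hlens and ys_in_lens exist so that padded positions can be excluded from attention. The sketch below shows how lengths are commonly turned into boolean masks, assuming non-causal self-attention as is usual for masked language models. `lengths_to_mask` is a hypothetical helper, not an ESPnet function, and the decoder's internal masking may differ.

```python
import torch


def lengths_to_mask(lengths: torch.Tensor, maxlen: int) -> torch.Tensor:
    """Hypothetical helper: True at valid positions, False at padding."""
    positions = torch.arange(maxlen).unsqueeze(0)   # (1, maxlen)
    return positions < lengths.unsqueeze(1)         # (batch, maxlen)


hlens = torch.tensor([10, 7, 3])
ys_in_lens = torch.tensor([15, 12, 4])

memory_mask = lengths_to_mask(hlens, maxlen=10)      # (batch, maxlen_in)
tgt_valid = lengths_to_mask(ys_in_lens, maxlen=15)   # (batch, maxlen_out)

# Non-causal self-attention mask: position i may attend to position j
# whenever both are real (non-padded) tokens.
tgt_mask = tgt_valid.unsqueeze(1) & tgt_valid.unsqueeze(2)  # (batch, maxlen_out, maxlen_out)
```

Passing maxlen explicitly keeps the mask width equal to the padded tensor width even when no sequence in the batch reaches the full length.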
####### Examples
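>>> import torch
>>> from espnet2.asr.decoder.mlm_decoder import MLMDecoder
>>> decoder = MLMDecoder(vocab_size=100, encoder_output_size=256)
>>> hs_pad = torch.rand(32, 10, 256) # (batch, maxlen_in, feat)
>>> hlens = torch.randint(1, 11, (32,)) # (batch,), values in 1..10
>>> hlens[0] = 10 # keep the longest length equal to maxlen_in
>>> ys_in_pad = torch.randint(0, 100, (32, 15)) # (batch, maxlen_out)
>>> ys_in_lens = torch.randint(1, 16, (32,)) # (batch,), values in 1..15
>>> ys_in_lens[0] = 15 # keep the longest length equal to maxlen_out
>>> output, output_lengths = decoder.forward(hs_pad, hlens, ys_in_pad, ys_in_lens)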
NOTE
Ensure that the input tensor shapes are consistent with the specified dimensions, and that the model has been properly initialized before calling this method.
- Raises: ValueError – If the input_layer is neither “embed” nor “linear”.
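The shape requirements listed for forward can be checked before calling the decoder. The following is a minimal sketch of such a pre-flight check. `check_decoder_inputs` is a hypothetical helper written for this page and is not part of ESPnet.

```python
import torch


def check_decoder_inputs(hs_pad, hlens, ys_in_pad, ys_in_lens):
    """Hypothetical pre-flight check; raises ValueError on inconsistent shapes."""
    if hs_pad.dim() != 3:
        raise ValueError("hs_pad must be (batch, maxlen_in, feat)")
    batch = hs_pad.size(0)
    if ys_in_pad.dim() not in (2, 3) or ys_in_pad.size(0) != batch:
        raise ValueError("ys_in_pad must share the batch dimension of hs_pad")
    for name, lens, padded in (
        ("hlens", hlens, hs_pad),
        ("ys_in_lens", ys_in_lens, ys_in_pad),
    ):
        if lens.shape != (batch,):
            raise ValueError(f"{name} must have shape (batch,)")
        if int(lens.max()) > padded.size(1):
            raise ValueError(f"{name} contains a length longer than the padded tensor")


hs_pad = torch.rand(4, 10, 256)               # (batch, maxlen_in, feat)
hlens = torch.tensor([10, 8, 5, 2])           # (batch,)
ys_in_pad = torch.randint(0, 100, (4, 15))    # (batch, maxlen_out)
ys_in_lens = torch.tensor([15, 9, 9, 1])      # (batch,)
check_decoder_inputs(hs_pad, hlens, ys_in_pad, ys_in_lens)  # passes silently
```

A check like this turns a shape mismatch into a readable error message instead of a failure deep inside the attention layers.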