espnet2.asr.decoder.transformer_decoder.TransformerDecoder
class espnet2.asr.decoder.transformer_decoder.TransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0, qk_norm: bool = False, use_flash_attn: bool = True)
Bases: BaseTransformerDecoder
Transformer Decoder for sequence-to-sequence tasks.
This class implements a Transformer decoder architecture designed to work in conjunction with a Transformer encoder. The decoder generates output sequences from the encoder's representations using masked self-attention, encoder-decoder (source) attention, and position-wise feed-forward networks.
- Parameters:
- vocab_size (int) – The size of the vocabulary, representing the number of unique tokens in the output.
- encoder_output_size (int) – The dimension of the output from the encoder, which the decoder will attend to.
- attention_heads (int , optional) – The number of attention heads to use in the multi-head attention mechanism. Default is 4.
- linear_units (int , optional) – The number of units in the position-wise feed-forward layer. Default is 2048.
- num_blocks (int , optional) – The number of decoder blocks (layers) to stack. Default is 6.
- dropout_rate (float , optional) – The dropout rate for regularization. Default is 0.1.
- positional_dropout_rate (float , optional) – The dropout rate for positional encoding. Default is 0.1.
- self_attention_dropout_rate (float , optional) – The dropout rate applied to the self-attention mechanism. Default is 0.0.
- src_attention_dropout_rate (float , optional) – The dropout rate applied to the source attention mechanism. Default is 0.0.
- input_layer (str , optional) – The type of input layer to use; either ‘embed’ for embedding layer or ‘linear’ for a linear layer. Default is ‘embed’.
- use_output_layer (bool , optional) – Whether to use an output layer for final token scoring. Default is True.
- pos_enc_class (type , optional) – The class to use for positional encoding, e.g., PositionalEncoding or ScaledPositionalEncoding. Default is PositionalEncoding.
- normalize_before (bool , optional) – Whether to apply layer normalization before each decoder block (pre-norm) rather than after it (post-norm). Default is True.
- concat_after (bool , optional) – Whether to concatenate the input and output of the attention layer before applying an additional linear layer. Default is False.
- layer_drop_rate (float , optional) – The dropout rate for layer dropping. Default is 0.0.
- qk_norm (bool , optional) – Whether to apply normalization to the query-key dot product in attention. Default is False.
- use_flash_attn (bool , optional) – Whether to use flash attention for improved performance. Default is True.
Examples
>>> import torch
>>> from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
>>> decoder = TransformerDecoder(
... vocab_size=5000,
... encoder_output_size=512,
... attention_heads=8,
... linear_units=2048,
... num_blocks=6,
... )
>>> hs_pad = torch.randn(32, 10, 512) # Batch of 32, 10 time steps
>>> hlens = torch.tensor([10] * 32) # All sequences of length 10
>>> ys_in_pad = torch.randint(0, 5000, (32, 15)) # Batch of 32, 15 tokens
>>> ys_in_lens = torch.tensor([15] * 32) # All sequences of length 15
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
NOTE
The decoder expects the encoder’s output to be passed in as hs_pad, along with the lengths of those sequences in hlens. The ys_in_pad contains the input tokens for the decoder, and ys_in_lens provides the lengths of those input sequences.
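When use_output_layer is True, the returned output holds token scores of shape (batch, maxlen_out, vocab_size) and olens the corresponding output lengths, so the forward pass can feed a token-level cross-entropy loss directly. A minimal sketch continuing the example above; ys_out_pad (the right-shifted targets) and the ignore index -1 are hypothetical placeholders for illustration, not part of this API:
>>> import torch.nn.functional as F
>>> ys_out_pad = torch.randint(0, 5000, (32, 15))  # hypothetical shifted targets
>>> loss = F.cross_entropy(
...     output.view(-1, output.size(-1)),  # flatten scores to (batch * maxlen_out, vocab_size)
...     ys_out_pad.view(-1),               # flatten targets to (batch * maxlen_out,)
...     ignore_index=-1,                   # padded target positions would carry -1 in real data
... )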
- Raises: ValueError – If the input_layer argument is not ‘embed’ or ‘linear’.
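The ‘linear’ option and an alternative positional encoding can be combined at construction time; the following sketch assumes ScaledPositionalEncoding from the embedding module referenced in the parameter list above:
>>> from espnet.nets.pytorch_backend.transformer.embedding import ScaledPositionalEncoding
>>> decoder_linear = TransformerDecoder(
...     vocab_size=5000,
...     encoder_output_size=512,
...     input_layer='linear',                    # 'embed' (default) or 'linear'
...     pos_enc_class=ScaledPositionalEncoding,  # learnable scale on the positional encoding
... )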