espnet2.asr.decoder.transformer_decoder.DynamicConvolutionTransformerDecoder
class espnet2.asr.decoder.transformer_decoder.DynamicConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=PositionalEncoding, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: bool = False)
Bases: BaseTransformerDecoder
Dynamic Convolution Transformer Decoder for sequence generation.
This class implements the dynamic convolution mechanism (Wu et al., 2019, “Pay Less Attention with Lightweight and Dynamic Convolutions”) in the Transformer decoder: the self-attention sub-layer of each decoder block is replaced by a dynamic convolution whose kernel weights are predicted from the layer input, while multi-head attention over the encoder output is retained. It is designed for sequence generation tasks such as automatic speech recognition (ASR).
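For quick orientation, a minimal construction sketch (assuming espnet2 is installed; the sizes are illustrative). Each entry of conv_kernel_length configures the dynamic convolution of one decoder block, so the sequence must have one entry per block:
>>> from espnet2.asr.decoder.transformer_decoder import (
...     DynamicConvolutionTransformerDecoder,
... )
>>> decoder = DynamicConvolutionTransformerDecoder(
...     vocab_size=1000,
...     encoder_output_size=256,
...     num_blocks=3,
...     conv_kernel_length=(5, 7, 11),  # one kernel size per block
... )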
vocab_size
Size of the vocabulary for output tokens.
- Type: int
encoder_output_size
Size of the encoder’s output features.
- Type: int
attention_heads
Number of attention heads for multi-head attention.
- Type: int
linear_units
Number of units in the position-wise feed-forward layer.
- Type: int
num_blocks
Number of decoder blocks to stack.
- Type: int
dropout_rate
Dropout rate for regularization.
- Type: float
positional_dropout_rate
Dropout rate for positional encoding.
- Type: float
self_attention_dropout_rate
Dropout rate for self-attention.
- Type: float
src_attention_dropout_rate
Dropout rate for source attention.
- Type: float
input_layer
Type of input layer (‘embed’ or ‘linear’).
- Type: str
use_output_layer
Flag to determine if an output layer is used.
- Type: bool
pos_enc_class
Class for positional encoding.
normalize_before
Flag to apply layer normalization before each block (pre-norm) rather than after (post-norm).
- Type: bool
concat_after
Flag to concatenate the attention layer’s input and output and apply an additional linear projection, instead of a plain residual connection.
- Type: bool
conv_wshare
Number of weight-sharing groups (heads) for dynamic convolution.
- Type: int
conv_kernel_length
Length of the convolution kernels for each block.
- Type: Sequence[int]
conv_usebias
Flag to use bias in convolution operations.
- Type: bool
Parameters:
- vocab_size (int) – Size of the vocabulary for output tokens.
- encoder_output_size (int) – Size of the encoder’s output features.
- attention_heads (int, optional) – Number of attention heads for multi-head attention. Defaults to 4.
- linear_units (int, optional) – Number of units in the position-wise feed-forward layer. Defaults to 2048.
- num_blocks (int, optional) – Number of decoder blocks to stack. Defaults to 6.
- dropout_rate (float, optional) – Dropout rate for regularization. Defaults to 0.1.
- positional_dropout_rate (float, optional) – Dropout rate for positional encoding. Defaults to 0.1.
- self_attention_dropout_rate (float, optional) – Dropout rate for self-attention. Defaults to 0.0.
- src_attention_dropout_rate (float, optional) – Dropout rate for source attention. Defaults to 0.0.
- input_layer (str, optional) – Type of input layer (‘embed’ or ‘linear’). Defaults to ‘embed’.
- use_output_layer (bool, optional) – Flag to determine if an output layer is used. Defaults to True.
- pos_enc_class – Class used for positional encoding. Defaults to PositionalEncoding.
- normalize_before (bool, optional) – Flag to apply layer normalization before each block (pre-norm) rather than after (post-norm). Defaults to True.
- concat_after (bool, optional) – Flag to concatenate the attention layer’s input and output and apply an additional linear projection, instead of a plain residual connection. Defaults to False.
- conv_wshare (int, optional) – Number of weight-sharing groups (heads) for dynamic convolution. Defaults to 4.
- conv_kernel_length (Sequence[int], optional) – Kernel length of the dynamic convolution in each block; must contain one entry per block. Defaults to (11, 11, 11, 11, 11, 11).
- conv_usebias (bool, optional) – Flag to use bias in convolution operations. Defaults to False.
Raises: ValueError – If the length of conv_kernel_length does not match num_blocks.
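A hedged illustration of the constraint behind the ValueError above; the explicit check and message here are illustrative, not part of the API:
>>> num_blocks, kernels = 4, (3, 5, 7)
>>> if len(kernels) != num_blocks:
...     raise ValueError(
...         f"conv_kernel_length must have {num_blocks} entries, got {len(kernels)}"
...     )
Traceback (most recent call last):
    ...
ValueError: conv_kernel_length must have 4 entries, got 3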
Examples
>>> import torch
>>> decoder = DynamicConvolutionTransformerDecoder(
... vocab_size=1000,
... encoder_output_size=256,
... num_blocks=6,
... conv_kernel_length=(3, 5, 7, 9, 11, 13)
... )
>>> hs_pad = torch.randn(32, 50, 256) # Example encoded memory
>>> hlens = torch.tensor([50] * 32) # Lengths of encoded memory
>>> ys_in_pad = torch.randint(0, 1000, (32, 30)) # Input tokens
>>> ys_in_lens = torch.tensor([30] * 32) # Lengths of input tokens
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
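Building on the example above, the returned logits can feed a standard cross-entropy training loss. This is a sketch: ys_out_pad, the teacher-forcing targets shifted by one position, is assumed to be prepared by the caller, and padding handling is omitted for brevity:
>>> ys_out_pad = torch.randint(0, 1000, (32, 30))  # shifted targets (assumed)
>>> loss = torch.nn.functional.cross_entropy(
...     output.reshape(-1, 1000), ys_out_pad.reshape(-1)
... )
>>> loss.shape
torch.Size([])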
NOTE
The dynamic convolution allows for varying kernel sizes and shared weights across different decoder layers, enhancing model flexibility and capacity.
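For step-wise decoding, e.g. inside beam search, the batch_score interface inherited from BaseTransformerDecoder can be used. A sketch, assuming the decoder and hs_pad from the example above; each hypothesis starts with a None state, and logp holds next-token log-probabilities:
>>> ys = torch.randint(0, 1000, (32, 5))  # current hypothesis prefixes
>>> states = [None] * 32                  # one decoder state per hypothesis
>>> logp, states = decoder.batch_score(ys, states, hs_pad)
>>> logp.shape
torch.Size([32, 1000])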