espnet2.asr.decoder.transformer_decoder.DynamicConvolutionTransformerDecoder
class espnet2.asr.decoder.transformer_decoder.DynamicConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=PositionalEncoding, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: bool = False)
Bases: BaseTransformerDecoder
Dynamic Convolution Transformer Decoder for sequence generation.
This class implements the dynamic convolution mechanism (Wu et al., 2019, “Pay Less Attention with Lightweight and Dynamic Convolutions”) in the Transformer decoder: the self-attention sub-layer of each decoder block is replaced by a dynamic convolution whose kernel weights are predicted from the layer input, while multi-head attention over the encoder output is retained. It is designed for sequence generation tasks such as automatic speech recognition (ASR).
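For quick orientation, a minimal construction sketch (assuming espnet2 is installed; the sizes are illustrative). Each entry of conv_kernel_length configures the dynamic convolution of one decoder block, so the sequence must have one entry per block:
>>> from espnet2.asr.decoder.transformer_decoder import (
...     DynamicConvolutionTransformerDecoder,
... )
>>> decoder = DynamicConvolutionTransformerDecoder(
...     vocab_size=1000,
...     encoder_output_size=256,
...     num_blocks=3,
...     conv_kernel_length=(5, 7, 11),  # one kernel size per block
... )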
vocab_size
Size of the vocabulary for output tokens.
- Type: int
encoder_output_size
Size of the encoder’s output features.
- Type: int
attention_heads
Number of attention heads for multi-head attention.
- Type: int
linear_units
Number of units in the position-wise feed-forward layer.
- Type: int
num_blocks
Number of decoder blocks to stack.
- Type: int
dropout_rate
Dropout rate for regularization.
- Type: float
positional_dropout_rate
Dropout rate for positional encoding.
- Type: float
self_attention_dropout_rate
Dropout rate for self-attention.
- Type: float
src_attention_dropout_rate
Dropout rate for source attention.
- Type: float
input_layer
Type of input layer (‘embed’ or ‘linear’).
- Type: str
use_output_layer
Flag to determine if an output layer is used.
- Type: bool
pos_enc_class
Class for positional encoding.
normalize_before
Flag to apply layer normalization before each block (pre-norm) rather than after (post-norm).
- Type: bool
concat_after
Flag to concatenate the attention layer’s input and output and apply an additional linear projection, instead of a plain residual connection.
- Type: bool
conv_wshare
Number of weight-sharing groups (heads) for dynamic convolution.
- Type: int
conv_kernel_length
Length of the convolution kernels for each block.
- Type: Sequence[int]
conv_usebias
Flag to use bias in convolution operations.
- Type: bool
Parameters:
- vocab_size (int) – Size of the vocabulary for output tokens.
- encoder_output_size (int) – Size of the encoder’s output features.
- attention_heads (int, optional) – Number of attention heads for multi-head attention. Defaults to 4.
- linear_units (int, optional) – Number of units in the position-wise feed-forward layer. Defaults to 2048.
- num_blocks (int, optional) – Number of decoder blocks to stack. Defaults to 6.
- dropout_rate (float, optional) – Dropout rate for regularization. Defaults to 0.1.
- positional_dropout_rate (float, optional) – Dropout rate for positional encoding. Defaults to 0.1.
- self_attention_dropout_rate (float, optional) – Dropout rate for self-attention. Defaults to 0.0.
- src_attention_dropout_rate (float, optional) – Dropout rate for source attention. Defaults to 0.0.
- input_layer (str, optional) – Type of input layer (‘embed’ or ‘linear’). Defaults to ‘embed’.
- use_output_layer (bool, optional) – Flag to determine if an output layer is used. Defaults to True.
- pos_enc_class – Class used for positional encoding. Defaults to PositionalEncoding.
- normalize_before (bool, optional) – Flag to apply layer normalization before each block (pre-norm) rather than after (post-norm). Defaults to True.
- concat_after (bool, optional) – Flag to concatenate the attention layer’s input and output and apply an additional linear projection, instead of a plain residual connection. Defaults to False.
- conv_wshare (int, optional) – Number of weight-sharing groups (heads) for dynamic convolution. Defaults to 4.
- conv_kernel_length (Sequence[int], optional) – Kernel length of the dynamic convolution in each block; must contain one entry per block. Defaults to (11, 11, 11, 11, 11, 11).
- conv_usebias (bool, optional) – Flag to use bias in convolution operations. Defaults to False.
Raises: ValueError – If the length of conv_kernel_length does not match num_blocks.
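A hedged illustration of the constraint behind the ValueError above; the explicit check and message here are illustrative, not part of the API:
>>> num_blocks, kernels = 4, (3, 5, 7)
>>> if len(kernels) != num_blocks:
...     raise ValueError(
...         f"conv_kernel_length must have {num_blocks} entries, got {len(kernels)}"
...     )
Traceback (most recent call last):
    ...
ValueError: conv_kernel_length must have 4 entries, got 3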
Examples
>>> import torch
>>> decoder = DynamicConvolutionTransformerDecoder(
... vocab_size=1000,
... encoder_output_size=256,
... num_blocks=6,
... conv_kernel_length=(3, 5, 7, 9, 11, 13)
... )
>>> hs_pad = torch.randn(32, 50, 256) # Example encoded memory
>>> hlens = torch.tensor([50] * 32) # Lengths of encoded memory
>>> ys_in_pad = torch.randint(0, 1000, (32, 30)) # Input tokens
>>> ys_in_lens = torch.tensor([30] * 32) # Lengths of input tokens
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
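Building on the example above, the returned logits can feed a standard cross-entropy training loss. This is a sketch: ys_out_pad, the teacher-forcing targets shifted by one position, is assumed to be prepared by the caller, and padding handling is omitted for brevity:
>>> ys_out_pad = torch.randint(0, 1000, (32, 30))  # shifted targets (assumed)
>>> loss = torch.nn.functional.cross_entropy(
...     output.reshape(-1, 1000), ys_out_pad.reshape(-1)
... )
>>> loss.shape
torch.Size([])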
NOTE
The dynamic convolution allows for varying kernel sizes and shared weights across different decoder layers, enhancing model flexibility and capacity.
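For step-wise decoding, e.g. inside beam search, the batch_score interface inherited from BaseTransformerDecoder can be used. A sketch, assuming the decoder and hs_pad from the example above; each hypothesis starts with a None state, and logp holds next-token log-probabilities:
>>> ys = torch.randint(0, 1000, (32, 5))  # current hypothesis prefixes
>>> states = [None] * 32                  # one decoder state per hypothesis
>>> logp, states = decoder.batch_score(ys, states, hs_pad)
>>> logp.shape
torch.Size([32, 1000])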