espnet2.asr.decoder.transformer_decoder.DynamicConvolution2DTransformerDecoder
class espnet2.asr.decoder.transformer_decoder.DynamicConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=PositionalEncoding, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: bool = False)
Bases: BaseTransformerDecoder
Dynamic Convolution 2D Transformer Decoder.
This class implements a Transformer decoder in which the self-attention of each block is replaced by a dynamic convolution with 2D kernels, while source attention over the encoder output is retained. It is suitable for sequence-to-sequence tasks, particularly automatic speech recognition (ASR).
- Parameters:
- vocab_size (int) – The size of the vocabulary (number of output tokens).
- encoder_output_size (int) – The output dimension of the encoder.
- attention_heads (int, optional) – The number of attention heads. Defaults to 4.
- linear_units (int, optional) – The number of units in the position-wise feed-forward layer. Defaults to 2048.
- num_blocks (int, optional) – The number of decoder blocks. Defaults to 6.
- dropout_rate (float, optional) – The dropout rate. Defaults to 0.1.
- positional_dropout_rate (float, optional) – The dropout rate for positional encoding. Defaults to 0.1.
- self_attention_dropout_rate (float, optional) – The dropout rate for self-attention. Defaults to 0.0.
- src_attention_dropout_rate (float, optional) – The dropout rate for source attention. Defaults to 0.0.
- input_layer (str, optional) – Type of input layer, either ‘embed’ or ‘linear’. Defaults to ‘embed’.
- use_output_layer (bool, optional) – Whether to use an output layer. Defaults to True.
- pos_enc_class – The class used for positional encoding. Defaults to PositionalEncoding.
- normalize_before (bool, optional) – Whether to apply layer normalization before each block (pre-LN) rather than after. Defaults to True.
- concat_after (bool, optional) – Whether to concatenate the input and output of the attention layer instead of adding them. Defaults to False.
- conv_wshare (int, optional) – The number of weight-sharing groups for the dynamic convolution. Defaults to 4.
- conv_kernel_length (Sequence[int], optional) – The convolution kernel length for each decoder block; must contain num_blocks entries. Defaults to (11, 11, 11, 11, 11, 11).
- conv_usebias (bool, optional) – Whether to use bias in the convolution layers. Defaults to False.
- Raises:
- ValueError – If the length of conv_kernel_length does not match num_blocks.
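The documented constraint can be checked up front; a minimal sketch (values chosen for illustration):

>>> num_blocks = 6
>>> conv_kernel_length = (3, 5, 7, 9, 11, 13)
>>> assert len(conv_kernel_length) == num_blocks  # a mismatch raises ValueError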
Examples
>>> import torch
>>> from espnet2.asr.decoder.transformer_decoder import DynamicConvolution2DTransformerDecoder
>>> decoder = DynamicConvolution2DTransformerDecoder(
... vocab_size=1000,
... encoder_output_size=256,
... num_blocks=6,
... conv_kernel_length=(3, 5, 7, 9, 11, 13)
... )
>>> hs_pad = torch.randn(32, 50, 256) # (batch, maxlen_in, feat)
>>> hlens = torch.randint(1, 51, (32,)) # (batch)
>>> ys_in_pad = torch.randint(0, 1000, (32, 20)) # (batch, maxlen_out)
>>> ys_in_lens = torch.randint(1, 21, (32,)) # (batch)
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
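With use_output_layer=True, the returned scores are pre-softmax token logits, so for the inputs above the expected shapes are:

>>> output.shape  # (batch, maxlen_out, vocab_size)
torch.Size([32, 20, 1000])
>>> olens.shape  # (batch,)
torch.Size([32])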
NOTE
This decoder is designed to be paired with an encoder whose output is supplied as hs_pad; encoder_output_size must match that encoder's output dimension.
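For step-wise decoding, the forward_one_step interface inherited from BaseTransformerDecoder can be used. Below is a minimal greedy-decoding sketch reusing the decoder from the example above; the start-token id (999) and the step count are illustrative assumptions, and the exact forward_one_step signature may vary across ESPnet versions:

>>> from espnet.nets.pytorch_backend.transformer.mask import subsequent_mask
>>> memory = torch.randn(1, 50, 256)  # encoder output (batch, maxlen_in, feat)
>>> ys = torch.tensor([[999]])  # hypothetical <sos> token id
>>> for _ in range(10):  # decode 10 steps greedily
...     mask = subsequent_mask(ys.size(1), device=ys.device).unsqueeze(0)
...     logp, _ = decoder.forward_one_step(ys, mask, memory)
...     ys = torch.cat([ys, logp.argmax(dim=-1, keepdim=True)], dim=1)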