espnet2.asr.decoder.transformer_decoder.LightweightConvolution2DTransformerDecoder
class espnet2.asr.decoder.transformer_decoder.LightweightConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=PositionalEncoding, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: bool = False)
Bases: BaseTransformerDecoder
Lightweight Convolution 2D Transformer Decoder.
This class implements a Transformer decoder whose blocks use lightweight 2D convolutions. It inherits from BaseTransformerDecoder and is intended for sequence-to-sequence tasks such as automatic speech recognition.
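A minimal construction sketch is shown below (assuming ESPnet2 and PyTorch are installed; the import path follows the module name above, and the argument values are illustrative only):
>>> from espnet2.asr.decoder.transformer_decoder import (
...     LightweightConvolution2DTransformerDecoder,
... )
>>> decoder = LightweightConvolution2DTransformerDecoder(
...     vocab_size=1000,
...     encoder_output_size=256,
... )
>>> # With the defaults, num_blocks=6 and conv_kernel_length provides one
>>> # kernel length (11) for each of the six decoder blocks.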
vocab_size
Size of the vocabulary.
- Type: int
encoder_output_size
Dimension of the encoder’s output.
- Type: int
attention_heads
Number of attention heads in multi-head attention.
- Type: int
linear_units
Number of units in position-wise feed forward networks.
- Type: int
num_blocks
Number of decoder blocks.
- Type: int
dropout_rate
Dropout rate applied in various layers.
- Type: float
positional_dropout_rate
Dropout rate for positional encoding.
- Type: float
self_attention_dropout_rate
Dropout rate for self-attention.
- Type: float
src_attention_dropout_rate
Dropout rate for source attention.
- Type: float
input_layer
Type of input layer (‘embed’ or ‘linear’).
- Type: str
use_output_layer
Flag to indicate if output layer is used.
- Type: bool
pos_enc_class
Class used for positional encoding.
normalize_before
Flag to indicate if normalization is applied before the first block.
- Type: bool
concat_after
Flag to indicate if the attention layer's input and output are concatenated and passed through an additional linear layer instead of simply added.
- Type: bool
conv_wshare
Number of shared weights for convolution.
- Type: int
conv_kernel_length
Lengths of convolution kernels for each block.
- Type: Sequence[int]
conv_usebias
Flag to indicate if bias is used in convolutions.
- Type: bool
Parameters:
- vocab_size (int) – Size of the vocabulary.
- encoder_output_size (int) – Dimension of the encoder’s output.
- attention_heads (int, optional) – Number of attention heads. Default is 4.
- linear_units (int, optional) – Number of units in position-wise feed forward networks. Default is 2048.
- num_blocks (int, optional) – Number of decoder blocks. Default is 6.
- dropout_rate (float, optional) – Dropout rate. Default is 0.1.
- positional_dropout_rate (float, optional) – Dropout rate for positional encoding. Default is 0.1.
- self_attention_dropout_rate (float, optional) – Dropout rate for self-attention. Default is 0.0.
- src_attention_dropout_rate (float, optional) – Dropout rate for source attention. Default is 0.0.
- input_layer (str, optional) – Type of input layer (‘embed’ or ‘linear’). Default is ‘embed’.
- use_output_layer (bool, optional) – Flag to indicate if output layer is used. Default is True.
- pos_enc_class – Class used for positional encoding. Default is PositionalEncoding.
- normalize_before (bool, optional) – Flag to indicate if normalization is applied before the first block. Default is True.
- concat_after (bool, optional) – Flag to indicate if the attention layer's input and output are concatenated and passed through an additional linear layer instead of simply added. Default is False.
- conv_wshare (int, optional) – Number of shared weights for convolution. Default is 4.
- conv_kernel_length (Sequence[int], optional) – Lengths of convolution kernels for each block. Default is (11, 11, 11, 11, 11, 11).
- conv_usebias (bool, optional) – Flag to indicate if bias is used in convolutions. Default is False.
Raises: ValueError – If the length of conv_kernel_length does not match num_blocks.
Examples
>>> import torch
>>> decoder = LightweightConvolution2DTransformerDecoder(
... vocab_size=1000,
... encoder_output_size=512,
... num_blocks=6,
... conv_kernel_length=[3, 5, 7, 9, 11, 13]
... )
>>> hs_pad = torch.rand(32, 50, 512) # Batch of encoded memory
>>> hlens = torch.tensor([50] * 32) # Lengths of the input
>>> ys_in_pad = torch.randint(0, 1000, (32, 20)) # Input tokens
>>> ys_in_lens = torch.tensor([20] * 32) # Lengths of the output
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
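Beyond the teacher-forced forward pass above, the decoder inherits the incremental scoring interface of BaseTransformerDecoder, which ESPnet's beam search relies on at inference time. The following sketch is illustrative only: it assumes the inherited score(ys, state, x) method, which takes a 1-D prefix of token ids, a decoder state (None at the first step), and a single encoder output of shape (T, D); the start-of-sequence id 0 is an assumption of this example.
>>> memory = hs_pad[0]            # encoder output of one utterance, shape (50, 512)
>>> prefix = torch.tensor([0])    # hypothetical <sos> id
>>> logp, state = decoder.score(prefix, None, memory)
>>> # logp is assumed to hold log-probabilities over the 1000-token vocabulary
>>> next_token = logp.argmax(dim=-1, keepdim=True)
>>> prefix = torch.cat([prefix, next_token])
>>> logp, state = decoder.score(prefix, state, memory)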