espnet2.asr.decoder.transformer_decoder.LightweightConvolutionTransformerDecoder
class espnet2.asr.decoder.transformer_decoder.LightweightConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: bool = False)
Bases: BaseTransformerDecoder
Lightweight Convolution Transformer Decoder.
This class implements a transformer decoder that utilizes lightweight convolution layers in its architecture. It is designed for tasks such as automatic speech recognition (ASR) and can be used as a part of larger neural network models.
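The "lightweight convolution" in this decoder replaces self-attention with a depthwise convolution whose kernel weights are softmax-normalized and shared across groups of channels (the conv_wshare parameter). The sketch below is a pure-Python, single-channel illustration of that normalization idea only, not the ESPnet implementation (which is a batched, multi-channel PyTorch module):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def lightweight_conv(seq, kernel):
    """Causal 1-D convolution with a softmax-normalized kernel.

    seq: one channel of a sequence (list of floats)
    kernel: raw kernel weights (length K); they are softmax-normalized
            before use, the defining trait of lightweight convolution.
    """
    w = softmax(kernel)
    K = len(w)
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k in range(K):
            idx = t - (K - 1) + k  # causal: look back up to K-1 steps
            if 0 <= idx < len(seq):
                acc += w[k] * seq[idx]
        out.append(acc)
    return out
```

Because the kernel sums to one, each output is a convex combination of past inputs; positions near the sequence start see fewer valid taps and so receive smaller values.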
vocab_size
The size of the vocabulary.
- Type: int
encoder_output_size
The output dimension of the encoder.
- Type: int
attention_heads
The number of attention heads for multi-head attention.
- Type: int
linear_units
The number of units in the position-wise feed forward layer.
- Type: int
num_blocks
The number of decoder blocks in the architecture.
- Type: int
dropout_rate
The dropout rate to apply to layers.
- Type: float
positional_dropout_rate
The dropout rate for positional encodings.
- Type: float
self_attention_dropout_rate
The dropout rate for self attention.
- Type: float
src_attention_dropout_rate
The dropout rate for source attention.
- Type: float
input_layer
The type of input layer to use (‘embed’ or ‘linear’).
- Type: str
use_output_layer
Flag indicating whether to use an output layer.
- Type: bool
pos_enc_class
The class used for positional encoding.
normalize_before
Whether to apply layer normalization before the first block.
- Type: bool
concat_after
Whether to concatenate the input and output of the attention layer.
- Type: bool
conv_wshare
The number of shared weights for convolutional layers.
- Type: int
conv_kernel_length
A sequence specifying the kernel length for each convolutional layer.
- Type: Sequence[int]
conv_usebias
Whether to use bias in convolutional layers.
- Type: bool
Parameters:
- vocab_size (int) – The size of the vocabulary.
- encoder_output_size (int) – The output dimension of the encoder.
- attention_heads (int, optional) – The number of attention heads. Defaults to 4.
- linear_units (int, optional) – The number of units in the position-wise feed forward layer. Defaults to 2048.
- num_blocks (int, optional) – The number of decoder blocks. Defaults to 6.
- dropout_rate (float, optional) – The dropout rate. Defaults to 0.1.
- positional_dropout_rate (float, optional) – The dropout rate for positional encodings. Defaults to 0.1.
- self_attention_dropout_rate (float, optional) – The dropout rate for self-attention. Defaults to 0.0.
- src_attention_dropout_rate (float, optional) – The dropout rate for source attention. Defaults to 0.0.
- input_layer (str, optional) – The type of input layer (‘embed’ or ‘linear’). Defaults to ‘embed’.
- use_output_layer (bool, optional) – Flag indicating whether to use an output layer. Defaults to True.
- pos_enc_class – The class used for positional encoding. Defaults to PositionalEncoding.
- normalize_before (bool, optional) – Whether to apply layer normalization before the first block. Defaults to True.
- concat_after (bool, optional) – Whether to concatenate the input and output of the attention layer. Defaults to False.
- conv_wshare (int, optional) – The number of shared weights for convolutional layers. Defaults to 4.
- conv_kernel_length (Sequence[int], optional) – A sequence specifying the kernel length for each convolutional layer. Defaults to (11, 11, 11, 11, 11, 11).
- conv_usebias (bool, optional) – Whether to use bias in convolutional layers. Defaults to False.
Raises: ValueError – If the length of conv_kernel_length does not match num_blocks.
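The constraint behind this ValueError is that one kernel length must be supplied per decoder block. A standalone guard that mirrors the documented behavior (check_conv_kernel_length is a hypothetical helper, not part of the ESPnet API):

```python
from typing import Sequence

def check_conv_kernel_length(conv_kernel_length: Sequence[int], num_blocks: int) -> None:
    """Raise ValueError unless one kernel length is given per decoder block."""
    if len(conv_kernel_length) != num_blocks:
        raise ValueError(
            "conv_kernel_length must have %d entries (one per block), got %d"
            % (num_blocks, len(conv_kernel_length))
        )

check_conv_kernel_length((11,) * 6, 6)  # OK: matches the default num_blocks
```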
Examples
>>> decoder = LightweightConvolutionTransformerDecoder(
... vocab_size=5000,
... encoder_output_size=256,
... num_blocks=6,
... conv_kernel_length=[3, 5, 7, 9, 11, 13]
... )
>>> import torch
>>> hs_pad = torch.randn(32, 100, 256)            # encoder output (batch, time, feat)
>>> hlens = torch.full((32,), 100)                # encoder output lengths
>>> ys_in_pad = torch.randint(0, 5000, (32, 10))  # target tokens (batch, seq_len)
>>> ys_in_lens = torch.full((32,), 10)            # target token lengths
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
NOTE
This implementation is suitable for both training and inference scenarios. The forward method is used to process the input data through the decoder layers.
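During training, the decoder masks future target positions so each step is predicted only from earlier tokens (teacher forcing). A minimal sketch of such a causal ("subsequent") mask, in plain Python rather than the tensor version the library builds internally:

```python
def subsequent_mask(size):
    """Lower-triangular boolean mask: row t may attend to columns <= t."""
    return [[col <= row for col in range(size)] for row in range(size)]
```

Row t of the mask gates the attention scores for decoding step t; everything strictly to the right of the diagonal is masked out.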