espnet2.asr.decoder.transformer_decoder.LightweightConvolutionTransformerDecoder
class espnet2.asr.decoder.transformer_decoder.LightweightConvolutionTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: bool = False)
Bases: BaseTransformerDecoder
Lightweight Convolution Transformer Decoder.
This class implements a transformer decoder that utilizes lightweight convolution layers in its architecture. It is designed for tasks such as automatic speech recognition (ASR) and can be used as a part of larger neural network models.
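The "lightweight convolution" in this decoder replaces self-attention with a depthwise convolution whose kernel weights are softmax-normalized and shared across groups of channels (the conv_wshare parameter). The sketch below is a pure-Python, single-channel illustration of that normalization idea only, not the ESPnet implementation (which is a batched, multi-channel PyTorch module):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def lightweight_conv(seq, kernel):
    """Causal 1-D convolution with a softmax-normalized kernel.

    seq: one channel of a sequence (list of floats)
    kernel: raw kernel weights (length K); they are softmax-normalized
            before use, the defining trait of lightweight convolution.
    """
    w = softmax(kernel)
    K = len(w)
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k in range(K):
            idx = t - (K - 1) + k  # causal: look back up to K-1 steps
            if 0 <= idx < len(seq):
                acc += w[k] * seq[idx]
        out.append(acc)
    return out
```

Because the kernel sums to one, each output is a convex combination of past inputs; positions near the sequence start see fewer valid taps and so receive smaller values.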
vocab_size
The size of the vocabulary.
- Type: int
encoder_output_size
The output dimension of the encoder.
- Type: int
attention_heads
The number of attention heads for multi-head attention.
- Type: int
linear_units
The number of units in the position-wise feed forward layer.
- Type: int
num_blocks
The number of decoder blocks in the architecture.
- Type: int
dropout_rate
The dropout rate to apply to layers.
- Type: float
positional_dropout_rate
The dropout rate for positional encodings.
- Type: float
self_attention_dropout_rate
The dropout rate for self attention.
- Type: float
src_attention_dropout_rate
The dropout rate for source attention.
- Type: float
input_layer
The type of input layer to use (‘embed’ or ‘linear’).
- Type: str
use_output_layer
Flag indicating whether to use an output layer.
- Type: bool
pos_enc_class
The class used for positional encoding.
normalize_before
Whether to apply layer normalization before the first block.
- Type: bool
concat_after
Whether to concatenate the input and output of the attention layer.
- Type: bool
conv_wshare
The number of shared weights for convolutional layers.
- Type: int
conv_kernel_length
A sequence specifying the kernel length for each convolutional layer.
- Type: Sequence[int]
conv_usebias
Whether to use bias in convolutional layers.
- Type: bool
Parameters:
- vocab_size (int) – The size of the vocabulary.
- encoder_output_size (int) – The output dimension of the encoder.
- attention_heads (int, optional) – The number of attention heads. Defaults to 4.
- linear_units (int, optional) – The number of units in the position-wise feed forward layer. Defaults to 2048.
- num_blocks (int, optional) – The number of decoder blocks. Defaults to 6.
- dropout_rate (float, optional) – The dropout rate. Defaults to 0.1.
- positional_dropout_rate (float, optional) – The dropout rate for positional encodings. Defaults to 0.1.
- self_attention_dropout_rate (float, optional) – The dropout rate for self-attention. Defaults to 0.0.
- src_attention_dropout_rate (float, optional) – The dropout rate for source attention. Defaults to 0.0.
- input_layer (str, optional) – The type of input layer (‘embed’ or ‘linear’). Defaults to ‘embed’.
- use_output_layer (bool, optional) – Flag indicating whether to use an output layer. Defaults to True.
- pos_enc_class – The class used for positional encoding. Defaults to PositionalEncoding.
- normalize_before (bool, optional) – Whether to apply layer normalization before the first block. Defaults to True.
- concat_after (bool, optional) – Whether to concatenate the input and output of the attention layer. Defaults to False.
- conv_wshare (int, optional) – The number of shared weights for convolutional layers. Defaults to 4.
- conv_kernel_length (Sequence[int], optional) – A sequence specifying the kernel length for each convolutional layer. Defaults to (11, 11, 11, 11, 11, 11).
- conv_usebias (bool, optional) – Whether to use bias in convolutional layers. Defaults to False.
Raises: ValueError – If the length of conv_kernel_length does not match num_blocks.
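The constraint behind this ValueError is that one kernel length must be supplied per decoder block. A standalone guard that mirrors the documented behavior (check_conv_kernel_length is a hypothetical helper, not part of the ESPnet API):

```python
from typing import Sequence

def check_conv_kernel_length(conv_kernel_length: Sequence[int], num_blocks: int) -> None:
    """Raise ValueError unless one kernel length is given per decoder block."""
    if len(conv_kernel_length) != num_blocks:
        raise ValueError(
            "conv_kernel_length must have %d entries (one per block), got %d"
            % (num_blocks, len(conv_kernel_length))
        )

check_conv_kernel_length((11,) * 6, 6)  # OK: matches the default num_blocks
```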
Examples
>>> decoder = LightweightConvolutionTransformerDecoder(
... vocab_size=5000,
... encoder_output_size=256,
... num_blocks=6,
... conv_kernel_length=[3, 5, 7, 9, 11, 13]
... )
>>> import torch
>>> hs_pad = torch.randn(32, 100, 256)            # encoder output (batch, time, feat)
>>> hlens = torch.full((32,), 100)                # encoder output lengths
>>> ys_in_pad = torch.randint(0, 5000, (32, 10))  # target tokens (batch, seq_len)
>>> ys_in_lens = torch.full((32,), 10)            # target token lengths
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
NOTE
This implementation is suitable for both training and inference scenarios. The forward method is used to process the input data through the decoder layers.
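During training, the decoder masks future target positions so each step is predicted only from earlier tokens (teacher forcing). A minimal sketch of such a causal ("subsequent") mask, in plain Python rather than the tensor version the library builds internally:

```python
def subsequent_mask(size):
    """Lower-triangular boolean mask: row t may attend to columns <= t."""
    return [[col <= row for col in range(size)] for row in range(size)]
```

Row t of the mask gates the attention scores for decoding step t; everything strictly to the right of the diagonal is masked out.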