espnet2.asr.decoder.transformer_decoder.LightweightConvolution2DTransformerDecoder
class espnet2.asr.decoder.transformer_decoder.LightweightConvolution2DTransformerDecoder(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=PositionalEncoding, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: bool = False)
Bases: BaseTransformerDecoder
Lightweight Convolution 2D Transformer Decoder.
This class implements a Transformer decoder whose blocks use lightweight 2D convolutions. It inherits from BaseTransformerDecoder and is intended for sequence-to-sequence tasks such as automatic speech recognition.
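A minimal construction sketch is shown below (assuming ESPnet2 and PyTorch are installed; the import path follows the module name above, and the argument values are illustrative only):
>>> from espnet2.asr.decoder.transformer_decoder import (
...     LightweightConvolution2DTransformerDecoder,
... )
>>> decoder = LightweightConvolution2DTransformerDecoder(
...     vocab_size=1000,
...     encoder_output_size=256,
... )
>>> # With the defaults, num_blocks=6 and conv_kernel_length provides one
>>> # kernel length (11) for each of the six decoder blocks.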
vocab_size
Size of the vocabulary.
- Type: int
encoder_output_size
Dimension of the encoder’s output.
- Type: int
attention_heads
Number of attention heads in multi-head attention.
- Type: int
linear_units
Number of units in position-wise feed forward networks.
- Type: int
num_blocks
Number of decoder blocks.
- Type: int
dropout_rate
Dropout rate applied in various layers.
- Type: float
positional_dropout_rate
Dropout rate for positional encoding.
- Type: float
self_attention_dropout_rate
Dropout rate for self-attention.
- Type: float
src_attention_dropout_rate
Dropout rate for source attention.
- Type: float
input_layer
Type of input layer (‘embed’ or ‘linear’).
- Type: str
use_output_layer
Flag to indicate if output layer is used.
- Type: bool
pos_enc_class
Class used for positional encoding.
normalize_before
Flag to indicate if normalization is applied before the first block.
- Type: bool
concat_after
Flag to indicate if the attention layer's input and output are concatenated and passed through an additional linear layer instead of simply added.
- Type: bool
conv_wshare
Number of shared weights for convolution.
- Type: int
conv_kernel_length
Lengths of convolution kernels for each block.
- Type: Sequence[int]
conv_usebias
Flag to indicate if bias is used in convolutions.
- Type: bool
Parameters:
- vocab_size (int) – Size of the vocabulary.
- encoder_output_size (int) – Dimension of the encoder’s output.
- attention_heads (int, optional) – Number of attention heads. Default is 4.
- linear_units (int, optional) – Number of units in position-wise feed forward networks. Default is 2048.
- num_blocks (int, optional) – Number of decoder blocks. Default is 6.
- dropout_rate (float, optional) – Dropout rate. Default is 0.1.
- positional_dropout_rate (float, optional) – Dropout rate for positional encoding. Default is 0.1.
- self_attention_dropout_rate (float, optional) – Dropout rate for self-attention. Default is 0.0.
- src_attention_dropout_rate (float, optional) – Dropout rate for source attention. Default is 0.0.
- input_layer (str, optional) – Type of input layer (‘embed’ or ‘linear’). Default is ‘embed’.
- use_output_layer (bool, optional) – Flag to indicate if output layer is used. Default is True.
- pos_enc_class – Class used for positional encoding. Default is PositionalEncoding.
- normalize_before (bool, optional) – Flag to indicate if normalization is applied before the first block. Default is True.
- concat_after (bool, optional) – Flag to indicate if the attention layer's input and output are concatenated and passed through an additional linear layer instead of simply added. Default is False.
- conv_wshare (int, optional) – Number of shared weights for convolution. Default is 4.
- conv_kernel_length (Sequence[int], optional) – Lengths of convolution kernels for each block. Default is (11, 11, 11, 11, 11, 11).
- conv_usebias (bool, optional) – Flag to indicate if bias is used in convolutions. Default is False.
Raises: ValueError – If the length of conv_kernel_length does not match num_blocks.
Examples
>>> import torch
>>> decoder = LightweightConvolution2DTransformerDecoder(
... vocab_size=1000,
... encoder_output_size=512,
... num_blocks=6,
... conv_kernel_length=[3, 5, 7, 9, 11, 13]
... )
>>> hs_pad = torch.rand(32, 50, 512) # Batch of encoded memory
>>> hlens = torch.tensor([50] * 32) # Lengths of the input
>>> ys_in_pad = torch.randint(0, 1000, (32, 20)) # Input tokens
>>> ys_in_lens = torch.tensor([20] * 32) # Lengths of the output
>>> output, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
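Beyond the teacher-forced forward pass above, the decoder inherits the incremental scoring interface of BaseTransformerDecoder, which ESPnet's beam search relies on at inference time. The following sketch is illustrative only: it assumes the inherited score(ys, state, x) method, which takes a 1-D prefix of token ids, a decoder state (None at the first step), and a single encoder output of shape (T, D); the start-of-sequence id 0 is an assumption of this example.
>>> memory = hs_pad[0]            # encoder output of one utterance, shape (50, 512)
>>> prefix = torch.tensor([0])    # hypothetical <sos> id
>>> logp, state = decoder.score(prefix, None, memory)
>>> # logp is assumed to hold log-probabilities over the 1000-token vocabulary
>>> next_token = logp.argmax(dim=-1, keepdim=True)
>>> prefix = torch.cat([prefix, next_token])
>>> logp, state = decoder.score(prefix, state, memory)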