espnet2.legacy.nets.pytorch_backend.transformer.encoder.Encoder
class espnet2.legacy.nets.pytorch_backend.transformer.encoder.Encoder(idim, attention_dim=256, attention_heads=4, conv_wshare=4, conv_kernel_length='11', conv_usebias=False, linear_units=2048, num_blocks=6, dropout_rate=0.1, positional_dropout_rate=0.1, attention_dropout_rate=0.0, input_layer='conv2d', pos_enc_class=<class 'espnet2.legacy.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before=True, concat_after=False, positionwise_layer_type='linear', positionwise_conv_kernel_size=1, selfattention_layer_type='selfattn', padding_idx=-1, stochastic_depth_rate=0.0, intermediate_layers=None, ctc_softmax=None, conditioning_layer_dim=None)
Bases: Module
Transformer encoder module.
- Parameters:
- idim (int) – Input dimension.
- attention_dim (int) – Dimension of attention.
- attention_heads (int) – The number of heads of multi head attention.
- conv_wshare (int) – The number of convolution kernels. Only used when selfattention_layer_type is “lightconv*” or “dynamicconv*”.
- conv_kernel_length (Union[int, str]) – Kernel size string for the convolutions (e.g. 71_71_71_71_71_71). Only used when selfattention_layer_type is “lightconv*” or “dynamicconv*”.
- conv_usebias (bool) – Whether to use bias in the convolutions. Only used when selfattention_layer_type is “lightconv*” or “dynamicconv*”.
- linear_units (int) – The number of units of position-wise feed forward.
- num_blocks (int) – The number of encoder blocks.
- dropout_rate (float) – Dropout rate.
- positional_dropout_rate (float) – Dropout rate after adding positional encoding.
- attention_dropout_rate (float) – Dropout rate in attention.
- input_layer (Union[str, torch.nn.Module]) – Input layer type.
- pos_enc_class (torch.nn.Module) – Positional encoding module class: `PositionalEncoding` or `ScaledPositionalEncoding`.
- normalize_before (bool) – Whether to use layer_norm before the first block.
- concat_after (bool) – Whether to concatenate the attention layer’s input and output. If True, an additional linear projection is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear is applied, i.e. x -> x + att(x).
- positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
- positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
- selfattention_layer_type (str) – Encoder attention layer type.
- padding_idx (int) – Padding idx for input_layer=embed.
- stochastic_depth_rate (float) – Maximum probability to skip the encoder layer.
- intermediate_layers (Union[List[int], None]) – Indices of intermediate CTC layers (indices start from 1). If not None, intermediate outputs are also returned, which changes the return type of forward().
Construct an Encoder object.
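A minimal construction sketch. The import path is taken from this page; the feature dimension idim=83 and the lightweight-convolution settings are illustrative assumptions, not values mandated by the library.

```python
from espnet2.legacy.nets.pytorch_backend.transformer.encoder import Encoder

# Default self-attention encoder: 6 blocks, 256-dim attention, conv2d front end.
encoder = Encoder(
    idim=83,              # assumed feature size, e.g. 80-dim fbank + 3-dim pitch
    attention_dim=256,
    attention_heads=4,
    linear_units=2048,
    num_blocks=6,
    input_layer="conv2d",
)

# Lightweight-convolution variant (assumed settings): conv_kernel_length gives
# one kernel size per block as an underscore-separated string.
lconv_encoder = Encoder(
    idim=83,
    selfattention_layer_type="lightconv",
    conv_wshare=4,
    conv_kernel_length="11_11_11_11_11_11",
    conv_usebias=False,
)
```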
forward(xs, masks)
Encode input sequence.
- Parameters:
- xs (torch.Tensor) – Input tensor (#batch, time, idim).
- masks (torch.Tensor) – Mask tensor (#batch, 1, time).
- Returns: Output tensor (#batch, time, attention_dim) and mask tensor (#batch, 1, time).
- Return type: Tuple[torch.Tensor, torch.Tensor]
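A forward-pass sketch following the shapes documented above. The encoder construction and the padding-mask recipe are illustrative assumptions.

```python
import torch
from espnet2.legacy.nets.pytorch_backend.transformer.encoder import Encoder

encoder = Encoder(idim=83)  # defaults: 6 blocks, attention_dim=256, conv2d input layer

batch, time, idim = 2, 100, 83
xs = torch.randn(batch, time, idim)

# Padding mask of shape (#batch, 1, time): True marks real frames, False marks padding.
lengths = torch.tensor([100, 60])
masks = (torch.arange(time)[None, :] < lengths[:, None]).unsqueeze(1)

hs, hs_masks = encoder(xs, masks)
# With the default conv2d input layer the time axis is subsampled (roughly by 4),
# so hs is (#batch, time', attention_dim) and hs_masks is (#batch, 1, time').
```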
forward_one_step(xs, masks, *, cache=None)
Encode input frame.
- Parameters:
- xs (torch.Tensor) – Input tensor.
- masks (torch.Tensor) – Mask tensor.
- cache (List[torch.Tensor]) – List of cache tensors.
- Returns: Output tensor, mask tensor, and list of new cache tensors.
- Return type: Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]]
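A brief sketch of forward_one_step, assuming the encoder, xs, and masks from the forward() example above. How the returned cache is threaded into later calls is stated in the comments as an assumption rather than a verified recipe.

```python
# First incremental call: no cache yet. Besides the encoded frames and the
# (possibly subsampled) mask, a list with one cache tensor per encoder block
# is returned.
hs, hs_masks, new_cache = encoder.forward_one_step(xs, masks, cache=None)

print(len(new_cache))  # one entry per encoder block (6 by default)
# On a subsequent call, the returned list can be passed back via the
# keyword-only `cache=` argument so that per-layer states are reused.
```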
get_positionwise_layer(positionwise_layer_type='linear', attention_dim=256, linear_units=2048, dropout_rate=0.1, positionwise_conv_kernel_size=1)
Define positionwise layer.
