espnet2.asr.encoder.transformer_encoder_multispkr.TransformerEncoder
class espnet2.asr.encoder.transformer_encoder_multispkr.TransformerEncoder(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, num_blocks_sd: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str | None = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, num_inf: int = 1)
Bases: AbsEncoder
Transformer encoder module for automatic speech recognition (ASR).
This class implements a Transformer-based encoder that processes input features for ASR tasks. It supports multiple speaker-dependent encoder blocks and various input layer types, including linear and convolutional subsampling.
_output_size
The output dimension of the encoder.
Type: int
Parameters:
- input_size (int) – Input dimension of the features.
- output_size (int , optional) – Dimension of attention (default is 256).
- attention_heads (int , optional) – Number of heads in multi-head attention (default is 4).
- linear_units (int , optional) – Number of units in the position-wise feed forward network (default is 2048).
- num_blocks (int , optional) – Number of recognition encoder blocks (default is 6).
- num_blocks_sd (int , optional) – Number of speaker-dependent encoder blocks (default is 6).
- dropout_rate (float , optional) – Dropout rate (default is 0.1).
- positional_dropout_rate (float , optional) – Dropout rate after adding positional encoding (default is 0.1).
- attention_dropout_rate (float , optional) – Dropout rate in attention (default is 0.0).
- input_layer (str , optional) – Type of input layer (default is “conv2d”).
- pos_enc_class – Class for positional encoding (default is PositionalEncoding).
- normalize_before (bool , optional) – Whether to apply layer normalization before the first block (default is True).
- concat_after (bool , optional) – Whether to concatenate input and output of attention layer (default is False).
- positionwise_layer_type (str , optional) – Type of position-wise layer (“linear” or “conv1d”, default is “linear”).
- positionwise_conv_kernel_size (int , optional) – Kernel size for position-wise conv1d layer (default is 1).
- padding_idx (int , optional) – Padding index used when input_layer=“embed” (default is -1).
- num_inf (int , optional) – Number of inference outputs (default is 1).
######### Examples
>>> encoder = TransformerEncoder(input_size=80)
>>> xs_pad = torch.randn(10, 50, 80) # (B, L, D)
>>> ilens = torch.tensor([50] * 10) # Lengths of input sequences
>>> output, olens, _ = encoder(xs_pad, ilens)
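A minimal multi-speaker sketch, assuming num_inf controls the number of output heads as described in Parameters; the shapes follow the Returns description below, with the subsampled length L’ determined by the “conv2d” input layer:
>>> encoder = TransformerEncoder(input_size=80, output_size=256, num_inf=2)
>>> xs_pad = torch.randn(4, 100, 80)          # (B, L, D)
>>> ilens = torch.tensor([100, 90, 80, 70])   # per-utterance lengths
>>> output, olens, _ = encoder(xs_pad, ilens)
>>> output.shape                              # (B=4, num_inf=2, L', output_size=256)
>>> olens.shape                               # (B=4, num_inf=2)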
- Returns:
- Encoded output tensor of shape (B, num_inf, L, output_size).
- Output lengths tensor of shape (B, num_inf).
- Placeholder for future use (currently None).
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises:
- ValueError – If an unknown input layer type is provided.
- TooShortUttError – If the input tensor is too short for the selected subsampling method.
####### NOTE
This encoder is designed for ASR tasks and may require additional components for full model integration.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Process input tensor through the transformer encoder.
This method takes an input tensor and applies the transformer encoder layers, producing an output tensor that represents the encoded features of the input. The method also generates a mask to identify padded elements in the input tensor.
- Parameters:
- xs_pad – A tensor of shape (B, L, D) representing the input sequences, where B is the batch size, L is the sequence length, and D is the feature dimension.
- ilens – A tensor of shape (B) containing the lengths of the input sequences (without padding).
- prev_states – An optional tensor for previous states, currently not used in this implementation.
- Returns:
- A tensor of shape (B, num_inf, L, output_size) containing the encoded features for each inference output.
- A tensor of shape (B, num_inf) with the lengths of the output sequences.
- An optional tensor (currently None) reserved for future use.
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
- Raises: TooShortUttError – If the input sequence is too short for the selected subsampling method.
######### Examples
>>> encoder = TransformerEncoder(input_size=128)
>>> xs_pad = torch.randn(10, 20, 128) # Batch of 10 sequences
>>> ilens = torch.tensor([20] * 10) # All sequences have length 20
>>> output, olens, _ = encoder.forward(xs_pad, ilens)
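Per-speaker encodings can be sliced from the stacked output; a sketch assuming the (B, num_inf, L, output_size) layout described under Returns (num_inf is 1 for the default configuration above):
>>> for spk in range(output.size(1)):
...     enc_spk = output[:, spk]  # (B, L', output_size) for one output head
...     print(enc_spk.shape)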
####### NOTE
This method assumes that the input sequences have been properly preprocessed and padded. The encoder will not function correctly if the input tensor dimensions do not match the expected shapes.
output_size() → int
Return the output size of the encoder.
This method returns the output dimension of the TransformerEncoder, i.e. the attention dimension (output_size) specified during initialization.
- Returns: The output size of the TransformerEncoder.
- Return type: int
######### Examples
>>> encoder = TransformerEncoder(input_size=128, output_size=256)
>>> encoder.output_size()
256
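A sketch of sizing a downstream module from the encoder dimension; the linear projection here is illustrative and not part of this class:
>>> proj = torch.nn.Linear(encoder.output_size(), 512)  # in_features matches the encoder's output_size (256)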
####### NOTE
The output size is crucial for ensuring that the dimensions match during the attention computations within the encoder layers.