espnet2.enh.layers.dptnet.ImprovedTransformerLayer
class espnet2.enh.layers.dptnet.ImprovedTransformerLayer(rnn_type, input_size, att_heads, hidden_size, dropout=0.0, activation='relu', bidirectional=True, norm='gLN')
Bases: Module
Container module of the (improved) Transformer proposed in [1].
This class implements the improved Transformer layer used in the Dual-Path Transformer Network (DPTNet) architecture. It applies multi-head self-attention followed by an RNN-based feed-forward network, where the RNN type (RNN, LSTM, or GRU) is configurable. The layer is designed for applications such as end-to-end monaural speech separation.
Reference: Chen, J., Mao, Q., & Liu, D. (2020). Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation. In Proc. ISCA Interspeech (pp. 2642–2646).
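Conceptually, the layer computes a residual self-attention block followed by a residual RNN-based feed-forward block. The sketch below illustrates this data flow with a plain LayerNorm standing in for the configurable normalization and with illustrative sizes; it is a simplified approximation, not the actual espnet2 implementation:
>>> import torch
>>> import torch.nn as nn
>>> attn = nn.MultiheadAttention(256, 4, batch_first=True)
>>> rnn = nn.LSTM(256, 128, batch_first=True, bidirectional=True)
>>> proj = nn.Linear(2 * 128, 256)  # project RNN output back to input_size
>>> norm1, norm2 = nn.LayerNorm(256), nn.LayerNorm(256)  # stand-in for the layer's configurable norm
>>> x = torch.randn(10, 20, 256)  # (batch, seq, input_size)
>>> mid = norm1(x + attn(x, x, x)[0])       # self-attention + residual
>>> ff = proj(torch.relu(rnn(mid)[0]))      # RNN-based feed-forward
>>> out = norm2(mid + ff)                   # second residual + norm
>>> out.shape
torch.Size([10, 20, 256])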
rnn_type
Type of RNN used (‘RNN’, ‘LSTM’, or ‘GRU’).
- Type: str
att_heads
Number of attention heads.
- Type: int
self_attn
Multi-head self-attention layer.
- Type: nn.MultiheadAttention
dropout
Dropout layer for regularization.
- Type: nn.Dropout
norm_attn
Normalization layer for attention output.
rnn
RNN layer based on specified rnn_type.
- Type: nn.Module
feed_forward
Feed-forward network following the RNN.
- Type: nn.Sequential
norm_ff
Normalization layer for feed-forward output.
- Parameters:
- rnn_type (str) – Select from ‘RNN’, ‘LSTM’, and ‘GRU’.
- input_size (int) – Dimension of the input feature.
- att_heads (int) – Number of attention heads.
- hidden_size (int) – Dimension of the hidden state.
- dropout (float) – Dropout ratio. Default is 0.
- activation (str) – Activation function applied to the output of the RNN. Default is ‘relu’.
- bidirectional (bool , optional) – True for a bidirectional inter-chunk RNN (the intra-chunk RNN is always bidirectional). Default is True.
- norm (str , optional) – Type of normalization to use. Default is ‘gLN’.
####### Examples
>>> import torch
>>> from espnet2.enh.layers.dptnet import ImprovedTransformerLayer
>>> layer = ImprovedTransformerLayer(
... rnn_type='LSTM',
... input_size=256,
... att_heads=4,
... hidden_size=128,
... dropout=0.1,
... activation='relu'
... )
>>> input_tensor = torch.randn(10, 20, 256) # (batch, seq_len, input_size)
>>> output_tensor = layer(input_tensor)
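A unidirectional GRU variant with a different normalization type can be constructed the same way. The values below are illustrative, and ‘cLN’ is assumed to be among the supported norm types described by the norm parameter:
>>> gru_layer = ImprovedTransformerLayer(
...     rnn_type='GRU',
...     input_size=256,
...     att_heads=4,
...     hidden_size=128,
...     bidirectional=False,
...     norm='cLN',  # assumed to be a supported value of the norm parameter
... )
>>> gru_layer(input_tensor).shape
torch.Size([10, 20, 256])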
- Raises: AssertionError – If rnn_type is not one of ‘RNN’, ‘LSTM’, or ‘GRU’.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x, attn_mask=None)
Forward pass through the Improved Transformer Layer.
This method takes the input tensor x, applies self-attention, a feed-forward neural network, and normalization, returning the transformed output.
- Parameters:
- x (torch.Tensor) – Input tensor of shape (batch, seq, input_size).
- attn_mask (torch.Tensor , optional) – Attention mask to prevent attention to certain positions. Default is None.
- Returns: Output tensor of the same shape as input x after applying the transformer layer.
- Return type: torch.Tensor
####### Examples
>>> import torch
>>> layer = ImprovedTransformerLayer('LSTM', 128, 4, 64)
>>> input_tensor = torch.randn(32, 10, 128) # (batch, seq, input_size)
>>> output_tensor = layer(input_tensor)
>>> print(output_tensor.shape) # Should output: torch.Size([32, 10, 128])
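Since the mask is handed to a multi-head self-attention module, a standard (seq, seq) boolean mask in the PyTorch nn.MultiheadAttention convention should apply; the causal mask below is shown as an assumed usage, not a documented guarantee:
>>> causal_mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)  # True = do not attend
>>> masked_output = layer(input_tensor, attn_mask=causal_mask)
>>> masked_output.shape
torch.Size([32, 10, 128])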
NOTE
The input tensor x should have dimensions corresponding to (batch size, sequence length, input size).