espnet2.enh.layers.dprnn.DPRNN_TAC
class espnet2.enh.layers.dprnn.DPRNN_TAC(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)
Bases: Module
Deep dual-path RNN with transform-average-concatenate (TAC) applied to each layer/block.
This class implements a deep dual-path RNN in which a transform-average-concatenate (TAC) module is applied after each dual-path block. Each block runs an intra-segment RNN along ‘dim1’, an inter-segment RNN along ‘dim2’, and a TAC step across the channel dimension, so the model can efficiently process segmented multi-channel input of shape (batch, ch, N, dim1, dim2).
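The per-block data flow can be illustrated with a small, self-contained sketch. This is not the espnet2 implementation: the sub-module shapes (Linear + PReLU for the TAC steps, a projection after each bidirectional LSTM), the residual connections, and the omission of the normalization layers are simplifying assumptions, made only to show how the (batch, ch, N, dim1, dim2) tensor is reshaped at each stage.

```python
import torch
import torch.nn as nn

# Toy sizes; the input follows the (batch, ch, N, dim1, dim2) layout used by forward().
B, ch, N, d1, d2 = 2, 4, 64, 16, 20
H = 128
x = torch.randn(B, ch, N, d1, d2)

# Hypothetical stand-ins for one block of the real model.
intra_rnn = nn.LSTM(N, H, batch_first=True, bidirectional=True)
intra_proj = nn.Linear(2 * H, N)
inter_rnn = nn.LSTM(N, H, batch_first=True, bidirectional=True)
inter_proj = nn.Linear(2 * H, N)
tac_transform = nn.Sequential(nn.Linear(N, 3 * H), nn.PReLU())
tac_average = nn.Sequential(nn.Linear(3 * H, 3 * H), nn.PReLU())
tac_concat = nn.Sequential(nn.Linear(6 * H, N), nn.PReLU())

# 1) Intra-segment RNN: the sequence axis is dim1.
row = x.permute(0, 1, 4, 3, 2).reshape(B * ch * d2, d1, N)
row, _ = intra_rnn(row)
row = intra_proj(row).reshape(B, ch, d2, d1, N).permute(0, 1, 4, 3, 2)
x = x + row  # residual connection (assumed)

# 2) Inter-segment RNN: the sequence axis is dim2.
col = x.permute(0, 1, 3, 4, 2).reshape(B * ch * d1, d2, N)
col, _ = inter_rnn(col)
col = inter_proj(col).reshape(B, ch, d1, d2, N).permute(0, 1, 4, 2, 3)
x = x + col

# 3) TAC across channels (fixed-geometry case): transform each channel,
#    average over channels, concatenate the average back to every channel.
feat = x.permute(0, 1, 3, 4, 2)                              # (B, ch, d1, d2, N)
t = tac_transform(feat)
a = tac_average(t.mean(dim=1, keepdim=True)).expand(-1, ch, -1, -1, -1)
y = tac_concat(torch.cat([t, a], dim=-1))
x = x + y.permute(0, 1, 4, 2, 3)                             # back to (B, ch, N, dim1, dim2)
print(x.shape)  # torch.Size([2, 4, 64, 16, 20])
```

In the real class these stages live in the row_rnn, col_rnn, and ch_* module lists documented below and are repeated num_layers times.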
input_size
Dimension of the input feature.
- Type: int
output_size
Dimension of the output feature.
- Type: int
hidden_size
Dimension of the hidden state.
- Type: int
row_rnn
List of row RNNs for intra-segment processing.
- Type: nn.ModuleList
col_rnn
List of column RNNs for inter-segment processing.
- Type: nn.ModuleList
ch_transform
List of TAC transform layers applied to each channel, one per block.
- Type: nn.ModuleList
ch_average
List of TAC layers applied to the channel-averaged features, one per block.
- Type: nn.ModuleList
ch_concat
List of TAC layers that merge each channel with the averaged features, one per block.
- Type: nn.ModuleList
row_norm
List of normalization layers for row outputs.
- Type: nn.ModuleList
col_norm
List of normalization layers for column outputs.
- Type: nn.ModuleList
ch_norm
List of normalization layers for channel outputs.
- Type: nn.ModuleList
output
Output layer that processes the final output.
- Type: nn.Sequential
Parameters:
- rnn_type (str) – Type of RNN to use. Must be one of ‘RNN’, ‘LSTM’, or ‘GRU’.
- input_size (int) – Dimension of the input feature (the ‘N’ dimension of the input to forward()).
- hidden_size (int) – Dimension of the hidden state.
- output_size (int) – Dimension of the output feature.
- dropout (float) – Dropout ratio. Default is 0.
- num_layers (int) – Number of stacked RNN layers. Default is 1.
- bidirectional (bool) – Whether the RNN layers are bidirectional. Default is True.
Examples
>>> import torch
>>> from espnet2.enh.layers.dprnn import DPRNN_TAC
>>> model = DPRNN_TAC(rnn_type='LSTM', input_size=64, hidden_size=128,
...                   output_size=64, dropout=0.1, num_layers=2)
>>> input_tensor = torch.randn(32, 4, 64, 16, 20)  # (batch, ch, N, dim1, dim2)
>>> num_mic = torch.tensor([2] * 32)  # assume all inputs use 2 microphones
>>> output = model(input_tensor, num_mic)
NOTE
The model supports both fixed geometry arrays and variable geometry arrays based on the num_mic parameter.
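As a rough illustration of that convention (assuming the usual ESPnet behaviour, not spelled out on this page, where an all-zero num_mic signals a fixed-geometry array):

```python
import torch

batch_size = 4
# All-zero num_mic: fixed-geometry array; TAC averages over every channel
# of every batch item.
num_mic_fixed = torch.zeros(batch_size)
# Per-item counts: variable-geometry array; for item b only the first
# num_mic[b] channels are treated as valid when averaging inside TAC.
num_mic_variable = torch.tensor([2, 3, 4, 2])
```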
- Raises: AssertionError – If rnn_type is not one of the supported types (‘RNN’, ‘LSTM’, ‘GRU’).
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input, num_mic)
Forward pass for the DPRNN_TAC model.
This method processes the input through the dual-path RNN with TAC applied to each layer/block. In each block, the model first applies an RNN along the ‘dim1’ dimension, then an RNN along the ‘dim2’ dimension, and finally applies TAC across the channel dimension.
- Parameters:
- input (torch.Tensor) – Input tensor of shape (batch, ch, N, dim1, dim2), where ‘ch’ is the number of channels, ‘N’ is the feature dimension (equal to input_size), and ‘dim1’, ‘dim2’ are the two segment dimensions processed by the intra- and inter-segment RNNs.
- num_mic (torch.Tensor) – A tensor of shape (batch,) indicating the number of microphones used for each batch item.
- Returns: The output tensor of shape (B, ch, output_size, dim1, dim2), where ‘B’ is the batch size.
- Return type: torch.Tensor
Examples
>>> import torch
>>> model = DPRNN_TAC('LSTM', input_size=64, hidden_size=128,
...                   output_size=64)
>>> input_tensor = torch.randn(10, 4, 64, 32, 32)  # (batch, ch, N, dim1, dim2)
>>> num_mic = torch.tensor([2, 2, 1, 3, 2, 1, 2, 3, 1, 2])  # valid mics per item
>>> output = model(input_tensor, num_mic)
>>> output.shape
torch.Size([10, 4, 64, 32, 32])