espnet2.enh.layers.dprnn.DPRNN_TAC
class espnet2.enh.layers.dprnn.DPRNN_TAC(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)
Bases: Module
Deep dual-path RNN with transform-average-concatenate (TAC) applied to each layer/block.
This class implements a deep dual-path RNN in which a transform-average-concatenate (TAC) module is applied after each dual-path block. Each block runs an intra-segment RNN along ‘dim1’, an inter-segment RNN along ‘dim2’, and a TAC step across the channel dimension, so the model can efficiently process segmented multi-channel input of shape (batch, ch, N, dim1, dim2).
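The per-block data flow can be illustrated with a small, self-contained sketch. This is not the espnet2 implementation: the sub-module shapes (Linear + PReLU for the TAC steps, a projection after each bidirectional LSTM), the residual connections, and the omission of the normalization layers are simplifying assumptions, made only to show how the (batch, ch, N, dim1, dim2) tensor is reshaped at each stage.

```python
import torch
import torch.nn as nn

# Toy sizes; the input follows the (batch, ch, N, dim1, dim2) layout used by forward().
B, ch, N, d1, d2 = 2, 4, 64, 16, 20
H = 128
x = torch.randn(B, ch, N, d1, d2)

# Hypothetical stand-ins for one block of the real model.
intra_rnn = nn.LSTM(N, H, batch_first=True, bidirectional=True)
intra_proj = nn.Linear(2 * H, N)
inter_rnn = nn.LSTM(N, H, batch_first=True, bidirectional=True)
inter_proj = nn.Linear(2 * H, N)
tac_transform = nn.Sequential(nn.Linear(N, 3 * H), nn.PReLU())
tac_average = nn.Sequential(nn.Linear(3 * H, 3 * H), nn.PReLU())
tac_concat = nn.Sequential(nn.Linear(6 * H, N), nn.PReLU())

# 1) Intra-segment RNN: the sequence axis is dim1.
row = x.permute(0, 1, 4, 3, 2).reshape(B * ch * d2, d1, N)
row, _ = intra_rnn(row)
row = intra_proj(row).reshape(B, ch, d2, d1, N).permute(0, 1, 4, 3, 2)
x = x + row  # residual connection (assumed)

# 2) Inter-segment RNN: the sequence axis is dim2.
col = x.permute(0, 1, 3, 4, 2).reshape(B * ch * d1, d2, N)
col, _ = inter_rnn(col)
col = inter_proj(col).reshape(B, ch, d1, d2, N).permute(0, 1, 4, 2, 3)
x = x + col

# 3) TAC across channels (fixed-geometry case): transform each channel,
#    average over channels, concatenate the average back to every channel.
feat = x.permute(0, 1, 3, 4, 2)                              # (B, ch, d1, d2, N)
t = tac_transform(feat)
a = tac_average(t.mean(dim=1, keepdim=True)).expand(-1, ch, -1, -1, -1)
y = tac_concat(torch.cat([t, a], dim=-1))
x = x + y.permute(0, 1, 4, 2, 3)                             # back to (B, ch, N, dim1, dim2)
print(x.shape)  # torch.Size([2, 4, 64, 16, 20])
```

In the real class these stages live in the row_rnn, col_rnn, and ch_* module lists documented below and are repeated num_layers times.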
input_size
Dimension of the input feature.
- Type: int
output_size
Dimension of the output feature.
- Type: int
hidden_size
Dimension of the hidden state.
- Type: int
row_rnn
List of row RNNs for intra-segment processing.
- Type: nn.ModuleList
col_rnn
List of column RNNs for inter-segment processing.
- Type: nn.ModuleList
ch_transform
List of TAC transform layers applied to each channel, one per block.
- Type: nn.ModuleList
ch_average
List of TAC layers applied to the channel-averaged features, one per block.
- Type: nn.ModuleList
ch_concat
List of TAC layers that merge each channel with the averaged features, one per block.
- Type: nn.ModuleList
row_norm
List of normalization layers for row outputs.
- Type: nn.ModuleList
col_norm
List of normalization layers for column outputs.
- Type: nn.ModuleList
ch_norm
List of normalization layers for channel outputs.
- Type: nn.ModuleList
output
Output layer that processes the final output.
- Type: nn.Sequential
Parameters:
- rnn_type (str) – Type of RNN to use. Must be one of ‘RNN’, ‘LSTM’, or ‘GRU’.
- input_size (int) – Dimension of the input feature (the ‘N’ dimension of the input to forward()).
- hidden_size (int) – Dimension of the hidden state.
- output_size (int) – Dimension of the output feature.
- dropout (float) – Dropout ratio. Default is 0.
- num_layers (int) – Number of stacked RNN layers. Default is 1.
- bidirectional (bool) – Whether the RNN layers are bidirectional. Default is True.
Examples
>>> import torch
>>> from espnet2.enh.layers.dprnn import DPRNN_TAC
>>> model = DPRNN_TAC(rnn_type='LSTM', input_size=64, hidden_size=128,
...                   output_size=64, dropout=0.1, num_layers=2)
>>> input_tensor = torch.randn(32, 4, 64, 16, 20)  # (batch, ch, N, dim1, dim2)
>>> num_mic = torch.tensor([2] * 32)  # assume all inputs use 2 microphones
>>> output = model(input_tensor, num_mic)
NOTE
The model supports both fixed geometry arrays and variable geometry arrays based on the num_mic parameter.
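As a rough illustration of that convention (assuming the usual ESPnet behaviour, not spelled out on this page, where an all-zero num_mic signals a fixed-geometry array):

```python
import torch

batch_size = 4
# All-zero num_mic: fixed-geometry array; TAC averages over every channel
# of every batch item.
num_mic_fixed = torch.zeros(batch_size)
# Per-item counts: variable-geometry array; for item b only the first
# num_mic[b] channels are treated as valid when averaging inside TAC.
num_mic_variable = torch.tensor([2, 3, 4, 2])
```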
- Raises: AssertionError – If rnn_type is not one of the supported types (‘RNN’, ‘LSTM’, ‘GRU’).
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input, num_mic)
Forward pass for the DPRNN_TAC model.
This method processes the input through the dual-path RNN with TAC applied to each layer/block. In each block, the model first applies an RNN along the ‘dim1’ dimension, then an RNN along the ‘dim2’ dimension, and finally applies TAC across the channel dimension.
- Parameters:
- input (torch.Tensor) – Input tensor of shape (batch, ch, N, dim1, dim2), where ‘ch’ is the number of channels, ‘N’ is the feature dimension (equal to input_size), and ‘dim1’, ‘dim2’ are the two segment dimensions processed by the intra- and inter-segment RNNs.
- num_mic (torch.Tensor) – A tensor of shape (batch,) indicating the number of microphones used for each batch item.
- Returns: The output tensor of shape (B, ch, output_size, dim1, dim2), where ‘B’ is the batch size.
- Return type: torch.Tensor
Examples
>>> import torch
>>> model = DPRNN_TAC('LSTM', input_size=64, hidden_size=128,
...                   output_size=64)
>>> input_tensor = torch.randn(10, 4, 64, 32, 32)  # (batch, ch, N, dim1, dim2)
>>> num_mic = torch.tensor([2, 2, 1, 3, 2, 1, 2, 3, 1, 2])  # valid mics per item
>>> output = model(input_tensor, num_mic)
>>> output.shape
torch.Size([10, 4, 64, 32, 32])