espnet2.enh.separator.dprnn_separator.DPRNNSeparator
class espnet2.enh.separator.dprnn_separator.DPRNNSeparator(input_dim: int, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0)
Bases: AbsSeparator
DPRNNSeparator is a Dual-Path RNN (DPRNN) based separator for audio signals.
This class implements a Dual-Path RNN architecture for separating audio signals from multiple speakers. It is designed to process complex input features and output the separated signals for each speaker, along with an optional noise estimate.
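The dual-path idea can be illustrated with a small, self-contained sketch (plain NumPy, not ESPnet's implementation): the encoded feature sequence is cut into half-overlapping segments of length `segment_size`, and the stacked RNN layers then alternate between intra-segment and inter-segment passes over the resulting 3-D tensor. The `segment` helper below is hypothetical, for illustration only.

```python
import numpy as np

def segment(x, segment_size):
    """Cut a [T, N] feature sequence into half-overlapping segments.

    Conceptual sketch of DPRNN's segmentation stage (hypothetical helper,
    not ESPnet's code); returns an array of shape [num_segments, segment_size, N].
    """
    hop = segment_size // 2
    T, N = x.shape
    n_seg = int(np.ceil(max(T - segment_size, 0) / hop)) + 1
    pad = (n_seg - 1) * hop + segment_size - T  # zero-pad the tail
    x = np.pad(x, ((0, pad), (0, 0)))
    return np.stack([x[i * hop:i * hop + segment_size] for i in range(n_seg)])

feats = np.random.randn(100, 64)        # [Time, Features]
segs = segment(feats, segment_size=20)
print(segs.shape)                       # (9, 20, 64)
```

With a 50% hop, consecutive segments share half their frames, which is what lets the inter-segment RNN propagate context across the whole sequence.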
num_spk
Number of speakers.
- Type: int
predict_noise
Whether to output the estimated noise signal.
- Type: bool
segment_size
Dual-path segment size.
- Type: int
num_outputs
Number of outputs, including noise if predicted.
- Type: int
dprnn
Instance of the DPRNN model.
- Type: DPRNN
Parameters:
- input_dim (int) – Input feature dimension.
- rnn_type (str, optional) – Type of RNN to use (‘RNN’, ‘LSTM’, ‘GRU’). Default is ‘lstm’.
- bidirectional (bool, optional) – Whether the inter-chunk RNN layers are bidirectional. Default is True.
- num_spk (int, optional) – Number of speakers. Default is 2.
- predict_noise (bool, optional) – Whether to output the estimated noise signal. Default is False.
- nonlinear (str, optional) – Nonlinear function for mask estimation. Choose from ‘relu’, ‘tanh’, ‘sigmoid’. Default is ‘relu’.
- layer (int, optional) – Number of stacked RNN layers. Default is 3.
- unit (int, optional) – Dimension of the hidden state. Default is 512.
- segment_size (int, optional) – Dual-path segment size. Default is 20.
- dropout (float, optional) – Dropout ratio. Default is 0.0.
- Raises: ValueError – If the specified nonlinear function is not supported.
Examples
>>> separator = DPRNNSeparator(input_dim=256, num_spk=2)
>>> input_features = torch.randn(10, 100, 256) # [Batch, Time, Features]
>>> ilens = torch.tensor([100] * 10) # Input lengths
>>> masked, ilens, others = separator(input_features, ilens)
NOTE
The additional argument in the forward method is not used in this model.
- Returns: masked (List[Union[torch.Tensor, ComplexTensor]]) – list of separated signals, one per speaker; ilens (torch.Tensor) – input lengths after processing; others (OrderedDict) – additional predicted data, such as the masks for each speaker.
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward pass of the DPRNN Separator.
This method processes the input features through the Dual-Path RNN (DPRNN) to estimate the masks for the specified number of speakers. It handles both real and complex input tensors.
- Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded feature tensor of shape [B, T, N], where B is the batch size, T is the number of time frames, and N is the number of frequency bins.
- ilens (torch.Tensor) – Input lengths tensor of shape [Batch], containing the lengths of each input sequence.
- additional (Optional[Dict]) – Additional data included in the model. NOTE: This parameter is not used in this model.
- Returns: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict] – A tuple containing:
- masked (List[Union[torch.Tensor, ComplexTensor]]): A list of tensors of shape [(B, T, N), …] where each tensor corresponds to the input multiplied by the estimated mask for each speaker.
- ilens (torch.Tensor): The input lengths tensor of shape (B,).
- others (OrderedDict): A dictionary containing the predicted masks for each speaker, e.g.:
- ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
- ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
- …,
- ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq).
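As a rough illustration of this return contract (plain NumPy stand-ins; the actual masks come from the DPRNN network, not the random values used here), each estimated mask multiplies the input element-wise to produce one separated signal per speaker:

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, N, num_spk = 2, 50, 16, 2
feature = rng.standard_normal((B, T, N))     # encoded input, shape [B, T, N]
# stand-in masks in (0, 1), as a sigmoid mask head would produce
masks = [1.0 / (1.0 + np.exp(-rng.standard_normal((B, T, N))))
         for _ in range(num_spk)]
masked = [feature * m for m in masks]        # plays the role of `masked`
others = {"mask_spk%d" % (i + 1): m for i, m in enumerate(masks)}
print(len(masked), sorted(others))           # 2 ['mask_spk1', 'mask_spk2']
```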
Examples
>>> separator = DPRNNSeparator(input_dim=512, num_spk=2)
>>> input_tensor = torch.randn(8, 100, 512) # Batch of 8, 100 time frames
>>> ilens = torch.tensor([100] * 8) # All sequences are of length 100
>>> masked, ilens_out, others = separator.forward(input_tensor, ilens)
>>> print(len(masked)) # Should print 2 if num_spk=2
>>> print(others.keys()) # Should include 'mask_spk1' and 'mask_spk2'
NOTE
This implementation supports both real-valued and complex-valued input tensors. If the input tensor is complex, the magnitude is used for processing.
- Raises: ValueError – If an unsupported nonlinear activation function is provided.
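The complex-input behavior noted above can be sketched as follows (a NumPy stand-in, not the model itself): a real-valued mask derived from the magnitude is applied to the complex spectrum, which scales each bin without altering its phase. The mask below is a hypothetical placeholder for what the network would estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
spec = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
mag = np.abs(spec)                  # the mask network only sees the magnitude
mask = mag / (mag.max() + 1e-8)     # stand-in for an estimated mask in [0, 1]
separated = spec * mask             # real-valued mask scales the complex input
# multiplying by a non-negative real mask preserves each bin's phase
print(np.allclose(np.angle(separated), np.angle(spec)))  # True
```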
property num_spk