espnet2.enh.separator.dprnn_separator.DPRNNSeparator
class espnet2.enh.separator.dprnn_separator.DPRNNSeparator(input_dim: int, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0)
Bases: AbsSeparator
DPRNNSeparator is a Dual-Path RNN (DPRNN) based separator for audio signals.
This class implements a Dual-Path RNN architecture for separating audio signals from multiple speakers. It is designed to process complex input features and output the separated signals for each speaker, along with an optional noise estimate.
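The dual-path idea can be illustrated with a small, self-contained sketch (plain NumPy, not ESPnet's implementation): the encoded feature sequence is cut into half-overlapping segments of length `segment_size`, and the stacked RNN layers then alternate between intra-segment and inter-segment passes over the resulting 3-D tensor. The `segment` helper below is hypothetical, for illustration only.

```python
import numpy as np

def segment(x, segment_size):
    """Cut a [T, N] feature sequence into half-overlapping segments.

    Conceptual sketch of DPRNN's segmentation stage (hypothetical helper,
    not ESPnet's code); returns an array of shape [num_segments, segment_size, N].
    """
    hop = segment_size // 2
    T, N = x.shape
    n_seg = int(np.ceil(max(T - segment_size, 0) / hop)) + 1
    pad = (n_seg - 1) * hop + segment_size - T  # zero-pad the tail
    x = np.pad(x, ((0, pad), (0, 0)))
    return np.stack([x[i * hop:i * hop + segment_size] for i in range(n_seg)])

feats = np.random.randn(100, 64)        # [Time, Features]
segs = segment(feats, segment_size=20)
print(segs.shape)                       # (9, 20, 64)
```

With a 50% hop, consecutive segments share half their frames, which is what lets the inter-segment RNN propagate context across the whole sequence.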
num_spk
Number of speakers.
- Type: int
predict_noise
Whether to output the estimated noise signal.
- Type: bool
segment_size
Dual-path segment size.
- Type: int
num_outputs
Number of outputs, including noise if predicted.
- Type: int
dprnn
Instance of the DPRNN model.
- Type: DPRNN
Parameters:
- input_dim (int) – Input feature dimension.
- rnn_type (str, optional) – Type of RNN to use (‘RNN’, ‘LSTM’, ‘GRU’). Default is ‘lstm’.
- bidirectional (bool, optional) – Whether the inter-chunk RNN layers are bidirectional. Default is True.
- num_spk (int, optional) – Number of speakers. Default is 2.
- predict_noise (bool, optional) – Whether to output the estimated noise signal. Default is False.
- nonlinear (str, optional) – Nonlinear function for mask estimation. Choose from ‘relu’, ‘tanh’, ‘sigmoid’. Default is ‘relu’.
- layer (int, optional) – Number of stacked RNN layers. Default is 3.
- unit (int, optional) – Dimension of the hidden state. Default is 512.
- segment_size (int, optional) – Dual-path segment size. Default is 20.
- dropout (float, optional) – Dropout ratio. Default is 0.0.
- Raises: ValueError – If the specified nonlinear function is not supported.
Examples
>>> separator = DPRNNSeparator(input_dim=256, num_spk=2)
>>> input_features = torch.randn(10, 100, 256) # [Batch, Time, Features]
>>> ilens = torch.tensor([100] * 10) # Input lengths
>>> masked, ilens, others = separator(input_features, ilens)
NOTE
The additional argument in the forward method is not used in this model.
- Returns: masked (List[Union[torch.Tensor, ComplexTensor]]) – list of separated signals, one per speaker; ilens (torch.Tensor) – input lengths after processing; others (OrderedDict) – additional predicted data, such as the masks for each speaker.
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward pass of the DPRNN Separator.
This method processes the input features through the Dual-Path RNN (DPRNN) to estimate the masks for the specified number of speakers. It handles both real and complex input tensors.
- Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded feature tensor of shape [B, T, N], where B is the batch size, T is the number of time frames, and N is the number of frequency bins.
- ilens (torch.Tensor) – Input lengths tensor of shape [Batch], containing the lengths of each input sequence.
- additional (Optional[Dict]) – Additional data included in the model. NOTE: This parameter is not used in this model.
- Returns: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict] – A tuple containing:
- masked (List[Union[torch.Tensor, ComplexTensor]]): A list of tensors of shape [(B, T, N), …] where each tensor corresponds to the input multiplied by the estimated mask for each speaker.
- ilens (torch.Tensor): The input lengths tensor of shape (B,).
- others (OrderedDict): A dictionary containing the predicted masks for each speaker, e.g.:
- ‘mask_spk1’: torch.Tensor(Batch, Frames, Freq),
- ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq),
- …,
- ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq).
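As a rough illustration of this return contract (plain NumPy stand-ins; the actual masks come from the DPRNN network, not the random values used here), each estimated mask multiplies the input element-wise to produce one separated signal per speaker:

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, N, num_spk = 2, 50, 16, 2
feature = rng.standard_normal((B, T, N))     # encoded input, shape [B, T, N]
# stand-in masks in (0, 1), as a sigmoid mask head would produce
masks = [1.0 / (1.0 + np.exp(-rng.standard_normal((B, T, N))))
         for _ in range(num_spk)]
masked = [feature * m for m in masks]        # plays the role of `masked`
others = {"mask_spk%d" % (i + 1): m for i, m in enumerate(masks)}
print(len(masked), sorted(others))           # 2 ['mask_spk1', 'mask_spk2']
```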
Examples
>>> separator = DPRNNSeparator(input_dim=512, num_spk=2)
>>> input_tensor = torch.randn(8, 100, 512) # Batch of 8, 100 time frames
>>> ilens = torch.tensor([100] * 8) # All sequences are of length 100
>>> masked, ilens_out, others = separator.forward(input_tensor, ilens)
>>> print(len(masked)) # Should print 2 if num_spk=2
>>> print(others.keys()) # Should include 'mask_spk1' and 'mask_spk2'
NOTE
This implementation supports both real-valued and complex-valued input tensors. If the input tensor is complex, the magnitude is used for processing.
- Raises: ValueError – If an unsupported nonlinear activation function is provided.
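The complex-input behavior noted above can be sketched as follows (a NumPy stand-in, not the model itself): a real-valued mask derived from the magnitude is applied to the complex spectrum, which scales each bin without altering its phase. The mask below is a hypothetical placeholder for what the network would estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
spec = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
mag = np.abs(spec)                  # the mask network only sees the magnitude
mask = mag / (mag.max() + 1e-8)     # stand-in for an estimated mask in [0, 1]
separated = spec * mask             # real-valued mask scales the complex input
# multiplying by a non-negative real mask preserves each bin's phase
print(np.allclose(np.angle(separated), np.angle(spec)))  # True
```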
property num_spk