espnet2.enh.separator.dpcl_separator.DPCLSeparator
class espnet2.enh.separator.dpcl_separator.DPCLSeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)
Bases: AbsSeparator
Deep Clustering Separator.
This class implements the Deep Clustering method for source separation: a recurrent neural network (RNN) learns an embedding for each time-frequency bin, and masks for the individual sources are estimated from those embeddings. It is designed for mixtures of multiple speakers.
References
[1] Deep clustering: Discriminative embeddings for segmentation and separation; John R. Hershey et al., 2016; https://ieeexplore.ieee.org/document/7471631
[2] Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex; Tanaka, K. et al., 2021; https://www.isca-speech.org/archive/interspeech_2021/tanaka21_interspeech.html
- Parameters:
- input_dim (int) – Input feature dimension.
- rnn_type (str , optional) – RNN type, select from ‘blstm’, ‘lstm’, etc. Defaults to ‘blstm’.
- num_spk (int , optional) – Number of speakers. Defaults to 2.
- nonlinear (str , optional) – Nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’. Defaults to ‘tanh’.
- layer (int , optional) – Number of stacked RNN layers. Defaults to 2.
- unit (int , optional) – Dimension of the hidden state. Defaults to 512.
- emb_D (int , optional) – Dimension of the feature vector for a tf-bin. Defaults to 40.
- dropout (float , optional) – Dropout ratio. Defaults to 0.0.
- Raises: ValueError – If nonlinear is not one of ‘sigmoid’, ‘relu’, or ‘tanh’ (see the sketch below).
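A minimal sketch of how this validation might look, assuming a plain mapping from the ‘nonlinear’ string to a torch activation module (a hypothetical helper for illustration, not the actual ESPnet source):

import torch

# Hypothetical helper mirroring the documented behavior: map the
# 'nonlinear' string to an activation module, or raise ValueError
# for unsupported choices.
def select_nonlinear(name: str) -> torch.nn.Module:
    choices = {
        "sigmoid": torch.nn.Sigmoid(),
        "relu": torch.nn.ReLU(),
        "tanh": torch.nn.Tanh(),
    }
    if name not in choices:
        raise ValueError("Not supporting nonlinear={}".format(name))
    return choices[name]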
Examples
>>> import torch
>>> from espnet2.enh.separator.dpcl_separator import DPCLSeparator
>>> separator = DPCLSeparator(input_dim=80, num_spk=2)
>>> input_tensor = torch.randn(10, 100, 80)  # batch of 10, 100 time steps, 80 freq bins
>>> ilens = torch.tensor([100] * 10)  # all sequences have length 100
>>> masked, ilens_out, others = separator(input_tensor, ilens)
num_spk
The number of speakers that the separator is configured for.
- Type: int
NOTE
The ‘additional’ argument of the forward method is not used in this model.
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward pass for the DPCLSeparator.
This method processes the input features through the model and generates separated outputs for each speaker. It employs a recurrent neural network to encode the input features and applies K-means clustering to estimate the speaker masks.
- Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded feature of shape [B, T, F], where B is the batch size, T is the number of time frames, and F is the number of frequency bins. This can be either a standard tensor or a complex tensor.
- ilens (torch.Tensor) – Input lengths of shape [Batch], indicating the actual lengths of the sequences in the batch.
- additional (Optional[Dict]) – Other data included in the model. This argument is currently not used in this model.
- Returns: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]:
- masked (List[Union[torch.Tensor, ComplexTensor]]): A list of tensors of shape [(B, T, N), …], one per speaker, containing the separated (masked) features.
- ilens (torch.Tensor): Tensor of shape (B,) containing the lengths of the input sequences after processing.
- others (OrderedDict): Contains additional predicted data, such as:
- ‘tf_embedding’: learned embedding of all T-F bins, of shape (B, T * F, D).
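As described above, the separation masks are obtained by clustering these T-F embeddings. A minimal sketch of that step, assuming a naive k-means over the (B, T * F, D) embedding (a hypothetical illustration; the actual ESPnet model applies its own clustering internally):

import torch

def embeddings_to_masks(tf_embedding, num_spk, t_frames, f_bins, n_iter=10):
    # tf_embedding: (B, T * F, D), as returned in others['tf_embedding'].
    # Returns one binary (T, F) mask per speaker for each batch item.
    batch_masks = []
    for emb in tf_embedding:  # emb: (T * F, D)
        # Naive k-means: random init, then alternate assignment/update.
        centers = emb[torch.randperm(emb.size(0))[:num_spk]].clone()
        for _ in range(n_iter):
            assign = torch.cdist(emb, centers).argmin(dim=1)  # (T * F,)
            for k in range(num_spk):
                members = emb[assign == k]
                if members.numel() > 0:
                    centers[k] = members.mean(dim=0)
        # One binary mask per cluster, reshaped back to (T, F).
        batch_masks.append(
            [(assign == k).float().view(t_frames, f_bins) for k in range(num_spk)]
        )
    return batch_masks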
Examples
>>> import torch
>>> from espnet2.enh.separator.dpcl_separator import DPCLSeparator
>>> separator = DPCLSeparator(input_dim=80, num_spk=2)
>>> input_tensor = torch.randn(4, 100, 80)  # example input
>>> input_lengths = torch.tensor([100, 100, 80, 60])  # lengths
>>> masked, lengths, others = separator.forward(input_tensor, input_lengths)
NOTE
This method currently does not utilize the ‘additional’ argument.
- Raises: ValueError – If the ‘nonlinear’ activation function passed at construction is not one of ‘sigmoid’, ‘relu’, or ‘tanh’.
property num_spk
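The property simply reflects the value passed at construction, e.g.:
>>> DPCLSeparator(input_dim=80, num_spk=2).num_spk
2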