espnet2.enh.separator.dpcl_e2e_separator.DPCLE2ESeparator
class espnet2.enh.separator.dpcl_e2e_separator.DPCLE2ESeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0, alpha: float = 5.0, max_iteration: int = 500, threshold: float = 1e-05)
Bases: AbsSeparator
Deep Clustering End-to-End Separator.
This class implements a deep clustering approach for single-channel multi-speaker separation. The model utilizes a recurrent neural network (RNN) architecture to learn speaker-specific masks from the input audio features.
References
Single-Channel Multi-Speaker Separation using Deep Clustering; Yusuf Isik et al., 2016; https://www.isca-speech.org/archive/interspeech_2016/isik16_interspeech.html
- Parameters:
- input_dim (int) – Input feature dimension.
- rnn_type (str) – Type of RNN to use. Options include ‘blstm’, ‘lstm’, etc.
- num_spk (int) – Number of speakers in the input audio.
- predict_noise (bool) – Whether to output the estimated noise signal.
- nonlinear (str) – Nonlinear function for mask estimation. Options: ‘relu’, ‘tanh’, ‘sigmoid’.
- layer (int) – Number of stacked RNN layers. Default is 2.
- unit (int) – Dimension of the hidden state.
- emb_D (int) – Dimension of the embedding vector for each time-frequency (T-F) bin.
- dropout (float) – Dropout ratio. Default is 0.0.
- alpha (float) – Clustering hardness parameter.
- max_iteration (int) – Maximum iterations for soft k-means.
- threshold (float) – Threshold to end the soft k-means process (see the sketch after this list).
- Returns: None.
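The alpha, max_iteration, and threshold parameters drive the soft k-means loop that clusters the per-T-F-bin embeddings into speaker groups. A minimal, illustrative update step follows (the function soft_kmeans_step and its variable names are assumptions for this sketch, not part of the class):

import torch

def soft_kmeans_step(emb, centers, alpha):
    # emb:     (B, T*F, D) embedding vector for every time-frequency bin
    # centers: (B, K, D)   current centroid of each speaker cluster
    # Squared Euclidean distance from each embedding to each centroid: (B, T*F, K)
    dist = torch.cdist(emb, centers) ** 2
    # Soft assignments; a larger alpha pushes them closer to one-hot
    gamma = torch.softmax(-alpha * dist, dim=-1)
    # Centroids become the assignment-weighted mean of the embeddings
    centers = torch.einsum("bnk,bnd->bkd", gamma, emb) / (
        gamma.sum(dim=1).unsqueeze(-1) + 1e-8
    )
    return gamma, centers

Repeating this step until the centroid movement falls below threshold (or max_iteration steps have run) yields soft assignments gamma that act as the per-speaker masks.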
Examples
separator = DPCLE2ESeparator(input_dim=257, num_spk=2)
input_features = torch.randn(10, 100, 257)  # (Batch, Time, Frequency)
input_lengths = torch.tensor([100] * 10)  # Lengths of each input sequence
masked_outputs, lengths, others = separator(input_features, input_lengths)
NOTE
This separator is designed to work with both real and complex input tensors. Ensure the input features are properly formatted before passing them to the forward method.
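For example, a complex STFT feature can be passed directly (a minimal sketch, assuming the torch_complex package that ESPnet uses for complex features; the shapes and batch size are illustrative):

import torch
from torch_complex.tensor import ComplexTensor

from espnet2.enh.separator.dpcl_e2e_separator import DPCLE2ESeparator

separator = DPCLE2ESeparator(input_dim=257, num_spk=2)
# Complex spectrum with real and imaginary parts, shape (Batch, Time, Freq)
spec = ComplexTensor(torch.randn(4, 100, 257), torch.randn(4, 100, 257))
ilens = torch.tensor([100, 96, 88, 80])  # valid length of each utterance
masked, ilens_out, others = separator(spec, ilens)  # masked: list of ComplexTensor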
- Raises: ValueError – If an unsupported nonlinear activation function is provided.
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward pass for the DPCLE2ESeparator.
This method takes the encoded features and runs them through the model to separate the sources, applying soft k-means clustering to the learned T-F embeddings to estimate a mask for each speaker.
Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded feature tensor of shape [B, T, F], where B is batch size, T is time frames, and F is the number of frequency bins.
- ilens (torch.Tensor) – Tensor of input lengths of shape [Batch].
- additional (Optional[Dict], optional) – Additional information passed to the forward method. Defaults to None.
Returns: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]: A tuple containing:
- masked (List[Union[torch.Tensor, ComplexTensor]]): List of separated feature tensors, each of shape (B, T, F).
- ilens (torch.Tensor): Tensor containing the lengths of each output in the batch.
- others (OrderedDict): Additional predicted data, e.g., masks for each speaker:
- 'mask_spk1': torch.Tensor(Batch, Frames, Freq),
- 'mask_spk2': torch.Tensor(Batch, Frames, Freq),
...
- 'mask_spkn': torch.Tensor(Batch, Frames, Freq).
Examples
>>> separator = DPCLE2ESeparator(input_dim=128)
>>> input_tensor = torch.randn(10, 100, 128) # Batch of 10
>>> ilens = torch.tensor([100] * 10) # All sequences of length 100
>>> masked, ilens_out, others = separator.forward(input_tensor, ilens)
NOTE
The output masks can be applied to the input features to obtain the estimated sources.
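For instance, reusing the variables from the example above (a minimal sketch; the mask keys follow the naming shown in the Returns section):

# Multiply each predicted mask with the mixture features to estimate each source
est_sources = [
    input_tensor * others["mask_spk{}".format(i + 1)]
    for i in range(separator.num_spk)
]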
- Raises: ValueError – If the input is not a valid tensor or if ilens does not match the batch size.
property num_spk