espnet2.enh.separator.dc_crn_separator.DC_CRNSeparator
class espnet2.enh.separator.dc_crn_separator.DC_CRNSeparator(input_dim: int, num_spk: int = 2, predict_noise: bool = False, input_channels: List = [2, 16, 32, 64, 128, 256], enc_hid_channels: int = 8, enc_kernel_size: Tuple = (1, 3), enc_padding: Tuple = (0, 1), enc_last_kernel_size: Tuple = (1, 4), enc_last_stride: Tuple = (1, 2), enc_last_padding: Tuple = (0, 1), enc_layers: int = 5, skip_last_kernel_size: Tuple = (1, 3), skip_last_stride: Tuple = (1, 1), skip_last_padding: Tuple = (0, 1), glstm_groups: int = 2, glstm_layers: int = 2, glstm_bidirectional: bool = False, glstm_rearrange: bool = False, mode: str = 'masking', ref_channel: int = 0)
Bases: AbsSeparator
Densely-Connected Convolutional Recurrent Network (DC-CRN) Separator.
This class implements the DC-CRN model for speech separation, based on the paper “Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones” by Tan et al. The model operates in one of two modes: complex spectral mapping or complex masking.
Reference: Tan, K., Zhang, X., & Wang, D. (2021). Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf
- Parameters:
- input_dim (int) – Input feature dimension.
- num_spk (int) – Number of speakers (default: 2).
- predict_noise (bool) – Whether to output the estimated noise signal (default: False).
- input_channels (list) – Number of input channels for the stacked DenselyConnectedBlock layers. Its length should be equal to the number of DenselyConnectedBlock layers (default: [2, 16, 32, 64, 128, 256]).
- enc_hid_channels (int) – Common number of intermediate channels for all DenselyConnectedBlock of the encoder (default: 8).
- enc_kernel_size (tuple) – Common kernel size for all DenselyConnectedBlock of the encoder (default: (1, 3)).
- enc_padding (tuple) – Common padding for all DenselyConnectedBlock of the encoder (default: (0, 1)).
- enc_last_kernel_size (tuple) – Common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder (default: (1, 4)).
- enc_last_stride (tuple) – Common stride for the last Conv layer in all DenselyConnectedBlock of the encoder (default: (1, 2)).
- enc_last_padding (tuple) – Common padding for the last Conv layer in all DenselyConnectedBlock of the encoder (default: (0, 1)).
- enc_layers (int) – Common total number of Conv layers for all DenselyConnectedBlock layers of the encoder (default: 5).
- skip_last_kernel_size (tuple) – Common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways (default: (1, 3)).
- skip_last_stride (tuple) – Common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways (default: (1, 1)).
- skip_last_padding (tuple) – Common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways (default: (0, 1)).
- glstm_groups (int) – Number of groups in each Grouped LSTM layer (default: 2).
- glstm_layers (int) – Number of Grouped LSTM layers (default: 2).
- glstm_bidirectional (bool) – Whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers (default: False).
- glstm_rearrange (bool) – Whether to apply the rearrange operation after each grouped LSTM layer (default: False).
- mode (str) – One of (“mapping”, “masking”). “mapping” for complex spectral mapping and “masking” for complex masking (default: “masking”).
- ref_channel (int) – Index of the reference microphone (default: 0).
- Raises: ValueError – If the provided mode is not one of (“mapping”, “masking”).
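The kernel, stride, and padding tuples above determine how each convolution changes the feature map size. The following is a rough sketch, not ESPnet code. It assumes standard 2-D convolution arithmetic with dilation 1, and it assumes that the second element of each tuple applies to the frequency axis (neither is stated above). Under those assumptions, the default encoder and skip-pathway settings preserve the frequency size, while the last Conv layer of each encoder block halves it. The helper `conv_out_size` is hypothetical.

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one axis (dilation 1 assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


# Defaults from the parameter list, second tuple element only
# (assumed here to be the frequency axis).
inner = conv_out_size(512, kernel=3, stride=1, padding=1)  # enc_kernel_size, enc_padding
last = conv_out_size(512, kernel=4, stride=2, padding=1)   # enc_last_kernel_size/stride/padding
skip = conv_out_size(512, kernel=3, stride=1, padding=1)   # skip_last_kernel_size/stride/padding

# First tuple element (assumed to be the time axis): kernel 1, stride 1, padding 0.
time_frames = conv_out_size(20, kernel=1, stride=1, padding=0)

print(inner, last, skip, time_frames)
```

With these defaults the first tuple element leaves the other axis untouched, which is consistent with a model that processes each frame without looking ahead in time, although that interpretation is an assumption here.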
####### Examples
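>>> import torch
>>> from espnet2.enh.separator.dc_crn_separator import DC_CRNSeparator
>>> separator = DC_CRNSeparator(input_dim=512, num_spk=2)
>>> # Complex-valued features: batch of 10, 20 time frames, 512 features
>>> input_tensor = torch.randn(10, 20, 512, dtype=torch.complex64)
>>> ilens = torch.tensor([20] * 10)  # All sequences are of length 20
>>> masked, ilens, others = separator(input_tensor, ilens)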
NOTE
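The meaning of the network output depends on the chosen mode: in “masking” mode it is a complex mask per speaker that is applied to the input spectrum, while in “mapping” mode it is the estimated complex spectrum of each speaker.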
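The difference between the two modes can be sketched on plain Python complex numbers. This is an illustrative sketch only: `apply_output` is a hypothetical helper, the values are made up, and the real class operates on tensors rather than lists.

```python
from typing import List


def apply_output(spectrum: List[complex], net_out: List[complex],
                 mode: str = "masking") -> List[complex]:
    """Turn a network output into a separated spectrum (illustrative only)."""
    if mode == "masking":
        # Complex masking: element-wise complex multiplication with the input.
        return [x * m for x, m in zip(spectrum, net_out)]
    if mode == "mapping":
        # Complex spectral mapping: the output already is the estimate.
        return list(net_out)
    raise ValueError(f"Unsupported mode: {mode}")


mixture = [1 + 1j, 2 - 1j]     # two time-frequency bins of a mixture
net_out = [0.5 + 0j, 0 + 1j]   # hypothetical network output for one speaker

masked = apply_output(mixture, net_out, mode="masking")
mapped = apply_output(mixture, net_out, mode="mapping")
print(masked, mapped)
```

Because the mask is complex, it can change both the magnitude and the phase of each bin, which a real-valued magnitude mask cannot do.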
Densely-Connected Convolutional Recurrent Network (DC-CRN) Separator
Reference: Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones; Tan et al., 2021. https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf
- Parameters:
- input_dim – input feature dimension
- num_spk – number of speakers
- predict_noise – whether to output the estimated noise signal
- input_channels (list) – number of input channels for the stacked DenselyConnectedBlock layers. Its length should be equal to the number of DenselyConnectedBlock layers.
- enc_hid_channels (int) – common number of intermediate channels for all DenselyConnectedBlock of the encoder
- enc_kernel_size (tuple) – common kernel size for all DenselyConnectedBlock of the encoder
- enc_padding (tuple) – common padding for all DenselyConnectedBlock of the encoder
- enc_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder
- enc_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the encoder
- enc_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the encoder
- enc_layers (int) – common total number of Conv layers for all DenselyConnectedBlock layers of the encoder
- skip_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways
- skip_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways
- skip_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways
- glstm_groups (int) – number of groups in each Grouped LSTM layer
- glstm_layers (int) – number of Grouped LSTM layers
- glstm_bidirectional (bool) – whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers
- glstm_rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer
- output_channels (int) – number of output channels (even number)
- mode (str) – one of (“mapping”, “masking”). “mapping”: complex spectral mapping; “masking”: complex masking
- ref_channel (int) – index of the reference microphone
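The `glstm_groups` and `glstm_rearrange` parameters can be pictured on a single feature vector. The sketch below is an assumption about what grouping and rearranging mean, following the common group-recurrent formulation in which features are split into equal groups and then interleaved across groups so that the next layer sees information from every group. It is not taken from the ESPnet implementation, and both helper names are hypothetical.

```python
from typing import List


def split_groups(features: List[int], groups: int) -> List[List[int]]:
    """Split a feature vector into `groups` equally sized chunks."""
    if len(features) % groups != 0:
        raise ValueError("feature size must be divisible by the number of groups")
    size = len(features) // groups
    return [features[g * size:(g + 1) * size] for g in range(groups)]


def rearrange(features: List[int], groups: int) -> List[int]:
    """Interleave features across groups (assumed form of the rearrange step)."""
    chunks = split_groups(features, groups)
    # View as (groups, size), transpose to (size, groups), then flatten.
    return [chunk[i] for i in range(len(chunks[0])) for chunk in chunks]


feats = [0, 1, 2, 3, 4, 5]
grouped = split_groups(feats, groups=2)
mixed = rearrange(feats, groups=2)
print(grouped, mixed)
```

In this picture each group would be processed by its own smaller LSTM, which reduces the parameter count compared with one LSTM over the full feature vector.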
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
DC-CRN Separator Forward.
This method processes the input features through the DC-CRN architecture to separate the signals from multiple speakers. It can operate in either masking or mapping mode based on the specified configuration.
- Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded complex feature tensor of shape [Batch, T, F] for single-channel input or [Batch, T, C, F] for multi-channel input, where T is the time dimension, F is the frequency dimension, and C is the number of channels.
- ilens (torch.Tensor) – Input lengths of shape [Batch,].
- additional (Optional[Dict]) – Additional data that can be provided for processing (default: None).
- Returns: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]. A tuple containing:
- masked (List[Union[torch.Tensor, ComplexTensor]]): List of tensors representing the masked output for each speaker, with shapes [(Batch, T, F), …].
- ilens (torch.Tensor): Tensor of input lengths with shape (Batch,).
- others (OrderedDict): Dictionary containing additional predicted data such as masks for each speaker:
- ’mask_spk1’: torch.Tensor(Batch, Frames, Freq)
- ’mask_spk2’: torch.Tensor(Batch, Frames, Freq)
- …
- ’mask_spkn’: torch.Tensor(Batch, Frames, Freq)
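The per-speaker entries of `others` follow the `mask_spk<i>` naming pattern listed above, with speaker indices starting at 1. The following minimal sketch shows one way to collect them in speaker order. The helper `mask_keys` is hypothetical, and a plain `OrderedDict` with placeholder values stands in for the real dictionary of tensors.

```python
from collections import OrderedDict
from typing import List


def mask_keys(num_spk: int) -> List[str]:
    """Keys of the per-speaker masks, following the 'mask_spk<i>' pattern."""
    return [f"mask_spk{i}" for i in range(1, num_spk + 1)]


# Stand-in for the `others` dictionary returned by forward();
# in real use the values would be tensors of shape (Batch, Frames, Freq).
others = OrderedDict((key, f"<mask for {key}>") for key in mask_keys(2))

# Collect the masks in speaker order.
per_speaker = [others[key] for key in mask_keys(2)]
print(list(others), per_speaker)
```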
####### Examples
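>>> import torch
>>> from espnet2.enh.separator.dc_crn_separator import DC_CRNSeparator
>>> separator = DC_CRNSeparator(input_dim=64, num_spk=2)
>>> # Complex-valued features: batch of 8, 100 time frames, 64 features
>>> input_tensor = torch.randn(8, 100, 64, dtype=torch.complex64)
>>> ilens = torch.tensor([100] * 8)  # Lengths of each input
>>> masked, ilens, others = separator.forward(input_tensor, ilens)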
NOTE
Ensure that the input tensor has the shape that matches its layout: [Batch, T, F] for single-channel input or [Batch, T, C, F] for multi-channel input. The function checks whether the input is complex and processes it accordingly.
- Raises: ValueError – If the mode is not one of (“mapping”, “masking”).
property num_spk