espnet2.enh.separator.dccrn_separator.DCCRNSeparator
class espnet2.enh.separator.dccrn_separator.DCCRNSeparator(input_dim: int, num_spk: int = 1, rnn_layer: int = 2, rnn_units: int = 256, masking_mode: str = 'E', use_clstm: bool = True, bidirectional: bool = False, use_cbn: bool = False, kernel_size: int = 5, kernel_num: List[int] = [32, 64, 128, 256, 256, 256], use_builtin_complex: bool = True, use_noise_mask: bool = False)
Bases: AbsSeparator
DCCRN Separator for speech separation tasks.
This class implements the DCCRN (Deep Complex Convolutional Recurrent Network) architecture for separating mixed audio signals into individual sources using complex convolutional and recurrent neural networks.
use_builtin_complex
Flag to determine whether to use torch.complex or ComplexTensor for complex operations.
- Type: bool
_num_spk
Number of speakers to separate.
- Type: int
use_noise_mask
Flag to indicate if noise mask estimation should be performed.
- Type: bool
predict_noise
Flag to determine if noise prediction is enabled.
- Type: bool
rnn_units
Number of units in the recurrent layers.
- Type: int
hidden_layers
Number of LSTM layers in the CRN.
- Type: int
kernel_size
Size of the convolutional kernels.
- Type: int
kernel_num
Number of output channels for each encoder layer.
- Type: list
masking_mode
Mode of mask application (C, E, R).
- Type: str
use_clstm
Flag to indicate if complex LSTM should be used.
- Type: bool
Parameters:
- input_dim (int) – Input dimension.
- num_spk (int, optional) – Number of speakers. Defaults to 1.
- rnn_layer (int, optional) – Number of LSTM layers in the CRN. Defaults to 2.
- rnn_units (int, optional) – Number of RNN units. Defaults to 256.
- masking_mode (str, optional) – Usage of the estimated mask ("C", "E", or "R"). Defaults to "E".
- use_clstm (bool, optional) – Whether to use complex LSTM. Defaults to True.
- bidirectional (bool, optional) – Whether to use bidirectional LSTM. Defaults to False.
- use_cbn (bool, optional) – Whether to use complex batch normalization. Defaults to False.
- kernel_size (int, optional) – Convolution kernel size. Defaults to 5.
- kernel_num (list, optional) – Output dimension of each layer of the encoder. Defaults to [32, 64, 128, 256, 256, 256].
- use_builtin_complex (bool, optional) – Use torch.complex if True, else ComplexTensor. Defaults to True.
- use_noise_mask (bool, optional) – Whether to estimate the mask of noise. Defaults to False.
Raises: ValueError – If the masking mode is unsupported.
Examples
>>> separator = DCCRNSeparator(input_dim=256, num_spk=2)
>>> # DCCRN operates on complex spectra, so the input must be complex-valued
>>> input_tensor = torch.randn(10, 20, 256, dtype=torch.complex64)  # Batch of 10, 20 time frames
>>> ilens = torch.tensor([20] * 10)  # All inputs are of length 20
>>> masked, ilens, others = separator(input_tensor, ilens)
>>> print(masked)  # A list of complex tensors, one per speaker
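With use_noise_mask=True the network additionally estimates a noise mask; a minimal sketch of retrieving the noise estimate (the "noise1" key is an assumption, following the convention used by other ESPnet separators):
>>> separator = DCCRNSeparator(input_dim=256, num_spk=2, use_noise_mask=True)
>>> masked, ilens, others = separator(input_tensor, ilens)
>>> noise = others.get("noise1")  # estimated noise spectrum, if exposed under this key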
NOTE
This implementation is designed to work with complex-valued inputs and outputs, and it may require specific versions of PyTorch for optimal performance.
apply_masks(masks: List[Tensor | ComplexTensor], real: Tensor, imag: Tensor)
Apply estimated masks to the real and imaginary parts of the noisy spectrum.
This method processes the estimated masks for each speaker and applies them to the noisy spectrogram, modifying the real and imaginary components based on the specified masking mode. It supports different masking techniques, allowing for flexible enhancement of the input signal.
- Parameters:
- masks (List[Union[torch.Tensor, ComplexTensor]]) – A list of estimated masks, each with shape (B, T, F), where B is the batch size, T is the time dimension, and F is the frequency dimension.
- real (torch.Tensor) – The real part of the noisy spectrum with shape (B, F, T).
- imag (torch.Tensor) – The imaginary part of the noisy spectrum with shape (B, F, T).
- Returns: A list of masked outputs, each with shape (B, T, F).
- Return type: List[Union[torch.Tensor, ComplexTensor]]
Examples
>>> # Assuming masks, real, and imag are predefined tensors
>>> masked_outputs = separator.apply_masks(masks, real, imag)
>>> # masked_outputs contains the masked spectra produced by applying the
>>> # masks to the real and imaginary parts of the input spectrum
NOTE
The masking modes supported are:
- "E": Estimate using the magnitude and phase.
- "C": Combine using complex multiplication.
- "R": Apply the mask to the real and imaginary parts independently.
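As a rough illustration of the arithmetic behind the three modes, here is a minimal sketch following the DCCRN paper rather than the class's internal code; mask_real/mask_imag and real/imag are placeholder tensors for the mask and noisy-spectrum components:

import torch

def apply_mask_sketch(mask_real, mask_imag, real, imag, masking_mode="E"):
    """Sketch of the three DCCRN masking modes (paper-level math)."""
    if masking_mode == "E":
        # Polar masking: bounded magnitude mask plus additive phase correction
        spec_mags = torch.sqrt(real**2 + imag**2 + 1e-8)
        spec_phase = torch.atan2(imag, real)
        mask_mags = torch.tanh(torch.sqrt(mask_real**2 + mask_imag**2 + 1e-8))
        mask_phase = torch.atan2(mask_imag, mask_real)
        est_mags = mask_mags * spec_mags
        est_phase = spec_phase + mask_phase
        return est_mags * torch.cos(est_phase), est_mags * torch.sin(est_phase)
    elif masking_mode == "C":
        # Complex multiplication: (real + j*imag) * (mask_real + j*mask_imag)
        return (
            real * mask_real - imag * mask_imag,
            real * mask_imag + imag * mask_real,
        )
    elif masking_mode == "R":
        # Independent masking of the real and imaginary parts
        return real * mask_real, imag * mask_imag
    raise ValueError(f"Unsupported masking mode: {masking_mode}")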
create_masks(mask_tensor: Tensor)
Create estimated mask for each speaker.
This method processes the output from the decoder to generate masks for each speaker based on the given mask tensor. The masks can be used to separate audio signals of multiple speakers.
- Parameters: mask_tensor (torch.Tensor) – Output of the decoder with shape (B, 2*num_spk, F-1, T), or (B, 2*(num_spk+1), F-1, T) when use_noise_mask is True. The channels hold the real and imaginary parts of the estimated mask for each output.
- Returns: A list of estimated masks, where each mask has the shape (B, T, F) for each speaker.
- Return type: List[Union[torch.Tensor, ComplexTensor]]
- Raises: AssertionError – If the shape of mask_tensor does not match the expected dimensions based on the use_noise_mask flag.
Examples
>>> separator = DCCRNSeparator(input_dim=256, num_spk=2)
>>> mask_tensor = torch.randn(4, 4, 128, 100) # Example tensor
>>> masks = separator.create_masks(mask_tensor)
>>> for mask in masks:
... print(mask.shape) # Each mask shape should be (B, T, F)
NOTE
The method checks the number of output channels in the mask tensor against the expected number of speakers and raises an assertion error if there is a mismatch.
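Conceptually, the decoder output is split along the channel dimension into (real, imaginary) pairs, one pair per estimated mask. A minimal sketch under that assumed channel layout (the actual method also restores the full F frequency bins, which this sketch omits):

import torch

def split_mask_channels(mask_tensor, num_outputs):
    """Split a (B, 2*num_outputs, F-1, T) decoder output into complex masks."""
    # Chunk the channel dimension into num_outputs (real, imag) pairs
    chunks = torch.chunk(mask_tensor, num_outputs, dim=1)
    masks = []
    for chunk in chunks:
        real, imag = chunk[:, 0], chunk[:, 1]   # each (B, F-1, T)
        mask = torch.complex(real, imag)        # assumes use_builtin_complex=True
        masks.append(mask.permute(0, 2, 1))     # -> (B, T, F-1)
    return masks

mask_tensor = torch.randn(4, 4, 128, 100)       # B=4, 2*num_spk=4, F-1=128, T=100
masks = split_mask_channels(mask_tensor, num_outputs=2)
print([m.shape for m in masks])                 # [torch.Size([4, 100, 128]), ...]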
flatten_parameters()
Flatten the parameters of the RNN for optimized performance.
This method is specifically useful when using LSTM layers, as it ensures that the internal states of the LSTM are contiguous in memory, which can improve the performance of the forward pass.
NOTE
This method should be called before invoking the forward pass when using LSTM layers to ensure optimal performance.
- Raises: ValueError – If the enhance layer is not an instance of nn.LSTM.
Examples
>>> model = DCCRNSeparator(input_dim=128, num_spk=2)
>>> model.flatten_parameters()
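Internally this typically reduces to PyTorch's own nn.LSTM.flatten_parameters(); a minimal sketch of the usual guard pattern (the self.enhance attribute name is an assumption):

import torch.nn as nn

class FlattenSketch(nn.Module):
    """Minimal sketch: compact LSTM weights before the forward pass."""

    def __init__(self):
        super().__init__()
        self.enhance = nn.LSTM(input_size=8, hidden_size=8, num_layers=2)

    def flatten_parameters(self):
        # nn.LSTM.flatten_parameters() moves the weights into a single
        # contiguous memory block, which speeds up the (cuDNN) forward pass
        if isinstance(self.enhance, nn.LSTM):
            self.enhance.flatten_parameters()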
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward pass through the DCCRN separator.
This method takes encoded features and performs a forward pass through the network to separate the sources. It applies complex operations and returns the estimated masks for each speaker.
Parameters:
- input (torch.Tensor or ComplexTensor) – Encoded feature tensor of shape [B, T, F], where B is the batch size, T is the number of time frames, and F is the number of frequency bins.
- ilens (torch.Tensor) – Input lengths tensor of shape [Batch] indicating the valid lengths of the input sequences.
- additional (Dict or None) – Additional data that can be included in the model. NOTE: This parameter is not used in this model.
Returns:
- masked (List[Union[torch.Tensor, ComplexTensor]]): A list of masked output tensors, each of shape (B, T, F), one for each separated source.
- ilens (torch.Tensor): Tensor of shape (B,) containing the input lengths.
- others (OrderedDict): An ordered dictionary containing the predicted mask for each speaker, e.g.:
  OrderedDict[
      'mask_spk1': torch.Tensor(Batch, Frames, Freq),
      'mask_spk2': torch.Tensor(Batch, Frames, Freq),
      ...
      'mask_spkn': torch.Tensor(Batch, Frames, Freq),
  ]
Examples
>>> model = DCCRNSeparator(input_dim=256, num_spk=2)
>>> input_tensor = torch.randn(10, 100, 256, dtype=torch.complex64)  # Complex input spectrum
>>> ilens = torch.tensor([100] * 10)  # All inputs are of length 100
>>> masked, ilens_out, masks = model(input_tensor, ilens)
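The per-speaker masks can then be read from the returned dictionary (key names follow the OrderedDict layout documented above):
>>> for name, mask in masks.items():
...     print(name, mask.shape)  # e.g. mask_spk1 torch.Size([10, 100, 256])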
NOTE
The method relies on internal operations to reshape and permute tensors for processing through the encoder, RNN layers, and decoder. It is designed to handle both real and complex tensors.
- Raises: ValueError – If the masking mode is unsupported.
property num_spk
The number of speakers this separator estimates (read-only).