espnet2.enh.separator.dan_separator.DANSeparator


class espnet2.enh.separator.dan_separator.DANSeparator(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)

Bases: AbsSeparator

DANSeparator is a Deep Attractor Network for single-microphone speaker separation. It uses recurrent neural networks to estimate a mask for each speaker in an audio mixture.

This model is based on the research paper:

DEEP ATTRACTOR NETWORK FOR SINGLE-MICROPHONE SPEAKER SEPARATION;

Zhuo Chen et al., 2017; https://pubmed.ncbi.nlm.nih.gov/29430212/

num_spk

The number of speakers to separate.

Type: int
Parameters:
- input_dim (int) – Input feature dimension.
- rnn_type (str) – Type of RNN, options include ‘blstm’, ‘lstm’, etc.
- num_spk (int) – Number of speakers to separate. Default is 2.
- nonlinear (str) – Nonlinear function for mask estimation. Options include ‘relu’, ‘tanh’, ‘sigmoid’. Default is ‘tanh’.
- layer (int) – Number of stacked RNN layers. Default is 2.
- unit (int) – Dimension of the hidden state. Default is 512.
- emb_D (int) – Dimension of the attribute vector for one time-frequency bin. Default is 40.
- dropout (float) – Dropout ratio. Default is 0.0.
Raises: ValueError – If the nonlinear function is not one of the supported types.
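The `nonlinear` check can be pictured as a lookup that rejects unknown names. The following is a minimal sketch in plain Python, not ESPnet's implementation: `resolve_nonlinear` and `_ACTIVATIONS` are hypothetical names used only for illustration, and the error message text is an assumption.

```python
import math

# Hypothetical lookup table for illustration; the real class is assumed
# to select torch activation functions internally.
_ACTIVATIONS = {
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "relu": lambda x: max(0.0, x),
    "tanh": math.tanh,
}


def resolve_nonlinear(name):
    """Return the scalar activation registered under `name`.

    Raises ValueError when `name` is not a supported option.
    """
    if name not in _ACTIVATIONS:
        raise ValueError("Unsupported nonlinear function: {}".format(name))
    return _ACTIVATIONS[name]


act = resolve_nonlinear("tanh")
print(act(0.5))

try:
    resolve_nonlinear("gelu")  # not in the supported set
except ValueError as err:
    print("rejected:", err)
```

Validating the name at construction time means a typo surfaces immediately rather than during the first forward pass.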

####### Examples

>>> import torch
>>> from espnet2.enh.separator.dan_separator import DANSeparator
>>> separator = DANSeparator(input_dim=80, num_spk=2, rnn_type='blstm')
>>> input_tensor = torch.randn(10, 100, 80)  # Batch of 10, 100 time steps, 80 features
>>> ilens = torch.tensor([100] * 10)  # Input length of each batch element
>>> masked, ilens_out, others = separator(input_tensor, ilens)
>>> # masked contains the separated signals for each speaker
>>> # ilens_out contains the lengths of the output signals
>>> # others contains the predicted masks for each speaker

Deep Attractor Network Separator

Reference: DEEP ATTRACTOR NETWORK FOR SINGLE-MICROPHONE SPEAKER SEPARATION; Zhuo Chen et al., 2017; https://pubmed.ncbi.nlm.nih.gov/29430212/

Parameters:
- input_dim – input feature dimension
- rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
- num_spk – number of speakers
- nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
- layer – int, number of stacked RNN layers. Default is 2.
- unit – int, dimension of the hidden state.
- emb_D – int, dimension of the attribute vector for one tf-bin.
- dropout – float, dropout ratio. Default is 0.

forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]

Forward pass for the Deep Attractor Network (DAN) separator.

This method processes the input audio features and computes the estimated masks for each speaker. It utilizes a recurrent neural network (RNN) to generate embeddings, from which attractors are derived to separate the speakers’ contributions in the mixed signal.
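The embedding, attractor, and mask steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the ESPnet code: it follows the formulation in the referenced paper (an attractor is the mean embedding of the bins assigned to a speaker, and masks are a softmax over embedding-attractor dot products), it assumes the bin-to-speaker assignment is already known, and the function names `attractors` and `soft_masks` are hypothetical.

```python
import math


def attractors(embeddings, assignments, num_spk):
    """Mean embedding of the bins assigned to each speaker.

    embeddings: list of D-dim vectors, one per time-frequency bin.
    assignments: list of speaker indices, one per bin (assumed known).
    """
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_spk)]
    counts = [0] * num_spk
    for vec, spk in zip(embeddings, assignments):
        counts[spk] += 1
        for d in range(dim):
            sums[spk][d] += vec[d]
    # A small epsilon guards against a speaker with no assigned bins.
    return [[s / (counts[k] + 1e-8) for s in sums[k]] for k in range(num_spk)]


def soft_masks(embeddings, attrs):
    """Softmax over speakers of the embedding-attractor dot products."""
    masks = []
    for vec in embeddings:
        scores = [sum(v * a for v, a in zip(vec, att)) for att in attrs]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        masks.append([e / total for e in exps])
    return masks


# Four time-frequency bins with 2-dim embeddings, two speakers.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
att = attractors(emb, labels, num_spk=2)
masks = soft_masks(emb, att)
for row in masks:
    print([round(m, 3) for m in row])
```

Each row of `masks` sums to 1, so every time-frequency bin is shared out among the speakers in proportion to how close its embedding lies to each attractor.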

Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded feature tensor of shape [B, T, F], where B is the batch size, T is the number of time frames, and F is the number of frequency bins.
- ilens (torch.Tensor) – Lengths of the input sequences, one per batch element, shape [B].
- additional (Optional[Dict]) – A dictionary containing additional data that may be used in the model. For example, it may include:
  - "feature_ref": List of reference spectra, List[(B, T, F)].
Returns: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]:
- masked (List[Union[torch.Tensor, ComplexTensor]]): A list with one tensor per speaker, [(B, T, F), …], each with the same shape as the input.
- ilens (torch.Tensor): A tensor of shape (B,) representing the lengths of the input sequences.
- others (OrderedDict): A dictionary containing predicted data, such as masks for each speaker, with the following structure:
  OrderedDict[
    'mask_spk1': torch.Tensor(Batch, Frames, Freq),
    'mask_spk2': torch.Tensor(Batch, Frames, Freq),
    …
    'mask_spkn': torch.Tensor(Batch, Frames, Freq),
  ]
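To make the return structure concrete, the sketch below rebuilds it with nested lists instead of tensors. It assumes each separated signal is the element-wise product of the input and that speaker's mask, which is a common convention for mask-based separators but is not confirmed by this page. The helper `apply_mask` is a hypothetical name.

```python
from collections import OrderedDict

# Toy mixture with B=1, T=2, F=2, written as nested lists.
mixture = [[[2.0, 4.0], [6.0, 8.0]]]
# One mask per speaker, same shape as the mixture; per-bin values sum to 1.
masks = [
    [[[0.75, 0.5], [0.25, 1.0]]],
    [[[0.25, 0.5], [0.75, 0.0]]],
]


def apply_mask(x, m):
    """Element-wise product of two equally shaped nested lists."""
    if isinstance(x, list):
        return [apply_mask(xi, mi) for xi, mi in zip(x, m)]
    return x * m


# Assumed relationship: separated signal = mixture * mask, per speaker.
masked = [apply_mask(mixture, m) for m in masks]
# Keys follow the 'mask_spk1', 'mask_spk2', ... pattern listed above.
others = OrderedDict(
    ("mask_spk{}".format(i + 1), m) for i, m in enumerate(masks)
)

print(list(others.keys()))
print(masked[0])
```

With real outputs, the same key pattern lets you fetch a speaker's mask by index, for example `others["mask_spk1"]`.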

####### Examples

>>> model = DANSeparator(input_dim=256)
>>> input_tensor = torch.randn(10, 100, 256)  # Example input
>>> ilens = torch.tensor([100] * 10)  # All sequences are of length 100
>>> masked, ilens_out, others = model.forward(input_tensor, ilens)

NOTE

Ensure that the input feature tensor and ilens are properly aligned and of correct dimensions.
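One way to act on this note is to validate shapes before calling `forward`. The helper below is a hypothetical sketch, not part of ESPnet. It works on a plain shape tuple and a list of lengths, so with tensors one would pass `tuple(input_tensor.shape)` and `ilens.tolist()`.

```python
def check_lengths(shape, ilens):
    """Validate a (B, T, F) shape against per-utterance lengths.

    shape: 3-tuple (batch, frames, features).
    ilens: list of ints, one length per batch element.
    Raises ValueError on any mismatch.
    """
    if len(shape) != 3:
        raise ValueError("expected [B, T, F], got {} dims".format(len(shape)))
    batch, frames, _ = shape
    if len(ilens) != batch:
        raise ValueError(
            "ilens has {} entries for batch size {}".format(len(ilens), batch)
        )
    if any(n < 1 or n > frames for n in ilens):
        raise ValueError("every length must lie in [1, {}]".format(frames))


# Matches the example above: 10 utterances, 100 frames, 256 features.
check_lengths((10, 100, 256), [100] * 10)

try:
    check_lengths((10, 100, 256), [100] * 9)  # one length missing
except ValueError as err:
    print("rejected:", err)
```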

Raises: ValueError – If the nonlinear activation function is not one of 'sigmoid', 'relu', or 'tanh'.

property num_spk