espnet2.enh.separator.svoice_separator.SVoiceSeparator

class espnet2.enh.separator.svoice_separator.SVoiceSeparator(input_dim: int, enc_dim: int, kernel_size: int, hidden_size: int, num_spk: int = 2, num_layers: int = 4, segment_size: int = 20, bidirectional: bool = True, input_normalize: bool = False)

Bases: AbsSeparator

SVoice model for speech separation.

This model implements the SVoice architecture for separating multiple speakers from a mixed audio input. It combines an encoder-decoder structure with a dual-path RNN built from stacked MulCat blocks. The reference paper targets mixtures with an unknown number of speakers; this class produces a fixed number of outputs, set by num_spk.

Reference: Voice Separation with an Unknown Number of Multiple Speakers; E. Nachmani et al., 2020; https://arxiv.org/abs/2003.01531

Parameters:
- input_dim (int) – Dimension of the input features.
- enc_dim (int) – Dimension of the encoder module’s output. (Required; no default)
- kernel_size (int) – The kernel size of Conv1D layer in both encoder and decoder modules. (Required; no default)
- hidden_size (int) – Dimension of the hidden state in RNN layers. (Required; no default)
- num_spk (int) – The number of speakers in the output. (Default: 2)
- num_layers (int) – Number of stacked MulCat blocks. (Default: 4)
- segment_size (int) – Dual-path segment size. (Default: 20)
- bidirectional (bool) – Whether the RNN layers are bidirectional. (Default: True)
- input_normalize (bool) – Whether to apply GroupNorm on the input Tensor. (Default: False)
Returns: A tuple containing:
- masked (List[Union[torch.Tensor, ComplexTensor]]): A list of tensors representing the separated sources for each speaker, with shape [(B, T, N), …].
- ilens (torch.Tensor): A tensor representing the lengths of the input sequences, with shape (B,).
- others (OrderedDict): A dictionary containing additional predicted data, such as masks for each speaker: {'mask_spk1': torch.Tensor(Batch, Frames, Freq), 'mask_spk2': torch.Tensor(Batch, Frames, Freq), …, 'mask_spkn': torch.Tensor(Batch, Frames, Freq)}
Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
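
To make the kernel_size parameter more concrete, the sketch below estimates how many frames a Conv1D encoder would produce for a given number of input samples. It assumes an unpadded convolution with a stride of kernel_size // 2, a common choice in TasNet-style encoders; the stride and the helper name `num_encoder_frames` are assumptions for illustration and are not confirmed by this page.

```python
def num_encoder_frames(num_samples: int, kernel_size: int) -> int:
    """Frames produced by an unpadded 1-D convolution (hypothetical helper).

    Assumes stride = kernel_size // 2, which this page does not confirm.
    """
    stride = kernel_size // 2
    if stride < 1:
        raise ValueError("kernel_size must be at least 2")
    if num_samples < kernel_size:
        # The kernel does not fit even once, so no frame is produced.
        return 0
    # Standard output-length formula for a convolution without padding.
    return (num_samples - kernel_size) // stride + 1


frames = num_encoder_frames(num_samples=16000, kernel_size=8)
print(frames)
```

Under these assumptions a smaller kernel_size yields more frames per second of audio, which lengthens the sequence the RNN layers have to process.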

####### Examples

>>> import torch
>>> from espnet2.enh.separator.svoice_separator import SVoiceSeparator
>>> separator = SVoiceSeparator(input_dim=256, enc_dim=128, kernel_size=8,
...                             hidden_size=128)  # hidden_size is required
>>> input_tensor = torch.randn(10, 100, 256)  # Batch of 10, 100 time steps
>>> input_lengths = torch.tensor([100] * 10)  # All sequences have length 100
>>> outputs, lengths, masks = separator(input_tensor, input_lengths)

NOTE

The additional argument is not used in this model but is included for compatibility with the general interface of the separator class.
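
The segment_size parameter controls how the encoded sequence is chunked for dual-path processing. The following is a minimal sketch of that idea in plain Python, assuming 50% overlap between neighbouring segments and zero-padding at the tail; the function name `split_into_segments` and the exact padding scheme are illustrative assumptions, and the library's own implementation may pad differently.

```python
from typing import List


def split_into_segments(frames: List[float], segment_size: int) -> List[List[float]]:
    """Split a sequence into 50%-overlapping segments (illustrative only)."""
    if segment_size < 2 or segment_size % 2:
        raise ValueError("segment_size must be an even integer >= 2")
    hop = segment_size // 2
    n = len(frames)
    # Zero-pad the tail so that the last segment is complete.
    if n <= segment_size:
        pad = segment_size - n
    else:
        pad = (-(n - segment_size)) % hop
    padded = list(frames) + [0.0] * pad
    # Slide a window of segment_size frames with a hop of half a segment.
    last_start = len(padded) - segment_size
    return [padded[s:s + segment_size] for s in range(0, last_start + 1, hop)]


segments = split_into_segments([float(i) for i in range(45)], segment_size=20)
print(len(segments), len(segments[0]))
```

In a dual-path model, one RNN then runs within each segment and another runs across segments, so segment_size trades off the length of the two sequences.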

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input: Tensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor], Tensor, OrderedDict]

SVoice model for speech separation.

This class implements the SVoice model, which separates the speech of multiple speakers from a mixed audio input. It uses an encoder-decoder architecture with a dual-path recurrent neural network (RNN) between the encoder and the decoder.

Reference: Voice Separation with an Unknown Number of Multiple Speakers; E. Nachmani et al., 2020; https://arxiv.org/abs/2003.01531

enc_dim

Dimension of the encoder module’s output.

Type: int

kernel_size

The kernel size of Conv1D layer in both encoder and decoder modules.

Type: int

hidden_size

Dimension of the hidden state in RNN layers.

Type: int

num_spk

The number of speakers in the output.

Type: int

num_layers

Number of stacked MulCat blocks.

Type: int

segment_size

Dual-path segment size.

Type: int

bidirectional

Whether the RNN layers are bidirectional.

Type: bool

input_normalize

Whether to apply GroupNorm on the input Tensor.

Type: bool

Parameters:
- input_dim (int) – Dimension of the input features.
- enc_dim (int) – Dimension of the encoder module’s output.
- kernel_size (int) – The kernel size of Conv1D layer in both encoder and decoder modules.
- hidden_size (int) – Dimension of the hidden state in RNN layers.
- num_spk (int, optional) – The number of speakers in the output. (Default: 2)
- num_layers (int, optional) – Number of stacked MulCat blocks. (Default: 4)
- segment_size (int, optional) – Dual-path segment size. (Default: 20)
- bidirectional (bool, optional) – Whether the RNN layers are bidirectional. (Default: True)
- input_normalize (bool, optional) – Whether to apply GroupNorm on the input Tensor. (Default: False)
Returns:
- masked: List of tensors containing separated audio signals for each speaker.
- ilens: Tensor containing the lengths of the input sequences.
- others: An OrderedDict containing any additional predicted data such as masks for each speaker.
Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
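
The others dictionary is described on this page as holding per-speaker masks under keys of the form mask_spk1, mask_spk2, and so on. A small sketch of reading those entries in speaker order is shown here; the helper name `collect_masks` is hypothetical, the string values are stand-ins for tensors, and the lookup tolerates missing keys because this page does not guarantee that every mask is present.

```python
from collections import OrderedDict
from typing import Any, List, Mapping, Optional


def collect_masks(others: Mapping[str, Any], num_spk: int) -> List[Optional[Any]]:
    """Return per-speaker masks in speaker order, None where a key is absent."""
    # Keys are 1-indexed, following the mask_spk1 ... mask_spkn pattern.
    return [others.get(f"mask_spk{i}") for i in range(1, num_spk + 1)]


# Stand-in values; a real call would hold torch.Tensor masks instead.
others = OrderedDict(mask_spk1="m1", mask_spk2="m2")
masks = collect_masks(others, num_spk=3)
print(masks)
```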

####### Examples

>>> model = SVoiceSeparator(input_dim=512, enc_dim=128,
...                         kernel_size=8, hidden_size=128)
>>> input_tensor = torch.randn(2, 100, 512)  # Batch of 2, 100 time steps
>>> ilens = torch.tensor([100, 100])  # Lengths of each input
>>> outputs, ilens, others = model(input_tensor, ilens)

NOTE

The additional argument is not used in this model but is included for compatibility with other models.

property num_spk