espnet2.enh.separator.svoice_separator.SVoiceSeparator
class espnet2.enh.separator.svoice_separator.SVoiceSeparator(input_dim: int, enc_dim: int, kernel_size: int, hidden_size: int, num_spk: int = 2, num_layers: int = 4, segment_size: int = 20, bidirectional: bool = True, input_normalize: bool = False)
Bases: AbsSeparator
SVoice model for speech separation.
This model implements the SVoice architecture for separating a mixed audio input into individual speaker signals. It combines an encoder-decoder structure with a dual-path RNN built from stacked MulCat blocks. The number of output speakers is fixed by num_spk.
Reference: Voice Separation with an Unknown Number of Multiple Speakers; E. Nachmani et al., 2020; https://arxiv.org/abs/2003.01531
- Parameters:
- input_dim (int) – Dimension of the input features.
- enc_dim (int) – Dimension of the encoder module’s output. (Required; no default in the signature.)
- kernel_size (int) – The kernel size of Conv1D layer in both encoder and decoder modules. (Required; no default in the signature.)
- hidden_size (int) – Dimension of the hidden state in RNN layers. (Required; no default in the signature.)
- num_spk (int) – The number of speakers in the output. (Default: 2)
- num_layers (int) – Number of stacked MulCat blocks. (Default: 4)
- segment_size (int) – Dual-path segment size. (Default: 20)
- bidirectional (bool) – Whether the RNN layers are bidirectional. (Default: True)
- input_normalize (bool) – Whether to apply GroupNorm on the input Tensor. (Default: False)
- Returns: A tuple containing:
  - masked (List[Union[torch.Tensor, ComplexTensor]]): A list of tensors representing the separated sources for each speaker, with shape [(B, T, N), …].
  - ilens (torch.Tensor): A tensor representing the lengths of the input sequences, with shape (B,).
  - others (OrderedDict): A dictionary containing additional predicted data, such as masks for each speaker: {'mask_spk1': torch.Tensor(Batch, Frames, Freq), 'mask_spk2': torch.Tensor(Batch, Frames, Freq), …, 'mask_spkn': torch.Tensor(Batch, Frames, Freq)}
- Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
####### Examples
>>> import torch
>>> from espnet2.enh.separator.svoice_separator import SVoiceSeparator
>>> separator = SVoiceSeparator(input_dim=256, enc_dim=128, kernel_size=8,
...                             hidden_size=128)  # hidden_size has no default
>>> input_tensor = torch.randn(10, 100, 256) # Batch of 10, 100 time steps
>>> input_lengths = torch.tensor([100] * 10) # All sequences have length 100
>>> outputs, lengths, others = separator(input_tensor, input_lengths)
NOTE
The additional argument of forward() is not used in this model but is included for compatibility with the general interface of the separator class.
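The segment_size parameter controls how the encoded sequence is cut into chunks for dual-path processing. The following is a minimal pure-Python sketch of that idea, assuming 50% overlap between neighbouring segments and zero-padding of the tail. The helper name `split_into_segments` is hypothetical, and the padding scheme is a simplification for illustration rather than the padding ESPnet applies internally.

```python
from typing import List, Sequence, Tuple


def split_into_segments(
    frames: Sequence[float], segment_size: int
) -> Tuple[List[List[float]], int]:
    """Cut a 1-D sequence into 50%-overlapping segments (illustrative only).

    Returns the list of segments and the number of zeros appended to the tail.
    """
    if segment_size < 2 or segment_size % 2 != 0:
        raise ValueError("segment_size must be an even integer >= 2")

    hop = segment_size // 2  # assumption: neighbouring segments overlap by half
    n = len(frames)

    # Pad the tail so that the last segment is complete.
    if n <= segment_size:
        pad = segment_size - n
    else:
        pad = (-(n - segment_size)) % hop
    padded = list(frames) + [0] * pad

    starts = range(0, len(padded) - segment_size + 1, hop)
    return [padded[s:s + segment_size] for s in starts], pad


# 100 encoder frames with the default segment_size of 20
segments, pad = split_into_segments(list(range(100)), segment_size=20)
print(len(segments), len(segments[0]), pad)
```

In a dual-path model, one recurrent pass typically runs along the frames inside each segment and a second pass runs across segments at the same within-segment position, so a smaller segment_size gives shorter intra-segment sequences and more segments.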
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input: Tensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor], Tensor, OrderedDict]
Run the SVoice forward pass on a batch of mixtures.
The input is passed through the encoder, the dual-path RNN, and the decoder to produce one separated signal per speaker. The attributes and parameters listed here describe the model configuration that this call uses.
enc_dim
Dimension of the encoder module’s output.
- Type: int
kernel_size
The kernel size of Conv1D layer in both encoder and decoder modules.
- Type: int
hidden_size
Dimension of the hidden state in RNN layers.
- Type: int
num_spk
The number of speakers in the output.
- Type: int
num_layers
Number of stacked MulCat blocks.
- Type: int
segment_size
Dual-path segment size.
- Type: int
bidirectional
Whether the RNN layers are bidirectional.
- Type: bool
input_normalize
Whether to apply GroupNorm on the input Tensor.
- Type: bool
Parameters:
- input_dim (int) – Dimension of the input feature.
- enc_dim (int) – Dimension of the encoder module’s output.
- kernel_size (int) – The kernel size of Conv1D layer in both encoder and decoder modules.
- hidden_size (int) – Dimension of the hidden state in RNN layers.
- num_spk (int, optional) – The number of speakers in the output. (Default: 2)
- num_layers (int, optional) – Number of stacked MulCat blocks. (Default: 4)
- segment_size (int, optional) – Dual-path segment size. (Default: 20)
- bidirectional (bool, optional) – Whether the RNN layers are bidirectional. (Default: True)
- input_normalize (bool, optional) – Whether to apply GroupNorm on the input Tensor. (Default: False)
Returns:
- masked: List of tensors containing separated audio signals for each speaker.
- ilens: Tensor containing the lengths of the input sequences.
- others: An OrderedDict containing any additional predicted data such as masks for each speaker.
Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
####### Examples
>>> model = SVoiceSeparator(input_dim=512, enc_dim=128,
... kernel_size=8, hidden_size=128)
>>> input_tensor = torch.randn(2, 100, 512) # Batch of 2, 100 time steps
>>> ilens = torch.tensor([100, 100]) # Lengths of each input
>>> outputs, ilens, others = model(input_tensor, ilens)
NOTE
The additional argument is not used in this model but is included for compatibility with other models.
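The kernel_size parameter determines how many frames a Conv1D encoder produces from a given number of input samples. The sketch below applies the standard output-length formula for an unpadded, undilated 1-D convolution. The stride of `kernel_size // 2` is an assumption made for illustration (this page does not state the stride), and `encoder_num_frames` is a hypothetical helper, not part of ESPnet.

```python
def encoder_num_frames(num_samples: int, kernel_size: int, stride: int) -> int:
    """Output length of a 1-D convolution with no padding and no dilation."""
    if kernel_size < 1 or stride < 1:
        raise ValueError("kernel_size and stride must be positive")
    if num_samples < kernel_size:
        return 0  # input is too short for a single window
    return (num_samples - kernel_size) // stride + 1


kernel_size = 8
stride = kernel_size // 2  # assumption: 50% overlap between encoder windows
print(encoder_num_frames(16000, kernel_size, stride))
```

Under this assumption, halving kernel_size roughly doubles the number of frames the RNN layers have to process for the same input.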
property num_spk