espnet2.enh.separator.svoice_separator.Decoder
class espnet2.enh.separator.svoice_separator.Decoder(kernel_size)
Bases: Module
Decoder module for reconstructing audio signals from estimated sources.
The Decoder takes the estimated source signals and applies an average pooling operation followed by an overlap-and-add procedure to reconstruct the time-domain signal from its framed representation.
- Parameters: kernel_size (int) – The size of the kernel used for the average pooling operation.
- Returns: The reconstructed time-domain signal from the estimated sources.
- Return type: torch.Tensor
####### Examples
>>> decoder = Decoder(kernel_size=8)
>>> est_source = torch.rand(1, 2, 10, 8) # Example estimated source tensor
>>> reconstructed_signal = decoder(est_source)
>>> print(reconstructed_signal.shape)
torch.Size([1, output_length]) # output_length depends on the number of frames and the kernel size.
NOTE
The overlap-and-add method requires that the kernel size is greater than zero.
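To make the overlap-and-add step concrete, the following is a minimal, loop-based sketch of the operation. The function name overlap_and_add_sketch and the fixed hop between frames are illustrative assumptions; the actual ESPnet helper may be vectorized and differ in signature.

import torch

def overlap_and_add_sketch(frames: torch.Tensor, hop: int) -> torch.Tensor:
    # frames: [..., num_frames, frame_length]; hop must be > 0.
    *outer, num_frames, frame_length = frames.shape
    output_length = (num_frames - 1) * hop + frame_length
    output = frames.new_zeros(*outer, output_length)
    for i in range(num_frames):
        start = i * hop
        # Overlapping regions of consecutive frames are summed.
        output[..., start : start + frame_length] += frames[..., i, :]
    return output

For example, overlap_and_add_sketch(torch.ones(2, 3, 4), hop=2) returns a tensor of shape torch.Size([2, 8]), since (3 - 1) * 2 + 4 = 8.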
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(est_source)
Reconstruct the time-domain signal from the framed source estimates.
This method applies an average pooling operation over the kernel dimension of the estimated sources, followed by an overlap-and-add procedure that merges the overlapping frames back into a time-domain waveform.
Parameters:
- est_source (torch.Tensor) – Estimated source tensor in framed representation, e.g. of shape [B, C, frames, kernel_size], where B is the batch size and C is the number of estimated sources.
Returns: The reconstructed time-domain signal for the estimated sources.
Return type: torch.Tensor
####### Examples
>>> decoder = Decoder(kernel_size=8)
>>> est_source = torch.rand(1, 2, 10, 8)  # [B, C, frames, kernel_size]
>>> reconstructed_signal = decoder(est_source)
NOTE
The length of the reconstructed signal depends on the framing and may differ from the original waveform length. Ensure that the output is padded or trimmed back to the original length for proper alignment.
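As a hypothetical illustration of this alignment step (the helper below is not part of the ESPnet API), the output can be zero-padded or trimmed to the original length:

import torch.nn.functional as F

def match_length(signal, target_length):
    # Zero-pad or trim the last dimension to exactly target_length samples.
    if signal.size(-1) < target_length:
        return F.pad(signal, (0, target_length - signal.size(-1)))
    return signal[..., :target_length]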