espnet2.enh.separator.svoice_separator.Decoder
class espnet2.enh.separator.svoice_separator.Decoder(kernel_size)
Bases: Module
Decoder module for reconstructing audio signals from estimated sources.
The Decoder takes the estimated source signals and applies an average pooling operation followed by an overlap-and-add procedure to reconstruct the time-domain signal from its framed representation.
- Parameters: kernel_size (int) – The size of the kernel used for the average pooling operation.
- Returns: The reconstructed time-domain signal from the estimated sources.
- Return type: torch.Tensor
####### Examples
>>> decoder = Decoder(kernel_size=8)
>>> est_source = torch.rand(1, 2, 10, 8) # Example estimated source tensor
>>> reconstructed_signal = decoder(est_source)
>>> print(reconstructed_signal.shape)
torch.Size([1, output_length]) # output_length depends on the number of frames and the kernel size.
NOTE
The overlap-and-add method requires that the kernel size is greater than zero.
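To make the overlap-and-add step concrete, the following is a minimal, loop-based sketch of the operation. The function name overlap_and_add_sketch and the fixed hop between frames are illustrative assumptions; the actual ESPnet helper may be vectorized and differ in signature.

import torch

def overlap_and_add_sketch(frames: torch.Tensor, hop: int) -> torch.Tensor:
    # frames: [..., num_frames, frame_length]; hop must be > 0.
    *outer, num_frames, frame_length = frames.shape
    output_length = (num_frames - 1) * hop + frame_length
    output = frames.new_zeros(*outer, output_length)
    for i in range(num_frames):
        start = i * hop
        # Overlapping regions of consecutive frames are summed.
        output[..., start : start + frame_length] += frames[..., i, :]
    return output

For example, overlap_and_add_sketch(torch.ones(2, 3, 4), hop=2) returns a tensor of shape torch.Size([2, 8]), since (3 - 1) * 2 + 4 = 8.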
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(est_source)
Reconstruct the time-domain signal from the framed source estimates.
This method applies an average pooling operation over the kernel dimension of the estimated sources, followed by an overlap-and-add procedure that merges the overlapping frames back into a time-domain waveform.
Parameters:
- est_source (torch.Tensor) – Estimated source tensor in framed representation, e.g. of shape [B, C, frames, kernel_size], where B is the batch size and C is the number of estimated sources.
Returns: The reconstructed time-domain signal for the estimated sources.
Return type: torch.Tensor
####### Examples
>>> decoder = Decoder(kernel_size=8)
>>> est_source = torch.rand(1, 2, 10, 8)  # [B, C, frames, kernel_size]
>>> reconstructed_signal = decoder(est_source)
NOTE
The length of the reconstructed signal depends on the framing and may differ from the original waveform length. Ensure that the output is padded or trimmed back to the original length for proper alignment.
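As a hypothetical illustration of this alignment step (the helper below is not part of the ESPnet API), the output can be zero-padded or trimmed to the original length:

import torch.nn.functional as F

def match_length(signal, target_length):
    # Zero-pad or trim the last dimension to exactly target_length samples.
    if signal.size(-1) < target_length:
        return F.pad(signal, (0, target_length - signal.size(-1)))
    return signal[..., :target_length]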