espnet2.enh.separator.svoice_separator.Encoder

class espnet2.enh.separator.svoice_separator.Encoder(enc_kernel_size: int, enc_feat_dim: int)

Bases: Module

Encoder module for processing input signals.

This module utilizes a 1D convolutional layer followed by a ReLU activation function to transform the input mixture signal into a feature representation.

conv

A convolutional layer that applies a 1D convolution to the input signal.

Type: nn.Conv1d

nonlinear

A ReLU activation function applied to the output of the convolutional layer.

Type: nn.ReLU

Parameters:
- enc_kernel_size (int) – The size of the kernel used in the convolutional layer.
- enc_feat_dim (int) – The dimension of the feature output from the encoder.

####### Examples

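>>> import torch
>>> from espnet2.enh.separator.svoice_separator import Encoder
>>> encoder = Encoder(enc_kernel_size=8, enc_feat_dim=128)
>>> mixture = torch.randn(10, 160)  # Batch of 10 signals, 160 samples each
>>> output = encoder(mixture)
>>> output.shape  # 39 frames assumes stride = enc_kernel_size // 2 and no padding
torch.Size([10, 128, 39])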

NOTE

The input mixture signal should have a shape of [batch_size, signal_length].
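The description above (a 1D convolution followed by a ReLU) can be sketched roughly as follows. This is a minimal illustration, not the ESPnet source: the class name `SketchEncoder` is hypothetical, and the stride (half the kernel size) and the absence of padding are assumptions that the text above does not state.

```python
import torch
from torch import nn


class SketchEncoder(nn.Module):
    """Hypothetical stand-in for the documented Encoder (not the ESPnet source)."""

    def __init__(self, enc_kernel_size: int, enc_feat_dim: int):
        super().__init__()
        # Assumption: stride is half the kernel size (50% frame overlap), no padding.
        self.conv = nn.Conv1d(
            in_channels=1,
            out_channels=enc_feat_dim,
            kernel_size=enc_kernel_size,
            stride=enc_kernel_size // 2,
        )
        self.nonlinear = nn.ReLU()

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # [batch_size, signal_length] -> [batch_size, 1, signal_length]
        mixture = mixture.unsqueeze(1)
        # Convolve, then clamp negative activations to zero.
        return self.nonlinear(self.conv(mixture))


encoder = SketchEncoder(enc_kernel_size=8, enc_feat_dim=128)
features = encoder(torch.randn(10, 160))
print(features.shape[:2])  # batch size and feature dimension
```

The `unsqueeze(1)` step is needed because `nn.Conv1d` expects a channel axis, while the mixture is given as a two-dimensional [batch_size, signal_length] tensor.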

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(mixture)

Encodes the input mixture signal into a feature representation.

The mixture is passed through the 1D convolutional layer and then through the ReLU activation.

Parameters:
- mixture (torch.Tensor) – Mixture signal of shape [B, T], where B is the batch size and T is the signal length in samples.
Returns: Encoded feature of shape [B, N, L], where N is enc_feat_dim and L is the number of frames produced by the convolution.
Return type: torch.Tensor

####### Examples

>>> encoder = Encoder(enc_kernel_size=8, enc_feat_dim=128)
>>> mixture = torch.randn(2, 160)  # Batch of 2 signals, 160 samples each
>>> features = encoder.forward(mixture)
>>> features.shape[:2]  # (batch size, enc_feat_dim)
torch.Size([2, 128])

NOTE

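The output time dimension generally differs from the input signal length, because the number of frames is determined by the kernel size and stride of the convolution.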
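To make the change in the time dimension concrete, the output length of a 1D convolution can be computed with the formula given in the PyTorch `Conv1d` documentation. The helper name `conv1d_output_length` is hypothetical, and the stride of 4 (half of a kernel size of 8) is an assumed setting for illustration only.

```python
def conv1d_output_length(length: int, kernel_size: int, stride: int,
                         padding: int = 0, dilation: int = 1) -> int:
    """Number of output frames of a 1D convolution (PyTorch Conv1d formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Assumed setting: kernel size 8, stride 4 (half the kernel size), no padding.
frames = conv1d_output_length(length=160, kernel_size=8, stride=4)
print(frames)  # 39
```

Under these assumptions a 160-sample signal yields 39 frames, so any length bookkeeping downstream has to use the frame count, not the sample count.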