espnet2.enh.separator.fasnet_separator.FaSNetSeparator
class espnet2.enh.separator.fasnet_separator.FaSNetSeparator(input_dim: int, enc_dim: int, feature_dim: int, hidden_dim: int, layer: int, segment_size: int, num_spk: int, win_len: int, context_len: int, fasnet_type: str, dropout: float = 0.0, sr: int = 16000, predict_noise: bool = False)
Bases: AbsSeparator
FaSNetSeparator is a Filter-and-Sum Network (FaSNet) separator that inherits
from the AbsSeparator class. The model separates multi-channel audio into the specified number of speakers and supports both the original FaSNet and the Implicit FaSNet (iFaSNet) architectures.
num_spk
The number of speakers.
Type: int
Parameters:
- input_dim (int) – Required by AbsSeparator. Not used in this model.
- enc_dim (int) – Encoder dimension.
- feature_dim (int) – Feature dimension.
- hidden_dim (int) – Hidden dimension in DPRNN.
- layer (int) – Number of DPRNN blocks in iFaSNet.
- segment_size (int) – Dual-path segment size.
- num_spk (int) – Number of speakers.
- win_len (int) – Window length in milliseconds.
- context_len (int) – Context length in milliseconds. (Both are converted to samples using sr; see the sketch after this list.)
- fasnet_type (str) – 'fasnet' or 'ifasnet', selecting the original FaSNet or the Implicit FaSNet (iFaSNet).
- dropout (float , optional) – Dropout rate. Default is 0.0.
- sr (int , optional) – Sample rate of input audio. Default is 16000.
- predict_noise (bool , optional) – Whether to output the estimated noise signal. Default is False.
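Because win_len and context_len are given in milliseconds, the corresponding sample counts depend on sr. A minimal sketch of the conversion (the variable names are illustrative, not part of the API):
>>> sr = 16000                             # sample rate in Hz
>>> win_len_samples = sr * 20 // 1000      # 20 ms window -> 320 samples
>>> context_len_samples = sr * 10 // 1000  # 10 ms context -> 160 samples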
forward(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) -> Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]:
Performs the forward pass of the model.
- Returns: A tuple containing the separated audio signals, input lengths, and other predicted data (e.g., masks).
- Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
- Raises: AssertionError – If the input tensor does not have the expected shape (Batch, samples, channels).
Examples
>>> # Initialize the separator
>>> separator = FaSNetSeparator(
...     input_dim=1, enc_dim=256, feature_dim=256, hidden_dim=512,
...     layer=6, segment_size=10, num_spk=2, win_len=20,
...     context_len=10, fasnet_type='fasnet', dropout=0.1,
...     sr=16000, predict_noise=True,
... )
>>> # Forward pass
>>> input_tensor = torch.randn(4, 16000, 1)  # (Batch, samples, channels)
>>> ilens = torch.tensor([16000, 16000, 16000, 16000])  # Input lengths
>>> separated, lengths, others = separator.forward(input_tensor, ilens)
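The separated list holds one waveform per speaker. With predict_noise=True, the noise estimate is expected to appear in others as well; the key name (e.g., 'noise1', as used by other ESPnet separators) is an assumption here, so inspect the dictionary in your version:
>>> assert len(separated) == separator.num_spk  # one tensor per speaker
>>> print(list(others.keys()))  # may include a noise estimate, e.g. 'noise1'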
forward(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) -> Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
Perform the forward pass of the FaSNet separator.
This method processes the input audio tensor to separate the sources based on the model architecture defined during initialization. It takes in the input audio signal and its corresponding lengths, and returns the separated sources along with their lengths and additional predicted data such as masks.
- Parameters:
- input (torch.Tensor) – A tensor of shape (Batch, samples, channels) representing the input audio signals.
- ilens (torch.Tensor) – A tensor of shape (Batch,) containing the lengths of each input signal in the batch.
- additional (Dict or None) – A dictionary of additional data passed to the model. Not used in this implementation.
- Returns: A tuple containing:
- separated (List[torch.Tensor]): A list of tensors where each tensor represents the separated audio for each speaker. Shape: [(B, T, N), …]
- ilens (torch.Tensor): A tensor of shape (B,) containing the lengths of the separated signals.
- others (OrderedDict): A dictionary containing predicted data, e.g., masks for each speaker:
- 'mask_spk1': torch.Tensor(Batch, Frames, Freq),
- 'mask_spk2': torch.Tensor(Batch, Frames, Freq),
- …
- 'mask_spkn': torch.Tensor(Batch, Frames, Freq).
- Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
- Raises: AssertionError – If the input tensor does not have 3 dimensions.
Examples
>>> separator = FaSNetSeparator(input_dim=1, enc_dim=128,
... feature_dim=256, hidden_dim=512,
... layer=6, segment_size=400,
... num_spk=2, win_len=25,
... context_len=50, fasnet_type='fasnet')
>>> input_tensor = torch.randn(8, 16000, 1) # Batch of 8 audio signals
>>> input_lengths = torch.tensor([16000]*8) # Lengths of each signal
>>> separated_sources, lengths, masks = separator.forward(input_tensor,
... input_lengths)
NOTE
Ensure that the input tensor has the correct shape and that the number of speakers is set appropriately during initialization.
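The Implicit FaSNet variant is selected solely through fasnet_type. A sketch reusing the tensors from the example above, assuming iFaSNet accepts the same input shapes:
>>> isep = FaSNetSeparator(input_dim=1, enc_dim=128,
...                        feature_dim=256, hidden_dim=512,
...                        layer=6, segment_size=400,
...                        num_spk=2, win_len=25,
...                        context_len=50, fasnet_type='ifasnet')
>>> separated, lengths, others = isep(input_tensor, input_lengths)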
property num_spk
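The property simply exposes the speaker count set at construction:
>>> separator.num_spk
2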