espnet2.enh.layers.ifasnet.iFaSNet
class espnet2.enh.layers.ifasnet.iFaSNet(*args, **kwargs)
Bases: FaSNet_base
Implicit Filter-and-sum Network (iFaSNet) for multi-channel speech separation.
This model is based on the work by Luo et al. in “Implicit Filter-and-sum Network for Multi-channel Speech Separation”. It utilizes a context-aware architecture to improve speech separation quality by considering both past and future signals.
context
The number of context frames used in processing.
- Type: int
summ_BN
Linear layer for context compression.
- Type: nn.Linear
summ_RNN
RNN layer for context summarization.
- Type: dprnn.SingleRNN
summ_LN
Layer normalization for summarization.
- Type: nn.GroupNorm
summ_output
Linear layer for output generation.
- Type: nn.Linear
separator
The core separator module.
- Type: BF_module
encoder
Convolutional layer for encoding the input.
- Type: nn.Conv1d
decoder
Transpose convolutional layer for decoding.
- Type: nn.ConvTranspose1d
enc_LN
Layer normalization for encoder outputs.
- Type: nn.GroupNorm
gen_BN
Convolutional layer for generating filters.
- Type: nn.Conv1d
gen_RNN
RNN layer for generating filters.
- Type: dprnn.SingleRNN
gen_LN
Layer normalization for filter generation.
- Type: nn.GroupNorm
gen_output
Convolutional layer for final output.
- Type: nn.Conv1d
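The context mechanism above is the defining feature of iFaSNet: each frame is processed together with its surrounding past and future frames. The following is a minimal, hypothetical sketch of such context-window extraction (`split_with_context` is not part of ESPnet; the actual implementation organizes framing differently):

```python
import torch
import torch.nn.functional as F

def split_with_context(x, win_len, context):
    """Split a waveform into frames of ``win_len`` samples, each paired
    with ``context`` past and future frames (hypothetical helper for
    illustration only).

    x: (batch, length) waveform; length must be a multiple of win_len.
    Returns: (batch, num_frames, (2 * context + 1) * win_len)
    """
    batch, length = x.shape
    num_frames = length // win_len
    frames = x.view(batch, num_frames, win_len)
    # Zero-pad `context` frames on each side so edge frames also
    # receive a full context window.
    padded = F.pad(frames, (0, 0, context, context))
    windows = [padded[:, i:i + num_frames] for i in range(2 * context + 1)]
    return torch.cat(windows, dim=-1)

x = torch.randn(2, 16 * 10)  # 10 frames of 16 samples each
out = split_with_context(x, win_len=16, context=2)
print(out.shape)  # torch.Size([2, 10, 80])
```

Each output row concatenates 2 past frames, the center frame, and 2 future frames, which is what allows the network to "consider both past and future signals".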
Parameters:
- *args – Variable length argument list for base class initialization.
- **kwargs – Keyword arguments for initializing the base class, including: enc_dim (int): dimension of the encoder; feature_dim (int): dimension of the features; hidden_dim (int): dimension of the hidden layers; layer (int): number of RNN layers; segment_size (int): segment size for processing; nspk (int): number of speakers; win_len (int): window length; context_len (int): context length; sr (int): sampling rate.
Returns: The separated audio signals for each speaker.
Return type: Tensor
Raises: ValueError – If the input dimensions are not compatible with the model.
Examples
>>> model = iFaSNet(enc_dim=64, feature_dim=64, hidden_dim=128,
... layer=6, segment_size=24, nspk=2,
... win_len=16, context_len=16, sr=16000)
>>> input_tensor = torch.rand(3, 4, 32000) # (batch, num_mic, length)
>>> num_mic = torch.tensor([3, 3, 2])
>>> output = model(input_tensor, num_mic.long())
>>> print(output.shape) # (batch, nspk, length)
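In the example above, batch entries use different numbers of microphones. One way to build such a batch is to zero-pad the unused channels up to the maximum microphone count and record the valid count per sample in num_mic. Note this padding convention is an assumption based on the TAC reference implementation; verify it against your installed ESPnet version:

```python
import torch

# Hypothetical batch where samples 0 and 1 use 3 mics and sample 2
# uses 2. Unused channels are zero-padded up to the maximum count.
length = 32000
mic_counts = [3, 3, 2]
max_mic = max(mic_counts)
batch = []
for n in mic_counts:
    sig = torch.rand(n, length)                  # valid channels
    pad = torch.zeros(max_mic - n, length)       # zero-padded channels
    batch.append(torch.cat([sig, pad], dim=0))
input_tensor = torch.stack(batch)                # (batch, max_mic, length)
num_mic = torch.tensor(mic_counts).long()        # (batch,)
print(input_tensor.shape, num_mic.shape)  # torch.Size([3, 3, 32000]) torch.Size([3])
```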
NOTE
This implementation is based on the repository: https://github.com/yluo42/TAC and is licensed under CC BY-NC-SA 3.0 US.
forward(input, num_mic)
Perform the iFaSNet forward pass, separating each speaker from the multi-channel input mixture.
Parameters:
- input (torch.Tensor) – Multi-channel input mixture of shape (batch, num_mic, length).
- num_mic (torch.Tensor) – 1-D tensor of shape (batch,) giving the number of valid microphone channels for each sample.
Returns: The separated speech signals of shape (batch, nspk, length).
Return type: Tensor
Raises: ValueError – If input dimensions are not as expected.
Examples
>>> model = iFaSNet(enc_dim=64, feature_dim=64, hidden_dim=128,
... layer=6, segment_size=24, nspk=2,
... win_len=16, context_len=16, sr=16000)
>>> input_data = torch.rand(3, 4, 32000) # (batch, num_mic, length)
>>> num_mic = torch.tensor([3, 3, 2])
>>> output = model(input_data, num_mic)
>>> print(output.shape) # (batch, nspk, length)
NOTE
The model expects input tensors with specific dimensions, and it is important to ensure that the number of microphones and the length of the input match the model’s expectations.
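The shape expectations in the note above can be checked up front. The helper below is hypothetical (not part of ESPnet) and only encodes the documented shapes, a 3-D (batch, num_mic, length) input and a 1-D (batch,) num_mic tensor:

```python
import torch

def check_ifasnet_inputs(input_tensor, num_mic):
    """Minimal sanity checks before calling model(input, num_mic).

    Hypothetical helper; it encodes only the shape expectations
    described in the documentation above.
    """
    if input_tensor.dim() != 3:
        raise ValueError(
            "input must be (batch, num_mic, length), got shape "
            f"{tuple(input_tensor.shape)}"
        )
    batch, n_mic, _ = input_tensor.shape
    if num_mic.dim() != 1 or num_mic.shape[0] != batch:
        raise ValueError("num_mic must be a 1-D tensor of shape (batch,)")
    if int(num_mic.max()) > n_mic:
        raise ValueError("num_mic entries exceed available input channels")

x = torch.rand(3, 4, 32000)
n = torch.tensor([3, 3, 2])
check_ifasnet_inputs(x, n)  # passes silently
```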