espnet2.enh.layers.ifasnet.iFaSNet
class espnet2.enh.layers.ifasnet.iFaSNet(*args, **kwargs)
Bases: FaSNet_base
Implicit Filter-and-sum Network (iFaSNet) for multi-channel speech separation.
This model is based on the work by Luo et al. in “Implicit Filter-and-sum Network for Multi-channel Speech Separation”. It utilizes a context-aware architecture to improve speech separation quality by considering both past and future signals.
context
The number of context frames used in processing.
- Type: int
summ_BN
Linear layer for context compression.
- Type: nn.Linear
summ_RNN
RNN layer for context summarization.
- Type: dprnn.SingleRNN
summ_LN
Layer normalization for summarization.
- Type: nn.GroupNorm
summ_output
Linear layer for output generation.
- Type: nn.Linear
separator
The core separator module.
- Type: BF_module
encoder
Convolutional layer for encoding the input.
- Type: nn.Conv1d
decoder
Transpose convolutional layer for decoding.
- Type: nn.ConvTranspose1d
enc_LN
Layer normalization for encoder outputs.
- Type: nn.GroupNorm
gen_BN
Convolutional layer for generating filters.
- Type: nn.Conv1d
gen_RNN
RNN layer for generating filters.
- Type: dprnn.SingleRNN
gen_LN
Layer normalization for filter generation.
- Type: nn.GroupNorm
gen_output
Convolutional layer for final output.
- Type: nn.Conv1d
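The context mechanism above is the defining feature of iFaSNet: each frame is processed together with its surrounding past and future frames. The following is a minimal, hypothetical sketch of such context-window extraction (`split_with_context` is not part of ESPnet; the actual implementation organizes framing differently):

```python
import torch
import torch.nn.functional as F

def split_with_context(x, win_len, context):
    """Split a waveform into frames of ``win_len`` samples, each paired
    with ``context`` past and future frames (hypothetical helper for
    illustration only).

    x: (batch, length) waveform; length must be a multiple of win_len.
    Returns: (batch, num_frames, (2 * context + 1) * win_len)
    """
    batch, length = x.shape
    num_frames = length // win_len
    frames = x.view(batch, num_frames, win_len)
    # Zero-pad `context` frames on each side so edge frames also
    # receive a full context window.
    padded = F.pad(frames, (0, 0, context, context))
    windows = [padded[:, i:i + num_frames] for i in range(2 * context + 1)]
    return torch.cat(windows, dim=-1)

x = torch.randn(2, 16 * 10)  # 10 frames of 16 samples each
out = split_with_context(x, win_len=16, context=2)
print(out.shape)  # torch.Size([2, 10, 80])
```

Each output row concatenates 2 past frames, the center frame, and 2 future frames, which is what allows the network to "consider both past and future signals".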
Parameters:
- *args – Variable length argument list for base class initialization.
- **kwargs – Keyword arguments for initializing the base class, including: enc_dim (int): dimension of the encoder; feature_dim (int): dimension of the features; hidden_dim (int): dimension of the hidden layers; layer (int): number of RNN layers; segment_size (int): segment size for processing; nspk (int): number of speakers; win_len (int): window length; context_len (int): context length; sr (int): sampling rate.
Returns: The separated audio signals for each speaker.
Return type: Tensor
Raises: ValueError – If the input dimensions are not compatible with the model.
Examples
>>> model = iFaSNet(enc_dim=64, feature_dim=64, hidden_dim=128,
... layer=6, segment_size=24, nspk=2,
... win_len=16, context_len=16, sr=16000)
>>> input_tensor = torch.rand(3, 4, 32000) # (batch, num_mic, length)
>>> num_mic = torch.tensor([3, 3, 2])
>>> output = model(input_tensor, num_mic.long())
>>> print(output.shape) # (batch, nspk, length)
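In the example above, batch entries use different numbers of microphones. One way to build such a batch is to zero-pad the unused channels up to the maximum microphone count and record the valid count per sample in num_mic. Note this padding convention is an assumption based on the TAC reference implementation; verify it against your installed ESPnet version:

```python
import torch

# Hypothetical batch where samples 0 and 1 use 3 mics and sample 2
# uses 2. Unused channels are zero-padded up to the maximum count.
length = 32000
mic_counts = [3, 3, 2]
max_mic = max(mic_counts)
batch = []
for n in mic_counts:
    sig = torch.rand(n, length)                  # valid channels
    pad = torch.zeros(max_mic - n, length)       # zero-padded channels
    batch.append(torch.cat([sig, pad], dim=0))
input_tensor = torch.stack(batch)                # (batch, max_mic, length)
num_mic = torch.tensor(mic_counts).long()        # (batch,)
print(input_tensor.shape, num_mic.shape)  # torch.Size([3, 3, 32000]) torch.Size([3])
```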
NOTE
This implementation is based on the repository: https://github.com/yluo42/TAC and is licensed under CC BY-NC-SA 3.0 US.
forward(input, num_mic)
Perform the iFaSNet forward pass, separating each speaker from the multi-channel input mixture.
Parameters:
- input (torch.Tensor) – Multi-channel input mixture of shape (batch, num_mic, length).
- num_mic (torch.Tensor) – 1-D tensor of shape (batch,) giving the number of valid microphone channels for each sample.
Returns: The separated speech signals of shape (batch, nspk, length).
Return type: Tensor
Raises: ValueError – If input dimensions are not as expected.
Examples
>>> model = iFaSNet(enc_dim=64, feature_dim=64, hidden_dim=128,
... layer=6, segment_size=24, nspk=2,
... win_len=16, context_len=16, sr=16000)
>>> input_data = torch.rand(3, 4, 32000) # (batch, num_mic, length)
>>> num_mic = torch.tensor([3, 3, 2])
>>> output = model(input_data, num_mic)
>>> print(output.shape) # (batch, nspk, length)
NOTE
The model expects input tensors with specific dimensions, and it is important to ensure that the number of microphones and the length of the input match the model’s expectations.
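The shape expectations in the note above can be checked up front. The helper below is hypothetical (not part of ESPnet) and only encodes the documented shapes, a 3-D (batch, num_mic, length) input and a 1-D (batch,) num_mic tensor:

```python
import torch

def check_ifasnet_inputs(input_tensor, num_mic):
    """Minimal sanity checks before calling model(input, num_mic).

    Hypothetical helper; it encodes only the shape expectations
    described in the documentation above.
    """
    if input_tensor.dim() != 3:
        raise ValueError(
            "input must be (batch, num_mic, length), got shape "
            f"{tuple(input_tensor.shape)}"
        )
    batch, n_mic, _ = input_tensor.shape
    if num_mic.dim() != 1 or num_mic.shape[0] != batch:
        raise ValueError("num_mic must be a 1-D tensor of shape (batch,)")
    if int(num_mic.max()) > n_mic:
        raise ValueError("num_mic entries exceed available input channels")

x = torch.rand(3, 4, 32000)
n = torch.tensor([3, 3, 2])
check_ifasnet_inputs(x, n)  # passes silently
```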