espnet2.enh.separator.tfgridnetv2_separator.TFGridNetV2
class espnet2.enh.separator.tfgridnetv2_separator.TFGridNetV2(input_dim, n_srcs=2, n_fft=128, stride=64, window='hann', n_imics=1, n_layers=6, lstm_hidden_units=192, attn_n_head=4, attn_approx_qk_dim=512, emb_dim=48, emb_ks=4, emb_hs=1, activation='prelu', eps=1e-05, use_builtin_complex=False)
Bases: AbsSeparator
Offline TFGridNetV2 for speech separation.
Compared to TFGridNet, TFGridNetV2 enhances performance by vectorizing multiple heads in self-attention and improving the handling of Deconv1D in each intra- and inter-block when emb_ks equals emb_hs.
References
[1] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation”, in TASLP, 2023.
[2] Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Making Time-Frequency Domain Models Great Again for Monaural Speaker Separation”, in ICASSP, 2023.
NOTE
This model performs optimally when trained with variance-normalized mixture inputs and targets. For a mixture tensor of shape [batch, samples, microphones], normalize it by dividing with torch.std(mixture, (1, 2)). Apply the same normalization to the target signals. This is particularly important when not using scale-invariant loss functions such as SI-SDR. Specifically, use:
std_ = std(mix)
mix = mix / std_
tgt = tgt / std_
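For example, a minimal PyTorch sketch of this normalization (the tensor names mix and tgt are placeholders; keepdim=True is added here so the division broadcasts over the [batch, samples, microphones] shape):
>>> std_ = torch.std(mix, dim=(1, 2), keepdim=True)  # shape [B, 1, 1]
>>> mix = mix / std_
>>> tgt = tgt / std_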
- Parameters:
- input_dim (int) – Placeholder, not used.
- n_srcs (int) – Number of output sources/speakers.
- n_fft (int) – STFT window size.
- stride (int) – STFT stride.
- window (str) – STFT window type; choose between ‘hamming’, ‘hanning’, or None (the constructor default is ‘hann’).
- n_imics (int) – Number of microphone channels (only fixed-array geometry supported).
- n_layers (int) – Number of TFGridNetV2 blocks.
- lstm_hidden_units (int) – Number of hidden units in LSTM.
- attn_n_head (int) – Number of heads in self-attention.
- attn_approx_qk_dim (int) – Approximate dimension of frame-level key and value tensors.
- emb_dim (int) – Embedding dimension.
- emb_ks (int) – Kernel size for unfolding and Deconv1D.
- emb_hs (int) – Hop size for unfolding and Deconv1D.
- activation (str) – Activation function to use in the entire TFGridNetV2 model. Can use any torch-supported activation (e.g., ‘relu’ or ‘elu’).
- eps (float) – Small epsilon for normalization layers.
- use_builtin_complex (bool) – Whether to use built-in complex type or not.
Examples
>>> model = TFGridNetV2(n_srcs=2, n_fft=256)
>>> input_tensor = torch.randn(8, 512, 1) # [B, N, M]
>>> ilens = torch.tensor([512] * 8) # input lengths
>>> enhanced, lengths, _ = model(input_tensor, ilens)
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input: Tensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor], Tensor, OrderedDict]
Separate a batch of multi-channel mixtures into single-channel source estimates.
- Parameters:
- input (torch.Tensor) – Batched multi-channel audio tensor with N samples and M microphone channels, shape [B, N, M].
- ilens (torch.Tensor) – Input lengths, shape [B].
- additional (Dict or None) – Other data; currently unused in this model.
- Returns: A tuple (enhanced, ilens, additional), where enhanced is a list of n_srcs mono audio tensors of shape (B, T), ilens is a tensor of shape (B,), and additional is an OrderedDict of auxiliary output data.
Examples
>>> model = TFGridNetV2(n_srcs=2, n_fft=256)
>>> input_tensor = torch.randn(10, 256, 1)  # [B, N, M]
>>> ilens = torch.tensor([256] * 10)  # input lengths
>>> enhanced, ilens, _ = model(input_tensor, ilens)
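Each element of enhanced is one separated source; a minimal continuation of the example above (the variable name s0 is illustrative, not part of the API):
>>> # enhanced is a list of n_srcs tensors, each a mono waveform of shape (B, T)
>>> s0 = enhanced[0]  # estimate for the first source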
property num_spk
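Number of output sources/speakers (equal to the n_srcs constructor argument).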
static pad2(input_tensor, target_len)
Pad input_tensor with trailing zeros along its last dimension so that the last dimension has size target_len.
- Parameters:
- input_tensor (torch.Tensor) – Tensor to pad along its last dimension.
- target_len (int) – Desired size of the last dimension after padding.
- Returns: The zero-padded tensor.
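Examples
A minimal usage sketch, assuming pad2 zero-pads the last dimension up to target_len (the shapes shown are illustrative):
>>> x = torch.randn(2, 100)  # (batch, samples)
>>> y = TFGridNetV2.pad2(x, 128)  # pad last dim from 100 to 128
>>> y.shape
torch.Size([2, 128])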