espnet2.enh.separator.tfgridnetv2_separator.AllHeadPReLULayerNormalization4DCF
class espnet2.enh.separator.tfgridnetv2_separator.AllHeadPReLULayerNormalization4DCF(input_dimension, eps=1e-05)
Bases: Module
AllHeadPReLULayerNormalization4DCF applies a PReLU activation followed by
layer normalization computed separately for each attention head.
This class normalizes the input tensor along the embedding and frequency dimensions of each head, enabling stable training of models that use multiple attention heads, such as TFGridNetV2. Internally it works with tensors shaped as [B, H, E, T, F], where B is the batch size, H is the number of heads, E is the embedding dimension, T is the sequence length, and F is the number of frequency bins.
gamma
Scale parameter for normalization.
- Type: Parameter
beta
Shift parameter for normalization.
- Type: Parameter
act
PReLU activation function applied to the input.
- Type: PReLU
eps
Small value to avoid division by zero in normalization.
- Type: float
H
Number of heads.
- Type: int
E
Embedding dimension.
- Type: int
n_freqs
Number of frequency bins.
- Type: int
Parameters:
- input_dimension (Tuple[int, int, int]) – The input dimensions (H, E, n_freqs).
- eps (float , optional) – Small epsilon for numerical stability. Default is 1e-5.
Raises: AssertionError – If input_dimension does not have a length of 3.
Examples
>>> import torch
>>> layer_norm = AllHeadPReLULayerNormalization4DCF((4, 512, 128))
>>> input_tensor = torch.randn(32, 4 * 512, 100, 128)  # [B, H*E, T, F]
>>> output_tensor = layer_norm(input_tensor)
>>> output_tensor.shape
torch.Size([32, 4, 512, 100, 128])  # output is viewed per head as [B, H, E, T, F]
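The per-head statistics behind this normalization can be sketched in plain PyTorch. The following is a minimal illustration of the technique, not the ESPnet implementation itself; the tensor sizes and the epsilon value are assumptions chosen to match the shape descriptions above:

import torch

B, H, E, T, F = 2, 4, 16, 10, 65  # small illustrative sizes
x = torch.randn(B, H, E, T, F)  # activated input, [B, H, E, T, F]
# Statistics are shared across the embedding (dim 2) and frequency (dim 4)
# dimensions, but kept separate per batch, head, and time frame.
mu = x.mean(dim=(2, 4), keepdim=True)  # [B, H, 1, T, 1]
var = x.var(dim=(2, 4), unbiased=False, keepdim=True)  # [B, H, 1, T, 1]
y = (x - mu) / torch.sqrt(var + 1e-5)  # normalized, same shape as x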
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Perform the forward pass of the AllHeadPReLULayerNormalization4DCF layer.
This method views the 4-D input as [B, H, E, T, F], applies the PReLU activation, and then normalizes each head using statistics computed over the embedding and frequency dimensions, scaling the result by gamma and shifting it by beta.
- Parameters:
- x (torch.Tensor) – Input tensor shaped [B, H*E, T, F], where B is the batch size, H the number of heads, E the embedding dimension, T the sequence length, and F the number of frequency bins.
- Returns: The activated and normalized tensor, shaped [B, H, E, T, F].
- Return type: torch.Tensor
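In equation form (a reconstruction from the shape descriptions above; statistics are computed per batch index b, head h, and time frame t):

\hat{x}_{b,h,e,t,f} = \gamma_{h,e,f} \cdot \frac{x_{b,h,e,t,f} - \mu_{b,h,t}}{\sqrt{\sigma^2_{b,h,t} + \epsilon}} + \beta_{h,e,f}

where \mu_{b,h,t} and \sigma^2_{b,h,t} are the mean and variance of the activated input over the embedding index e and the frequency index f.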
Examples
>>> layer_norm = AllHeadPReLULayerNormalization4DCF((4, 512, 128))
>>> x = torch.randn(32, 4 * 512, 100, 128)  # [B, H*E, T, F]
>>> y = layer_norm.forward(x)
>>> y.shape
torch.Size([32, 4, 512, 100, 128])
NOTE
The TFGridNetV2 model in which this layer is used works best when trained with variance-normalized mixture input and target. For instance, normalize the mixture and target signals as follows:
std_ = torch.std(mixture, dim=(1, 2), keepdim=True)
mixture = mixture / std_
target = target / std_
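A self-contained version of that recipe (shapes and variable names are illustrative; keepdim=True keeps std_ broadcastable against the [B, N, M] signals):

import torch

B, N, M = 4, 16000, 1  # batch, samples, microphones (assumed sizes)
mixture = torch.randn(B, N, M)
target = torch.randn(B, N, M)

# Per-utterance standard deviation over samples and channels, shape [B, 1, 1].
std_ = torch.std(mixture, dim=(1, 2), keepdim=True)
mixture = mixture / std_  # variance-normalized mixture
target = target / std_  # scale targets with the same statistic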