espnet2.enh.separator.tfgridnetv2_separator.AllHeadPReLULayerNormalization4DCF
class espnet2.enh.separator.tfgridnetv2_separator.AllHeadPReLULayerNormalization4DCF(input_dimension, eps=1e-05)
Bases: Module
AllHeadPReLULayerNormalization4DCF applies a PReLU activation followed by
layer normalization computed separately for each attention head.
This class normalizes the input tensor along the embedding and frequency dimensions of each head, enabling stable training of models that use multiple attention heads, such as TFGridNetV2. Internally it works with tensors shaped as [B, H, E, T, F], where B is the batch size, H is the number of heads, E is the embedding dimension, T is the sequence length, and F is the number of frequency bins.
gamma
Scale parameter for normalization.
- Type: Parameter
beta
Shift parameter for normalization.
- Type: Parameter
act
PReLU activation function applied to the input.
- Type: PReLU
eps
Small value to avoid division by zero in normalization.
- Type: float
H
Number of heads.
- Type: int
E
Embedding dimension.
- Type: int
n_freqs
Number of frequency bins.
- Type: int
Parameters:
- input_dimension (Tuple[int, int, int]) – The input dimensions (H, E, n_freqs).
- eps (float , optional) – Small epsilon for numerical stability. Default is 1e-5.
Raises: AssertionError – If input_dimension does not have a length of 3.
Examples
>>> import torch
>>> layer_norm = AllHeadPReLULayerNormalization4DCF((4, 512, 128))
>>> input_tensor = torch.randn(32, 4 * 512, 100, 128)  # [B, H*E, T, F]
>>> output_tensor = layer_norm(input_tensor)
>>> output_tensor.shape
torch.Size([32, 4, 512, 100, 128])  # output is viewed per head as [B, H, E, T, F]
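The per-head statistics behind this normalization can be sketched in plain PyTorch. The following is a minimal illustration of the technique, not the ESPnet implementation itself; the tensor sizes and the epsilon value are assumptions chosen to match the shape descriptions above:

import torch

B, H, E, T, F = 2, 4, 16, 10, 65  # small illustrative sizes
x = torch.randn(B, H, E, T, F)  # activated input, [B, H, E, T, F]
# Statistics are shared across the embedding (dim 2) and frequency (dim 4)
# dimensions, but kept separate per batch, head, and time frame.
mu = x.mean(dim=(2, 4), keepdim=True)  # [B, H, 1, T, 1]
var = x.var(dim=(2, 4), unbiased=False, keepdim=True)  # [B, H, 1, T, 1]
y = (x - mu) / torch.sqrt(var + 1e-5)  # normalized, same shape as x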
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Perform the forward pass of the AllHeadPReLULayerNormalization4DCF layer.
This method views the 4-D input as [B, H, E, T, F], applies the PReLU activation, and then normalizes each head using statistics computed over the embedding and frequency dimensions, scaling the result by gamma and shifting it by beta.
- Parameters:
- x (torch.Tensor) – Input tensor shaped [B, H*E, T, F], where B is the batch size, H the number of heads, E the embedding dimension, T the sequence length, and F the number of frequency bins.
- Returns: The activated and normalized tensor, shaped [B, H, E, T, F].
- Return type: torch.Tensor
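In equation form (a reconstruction from the shape descriptions above; statistics are computed per batch index b, head h, and time frame t):

\hat{x}_{b,h,e,t,f} = \gamma_{h,e,f} \cdot \frac{x_{b,h,e,t,f} - \mu_{b,h,t}}{\sqrt{\sigma^2_{b,h,t} + \epsilon}} + \beta_{h,e,f}

where \mu_{b,h,t} and \sigma^2_{b,h,t} are the mean and variance of the activated input over the embedding index e and the frequency index f.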
Examples
>>> layer_norm = AllHeadPReLULayerNormalization4DCF((4, 512, 128))
>>> x = torch.randn(32, 4 * 512, 100, 128)  # [B, H*E, T, F]
>>> y = layer_norm.forward(x)
>>> y.shape
torch.Size([32, 4, 512, 100, 128])
NOTE
The TFGridNetV2 model in which this layer is used works best when trained with variance-normalized mixture input and target. For instance, normalize the mixture and target signals as follows:
std_ = torch.std(mixture, dim=(1, 2), keepdim=True)
mixture = mixture / std_
target = target / std_
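A self-contained version of that recipe (shapes and variable names are illustrative; keepdim=True keeps std_ broadcastable against the [B, N, M] signals):

import torch

B, N, M = 4, 16000, 1  # batch, samples, microphones (assumed sizes)
mixture = torch.randn(B, N, M)
target = torch.randn(B, N, M)

# Per-utterance standard deviation over samples and channels, shape [B, 1, 1].
std_ = torch.std(mixture, dim=(1, 2), keepdim=True)
mixture = mixture / std_  # variance-normalized mixture
target = target / std_  # scale targets with the same statistic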