espnet2.enh.separator.tfgridnetv2_separator.LayerNormalization4DCF
class espnet2.enh.separator.tfgridnetv2_separator.LayerNormalization4DCF(input_dimension, eps=1e-05)
Bases: Module
LayerNormalization4DCF is a layer normalization module for 4-D time-frequency representations of shape [B, C, T, F].
It normalizes activations across the channel and frequency dimensions, which improves stability and performance in time-frequency domain processing.
gamma
Learnable scale parameter for normalization.
- Type: Parameter
beta
Learnable shift parameter for normalization.
- Type: Parameter
eps
A small constant added to the variance for numerical stability.
- Type: float
Parameters:
- input_dimension (Tuple[int, int]) – A tuple representing the input dimensions, where the first element is the number of features and the second is the number of frequency bins.
- eps (float , optional) – A small value to prevent division by zero during normalization. Defaults to 1e-5.
Raises: ValueError – If the input tensor does not have 4 dimensions.
####### Examples
>>> import torch
>>> layer_norm = LayerNormalization4DCF((128, 64))
>>> input_tensor = torch.randn(32, 128, 10, 64) # [B, C, T, F]
>>> output_tensor = layer_norm(input_tensor)
>>> print(output_tensor.shape) # Should be [32, 128, 10, 64]
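For intuition, the computation can be sketched as below. This is a minimal illustrative re-implementation, not the module's exact code: it assumes the statistics are pooled jointly over the channel and frequency dimensions of a [B, C, T, F] input and that gamma and beta have shape [1, C, 1, F], consistent with the attribute and parameter descriptions above; layer_norm_4dcf is a hypothetical helper name.
>>> import torch
>>> def layer_norm_4dcf(x, gamma, beta, eps=1e-5):
...     # x: [B, C, T, F]; gamma, beta: [1, C, 1, F] (assumed shapes)
...     mu = x.mean(dim=(1, 3), keepdim=True)                    # per-sample, per-frame mean
...     var = x.var(dim=(1, 3), unbiased=False, keepdim=True)    # matching variance
...     return (x - mu) / torch.sqrt(var + eps) * gamma + beta
>>> x = torch.randn(32, 128, 10, 64)                             # [B, C, T, F]
>>> y = layer_norm_4dcf(x, torch.ones(1, 128, 1, 64), torch.zeros(1, 128, 1, 64))
>>> y.shape
torch.Size([32, 128, 10, 64])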
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
Forward pass through the TFGridNetV2 model.
This method processes the input audio tensor and produces enhanced audio signals for each source. The input is first normalized and then passed through the model’s layers to obtain the enhanced outputs.
- Parameters:
- input (torch.Tensor) – Batched multi-channel audio tensor with M audio channels and N samples of shape [B, N, M].
- ilens (torch.Tensor) – Input lengths for each sample in the batch, shape [B].
- additional (Dict or None) – Other data that can be passed to the model, currently unused.
- Returns: A tuple containing:
  - enhanced (List[torch.Tensor]): A list of length n_srcs, where each tensor has shape [B, T], representing mono audio tensors with T samples.
  - ilens (torch.Tensor): The input lengths, shape [B].
  - additional (OrderedDict): The additional data passed in, returned unchanged; currently unused.
- Return type: Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]
####### Examples
>>> import torch
>>> model = TFGridNetV2(input_dim=128, n_srcs=2)
>>> input_tensor = torch.randn(4, 512, 1) # [B, N, M]
>>> ilens = torch.tensor([512, 512, 512, 512]) # Input lengths
>>> enhanced, ilens_out, _ = model(input_tensor, ilens)
>>> print(len(enhanced)) # Should be equal to n_srcs (e.g., 2)
NOTE
It is recommended to normalize the input tensor before passing it to the model, especially when not using scale-invariant loss functions like SI-SDR. Normalization can be performed as follows:
std_ = torch.std(input, dim=(1, 2), keepdim=True)
input = input / std_
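As a concrete illustration of this note, the sketch below scales a batched mixture by its standard deviation before running the forward pass; the model configuration mirrors the example above and is illustrative, not a prescribed recipe.
>>> import torch
>>> model = TFGridNetV2(input_dim=128, n_srcs=2)
>>> mixture = torch.randn(4, 512, 1)  # [B, N, M]
>>> ilens = torch.tensor([512, 512, 512, 512])
>>> std_ = torch.std(mixture, dim=(1, 2), keepdim=True)  # per-utterance scale, shape [B, 1, 1]
>>> enhanced, ilens_out, _ = model(mixture / std_, ilens)  # variance-normalized input
>>> len(enhanced)  # one enhanced signal per source
2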
- Raises:
  - AssertionError – If the input tensor shape is not as expected or if the number of input microphones is not supported.