espnet2.layers.utterance_mvn.utterance_mvn


espnet2.layers.utterance_mvn.utterance_mvn(x: Tensor, ilens: Tensor | None = None, norm_means: bool = True, norm_vars: bool = False, eps: float = 1e-20) → Tuple[Tensor, Tensor]

Apply utterance mean and variance normalization.

This function normalizes the input tensor x over the time dimension, separately for each utterance and each feature dimension. It subtracts the mean (when norm_means is True) and divides by the standard deviation, computed from the variance (when norm_vars is True). Both statistics take the lengths in ilens into account, so zero-padded frames do not count toward them.
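The normalization described above can be written out per utterance. This is a sketch under stated assumptions rather than a transcription of the source code: the symbols (T_b for the valid length of utterance b taken from ilens, and the indices b, t, d) are introduced here for illustration, the variance is assumed to be the population estimate (divided by T_b), and eps is assumed to act as a lower bound on the standard deviation.

```latex
\mu_{b,d} = \frac{1}{T_b} \sum_{t=1}^{T_b} x_{b,t,d}
\qquad
\sigma_{b,d} = \sqrt{\frac{1}{T_b} \sum_{t=1}^{T_b} \left(x_{b,t,d} - \mu_{b,d}\right)^2}
\qquad
y_{b,t,d} = \frac{x_{b,t,d} - \mu_{b,d}}{\max\left(\sigma_{b,d},\, \epsilon\right)}
```

With norm_vars=False the denominator is dropped, and with norm_means=False the mean is not subtracted from the output.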

espnet2.layers.utterance_mvn.norm_means

Whether to normalize the means.

Type: bool

espnet2.layers.utterance_mvn.norm_vars

Whether to normalize the variances.

Type: bool

espnet2.layers.utterance_mvn.eps

A small constant to avoid division by zero.

Type: float

Parameters:
- x (torch.Tensor) – Input tensor of shape (B, T, D), where B is the batch size, T is the sequence length, and D is the feature dimension. It is assumed to be zero-padded.
- ilens (torch.Tensor, optional) – Tensor of shape (B,) containing the actual length of each sequence in the batch. If not provided, every sequence is treated as having the full length T, so all frames count toward the statistics.
- norm_means (bool) – Whether to subtract the per-utterance mean. Defaults to True.
- norm_vars (bool) – Whether to divide by the per-utterance standard deviation. Defaults to False.
- eps (float) – Small constant to prevent division by zero during variance normalization. Defaults to 1.0e-20.
Returns: A tuple containing the normalized tensor and the input lengths tensor.
Return type: Tuple[torch.Tensor, torch.Tensor]
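To make the roles of the parameters concrete, here is a minimal reference sketch in plain PyTorch. It is not the ESPnet implementation: the name masked_utterance_mvn is hypothetical, the variance is assumed to be the population estimate over valid frames, eps is assumed to clamp the standard deviation from below, and this sketch zeroes padded frames in its output, which the library function may or may not do.

```python
import torch


def masked_utterance_mvn(x, ilens=None, norm_means=True, norm_vars=False, eps=1.0e-20):
    """Hypothetical reference sketch of length-aware utterance MVN."""
    B, T = x.size(0), x.size(1)
    if ilens is None:
        # No lengths given: treat every frame of every utterance as valid.
        ilens = torch.full((B,), T, dtype=torch.long, device=x.device)

    # valid[b, t, 0] is True for real frames and False for padding: (B, T, 1)
    frame_idx = torch.arange(T, device=x.device)[None, :]
    valid = (frame_idx < ilens.to(x.device)[:, None]).unsqueeze(-1)
    # Number of valid frames per utterance, broadcastable against x: (B, 1, 1)
    n = ilens.to(x.device, x.dtype).view(B, 1, 1)

    x = x * valid                            # force padded frames to zero (out of place)
    mean = x.sum(dim=1, keepdim=True) / n    # (B, 1, D), valid frames only
    centered = (x - mean) * valid            # re-zero padding after centering

    out = centered if norm_means else x
    if norm_vars:
        var = centered.pow(2).sum(dim=1, keepdim=True) / n
        std = torch.clamp(var.sqrt(), min=eps)   # eps guards against division by zero
        out = out / std
    return out, ilens


x = torch.tensor([[[1.0, 2.0], [3.0, 4.0]],
                  [[5.0, 6.0], [0.0, 0.0]]])
ilens = torch.tensor([2, 1])
y, olens = masked_utterance_mvn(x, ilens, norm_vars=True)
```

The output has the same shape as x, and the lengths are passed through unchanged, matching the documented return type.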

Examples

>>> import torch
>>> from espnet2.layers.utterance_mvn import utterance_mvn
>>> x = torch.tensor([[[1.0, 2.0], [3.0, 4.0]],
...                   [[5.0, 6.0], [0.0, 0.0]]])
>>> ilens = torch.tensor([2, 1])
>>> normalized_x, normalized_ilens = utterance_mvn(x, ilens)
>>> print(normalized_x[0])
tensor([[-1., -1.],
        [ 1.,  1.]])
>>> print(normalized_x[1, :1])  # the single valid frame equals its own mean
tensor([[0., 0.]])

NOTE

The input tensor x is expected to be zero-padded along the time dimension. Pass ilens so that padded frames are excluded from the mean and variance; without it, every frame, including padding, is treated as valid and the statistics are skewed by the padding.
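The effect of the note above can be seen with a small self-contained calculation. This sketch uses plain PyTorch operations rather than utterance_mvn itself, and the variable names are chosen here for illustration only.

```python
import torch

# One utterance with 2 valid frames followed by 2 zero-padded frames: (B=1, T=4, D=1)
x = torch.tensor([[[2.0], [4.0], [0.0], [0.0]]])
ilens = torch.tensor([2])

# Mean over valid frames only: divide the time-sum by the true length.
mean_valid = x.sum(dim=1) / ilens.to(x.dtype).unsqueeze(-1)  # 6 / 2 = 3.0

# Mean over all T frames, as if the padding were real data.
mean_all = x.mean(dim=1)                                     # 6 / 4 = 1.5
```

Subtracting mean_all instead of mean_valid would leave the valid frames with a nonzero mean, which is why the lengths matter whenever a batch mixes utterances of different durations.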