espnet2.asr.specaug.specaug.SpecAug

class espnet2.asr.specaug.specaug.SpecAug(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: int | Sequence[int] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: int | Sequence[int] | None = None, time_mask_width_ratio_range: float | Sequence[float] | None = None, num_time_mask: int = 2, replace_with_zero: bool = True)

Bases: AbsSpecAug

SpecAugment module for applying various data augmentation techniques on spectrograms for Automatic Speech Recognition (ASR).

This class implements SpecAugment, which augments input features with three operations: time warping, frequency masking, and time masking. Each operation can be enabled or disabled independently. The augmentations are intended to make ASR models more robust by varying the training data.

Reference: Daniel S. Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”

WARNING

When running on CUDA, time_warp is not guaranteed to be reproducible because it relies on torch.nn.functional.interpolate.

apply_time_warp

Flag to apply time warping.

Type: bool

apply_freq_mask

Flag to apply frequency masking.

Type: bool

apply_time_mask

Flag to apply time masking.

Type: bool

time_warp

Instance of the TimeWarp class for time warping.

Type: TimeWarp

freq_mask

Instance of the MaskAlongAxis class for frequency masking.

Type: MaskAlongAxis

time_mask

Instance of the MaskAlongAxis or MaskAlongAxisVariableMaxWidth class for time masking.

Type: Union[MaskAlongAxis, MaskAlongAxisVariableMaxWidth]

Parameters:
- apply_time_warp (bool) – If True, apply time warping. Defaults to True.
- time_warp_window (int) – Window size for time warping. Defaults to 5.
- time_warp_mode (str) – Interpolation mode for time warping. Defaults to “bicubic”.
- apply_freq_mask (bool) – If True, apply frequency masking. Defaults to True.
- freq_mask_width_range (Union[int, Sequence[int]]) – Range of width for frequency masking. Defaults to (0, 20).
- num_freq_mask (int) – Number of frequency masks to apply. Defaults to 2.
- apply_time_mask (bool) – If True, apply time masking. Defaults to True.
- time_mask_width_range (Optional[Union[int, Sequence[int]]]) – Range of width for time masking. Defaults to None.
- time_mask_width_ratio_range (Optional[Union[float, Sequence[float]]]) – Ratio range for time masking width. Defaults to None.
- num_time_mask (int) – Number of time masks to apply. Defaults to 2.
- replace_with_zero (bool) – If True, replace masked values with zero. Defaults to True.
Raises: ValueError – If none of the augmentation methods are applied, or if both time_mask_width_range and time_mask_width_ratio_range are set.
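
The two documented ValueError conditions can be sketched as a standalone check. This is only an illustration of the rules stated above, not the constructor's actual code; the function name check_specaug_options and the error messages are hypothetical.

```python
def check_specaug_options(
    apply_time_warp: bool = True,
    apply_freq_mask: bool = True,
    apply_time_mask: bool = True,
    time_mask_width_range=None,
    time_mask_width_ratio_range=None,
) -> None:
    """Hypothetical check mirroring the documented Raises clause."""
    # Rule 1: at least one augmentation has to be enabled.
    if not (apply_time_warp or apply_freq_mask or apply_time_mask):
        raise ValueError("at least one of time warp, freq mask, time mask must be applied")
    # Rule 2: the absolute and ratio-based time-mask widths are mutually exclusive.
    if time_mask_width_range is not None and time_mask_width_ratio_range is not None:
        raise ValueError(
            "set only one of time_mask_width_range and time_mask_width_ratio_range"
        )


# Accepted: a single time-mask width option is given.
check_specaug_options(time_mask_width_range=(0, 40))
```

Passing both time_mask_width_range and time_mask_width_ratio_range, or disabling all three apply_* flags, makes this sketch raise ValueError.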

####### Examples

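Create a SpecAug instance. A time-mask width option is passed explicitly here because both time_mask_width_range and time_mask_width_ratio_range default to None.

spec_aug = SpecAug(time_mask_width_range=(0, 40))

Apply augmentation to a batch of features

augmented_features, augmented_lengths = spec_aug.forward(features, lengths)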

NOTE

The augmentations are applied in the following order: time warping, frequency masking, and then time masking.
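
Frequency masking and time masking, the second and third steps in the order above, are the same operation applied along different axes. The following is a minimal sketch of that idea, assuming a (batch, time, freq) tensor. The helper name mask_along_axis_sketch is hypothetical, and the code should not be read as the implementation of MaskAlongAxis.

```python
import torch


def mask_along_axis_sketch(
    spec: torch.Tensor,
    mask_width_range=(0, 20),
    dim: int = 2,
    num_mask: int = 2,
    replace_with_zero: bool = True,
) -> torch.Tensor:
    """Hypothetical helper: mask random bands of spec (batch, time, freq).

    dim=1 masks along time, dim=2 masks along frequency.
    """
    batch, size = spec.shape[0], spec.shape[dim]
    lo, hi = mask_width_range

    # One random width in [lo, hi) and one start index per (batch item, mask).
    width = torch.randint(lo, hi, (batch, num_mask), device=spec.device).unsqueeze(2)
    start = torch.randint(
        0, max(1, size - int(width.max())), (batch, num_mask), device=spec.device
    ).unsqueeze(2)

    # Boolean band per mask, then merge the masks of each batch item.
    pos = torch.arange(size, device=spec.device)[None, None, :]
    mask = ((start <= pos) & (pos < start + width)).any(dim=1)  # (batch, size)

    # Broadcast the band across the axis that is not being masked.
    mask = mask.unsqueeze(2) if dim == 1 else mask.unsqueeze(1)

    fill = 0.0 if replace_with_zero else float(spec.mean())
    return spec.masked_fill(mask, fill)  # out-of-place; spec is left untouched


features = torch.rand(4, 100, 80)
freq_masked = mask_along_axis_sketch(features, (0, 20), dim=2)
time_masked = mask_along_axis_sketch(freq_masked, (0, 40), dim=1)
```

In this sketch each batch item gets its own mask positions, and with replace_with_zero=False the masked region is filled with the mean of the input instead of zero.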

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x, x_lengths=None)

Apply the SpecAugment transformations to the input audio tensor.

This method applies time warping, frequency masking, and time masking to the input tensor x, in that order. A step runs only if it was enabled when the SpecAug instance was constructed.

Parameters:
- x (torch.Tensor) – The input audio tensor with shape (batch_size, num_channels, time_steps, freq_bins).
- x_lengths (Optional[torch.Tensor]) – An optional tensor containing the lengths of the input sequences. It has shape (batch_size,).
Returns: A tuple containing the augmented audio tensor and the updated lengths tensor. If x_lengths is not provided, the second element of the tuple will be None.
Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
Raises: ValueError – If the input tensor x is not of the expected shape.

####### Examples

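>>> import torch
>>> from espnet2.asr.specaug.specaug import SpecAug
>>> # Both time-mask width options default to None, so one is set explicitly
>>> specaug = SpecAug(time_mask_width_range=(0, 40))
>>> audio_tensor = torch.rand(4, 1, 16000, 80)  # (batch_size, num_channels, time_steps, freq_bins)
>>> lengths = torch.tensor([16000, 16000, 16000, 16000])  # One length per batch item
>>> augmented_tensor, augmented_lengths = specaug.forward(audio_tensor, lengths)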

NOTE

The augmentations are performed in the following order:

1. Time warping
2. Frequency masking
3. Time masking
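
The ordering above, together with the documented return value, can be sketched as a small chaining function. This is an illustration only: the name apply_in_order is hypothetical, and representing a disabled step as None is an assumption made for the sketch. The toy stages simply record their names so the order is visible.

```python
from typing import Any, Callable, Optional, Tuple

# A stage takes (x, x_lengths) and returns the same pair, possibly modified.
Stage = Callable[[Any, Optional[Any]], Tuple[Any, Optional[Any]]]


def apply_in_order(
    x: Any,
    x_lengths: Optional[Any] = None,
    time_warp: Optional[Stage] = None,
    freq_mask: Optional[Stage] = None,
    time_mask: Optional[Stage] = None,
) -> Tuple[Any, Optional[Any]]:
    """Hypothetical sketch: run the enabled stages in the documented order."""
    for stage in (time_warp, freq_mask, time_mask):
        if stage is not None:  # assumption: a disabled stage is stored as None
            x, x_lengths = stage(x, x_lengths)
    return x, x_lengths


def tag(name: str) -> Stage:
    """Toy stage that appends its name to a list instead of changing features."""
    def stage(x, x_lengths):
        return x + [name], x_lengths
    return stage


# Frequency masking disabled; no lengths supplied.
trace, lengths_out = apply_in_order(
    [], None, time_warp=tag("time_warp"), time_mask=tag("time_mask")
)
```

Here trace records the stages in the order they ran, and lengths_out stays None because no lengths were passed in, matching the return description of forward.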

WARNING

When using CUDA mode, the time warping may not be reproducible due to torch.nn.functional.interpolate.