espnet2.asr.specaug.specaug.SpecAug
class espnet2.asr.specaug.specaug.SpecAug(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: int | Sequence[int] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: int | Sequence[int] | None = None, time_mask_width_ratio_range: float | Sequence[float] | None = None, num_time_mask: int = 2, replace_with_zero: bool = True)
Bases: AbsSpecAug
SpecAugment module for applying various data augmentation techniques on spectrograms for Automatic Speech Recognition (ASR).
This class implements SpecAugment, which combines three augmentations: time warping, frequency masking, and time masking. Applying them to the input features during training is intended to make ASR models more robust.
Reference: Daniel S. Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”
WARNING
When using CUDA mode, time_warp does not guarantee reproducibility due to torch.nn.functional.interpolate.
apply_time_warp
Flag to apply time warping.
- Type: bool
apply_freq_mask
Flag to apply frequency masking.
- Type: bool
apply_time_mask
Flag to apply time masking.
- Type: bool
time_warp
Instance of the TimeWarp class for time warping.
- Type: TimeWarp
freq_mask
Instance of the MaskAlongAxis class for frequency masking.
- Type: MaskAlongAxis
time_mask
Instance of the MaskAlongAxis or MaskAlongAxisVariableMaxWidth class for time masking.
- Type: Union[MaskAlongAxis, MaskAlongAxisVariableMaxWidth]
Parameters:
- apply_time_warp (bool) – If True, apply time warping. Defaults to True.
- time_warp_window (int) – Window size for time warping. Defaults to 5.
- time_warp_mode (str) – Interpolation mode for time warping. Defaults to “bicubic”.
- apply_freq_mask (bool) – If True, apply frequency masking. Defaults to True.
- freq_mask_width_range (Union[int, Sequence[int]]) – Range of widths for frequency masking. Defaults to (0, 20).
- num_freq_mask (int) – Number of frequency masks to apply. Defaults to 2.
- apply_time_mask (bool) – If True, apply time masking. Defaults to True.
- time_mask_width_range (Optional[Union[int, Sequence[int]]]) – Range of widths for time masking. Mutually exclusive with time_mask_width_ratio_range. Defaults to None.
- time_mask_width_ratio_range (Optional[Union[float, Sequence[float]]]) – Range of time-mask widths, expressed as a ratio of the sequence length. Mutually exclusive with time_mask_width_range. Defaults to None.
- num_time_mask (int) – Number of time masks to apply. Defaults to 2.
- replace_with_zero (bool) – If True, replace masked values with zero. Defaults to True.
Raises: ValueError – If none of the augmentation methods are applied, or if both time_mask_width_range and time_mask_width_ratio_range are set.
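To illustrate how the two mutually exclusive time-mask options relate, the sketch below converts a width-ratio range into a width range in frames for one sequence length. The helper name `width_range_from_ratio`, the floor-based rounding, and the treatment of a single float as an upper bound are assumptions made for illustration; this is not the library's code.

```python
import math
from typing import Sequence, Tuple, Union


def width_range_from_ratio(
    ratio_range: Union[float, Sequence[float]], num_frames: int
) -> Tuple[int, int]:
    """Hypothetical helper: map a width-ratio range to a width range in frames."""
    # Assumption: a single number is the upper bound, with 0.0 as the lower bound.
    if isinstance(ratio_range, (int, float)):
        ratio_range = (0.0, float(ratio_range))
    if len(ratio_range) != 2:
        raise ValueError("ratio_range must be a float or a pair of floats")
    low, high = ratio_range
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0.0 <= low <= high <= 1.0")
    # Assumption: widths are rounded down to whole frames.
    min_width = math.floor(num_frames * low)
    max_width = math.floor(num_frames * high)
    return min_width, max_width


print(width_range_from_ratio((0.0, 0.05), 1000))  # (0, 50)
print(width_range_from_ratio(0.25, 400))  # (0, 100)
```

A ratio range scales the maximum mask width with each utterance, whereas a fixed width range masks the same number of frames regardless of length.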
####### Examples
Create a SpecAug instance, setting the time-mask width range explicitly:
>>> spec_aug = SpecAug(time_mask_width_range=(0, 40))
Apply augmentation to a batch of audio features:
>>> augmented_features, augmented_lengths = spec_aug(features, lengths)
NOTE
The augmentations are applied in the following order: time warping, frequency masking, and then time masking.
forward(x, x_lengths=None)
Apply the SpecAugment transformations to the input feature tensor.
This method applies time warping, frequency masking, and time masking to the input tensor x. Only the augmentations enabled when the SpecAug instance was created are applied, one after another.
- Parameters:
- x (torch.Tensor) – The input feature tensor with shape (batch_size, time_steps, freq_bins) or (batch_size, num_channels, time_steps, freq_bins).
- x_lengths (Optional[torch.Tensor]) – An optional tensor containing the lengths of the input sequences. It has shape (batch_size,).
- Returns: A tuple containing the augmented audio tensor and the updated lengths tensor. If x_lengths is not provided, the second element of the tuple will be None.
- Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
- Raises: ValueError – If the input tensor x is not of the expected shape.
####### Examples
>>> import torch
>>> from espnet2.asr.specaug.specaug import SpecAug
>>> specaug = SpecAug(time_mask_width_range=(0, 40))
>>> audio_tensor = torch.rand(4, 1, 16000, 80)  # (batch, channels, time steps, freq bins)
>>> lengths = torch.tensor([16000, 16000, 16000, 16000])  # valid time steps per item
>>> augmented_tensor, augmented_lengths = specaug(audio_tensor, lengths)
NOTE
The augmentations are performed in the following order:
- Time Warping
- Frequency Masking
- Time Masking
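As a rough illustration of the two masking steps in the order listed above, the following dependency-free sketch zeroes random bands of a small spectrogram stored as nested lists. It is not the library's implementation: the function name `mask_along_axis`, the inclusive width sampling, and the list-based layout are assumptions chosen to keep the logic easy to trace, and time warping is omitted.

```python
import random
from typing import List, Tuple

Spectrogram = List[List[float]]  # indexed as spec[time][freq]


def mask_along_axis(
    spec: Spectrogram,
    axis: str,
    width_range: Tuple[int, int],
    num_mask: int,
    rng: random.Random,
) -> Spectrogram:
    """Zero out `num_mask` random bands along "time" or "freq" (illustrative)."""
    num_frames, num_bins = len(spec), len(spec[0])
    size = num_frames if axis == "time" else num_bins
    out = [row[:] for row in spec]  # copy so the input is left untouched
    for _ in range(num_mask):
        # Assumption: width is drawn from [min, max] inclusive, capped at the axis size.
        width = min(rng.randint(*width_range), size)
        start = rng.randint(0, size - width)
        for t in range(num_frames):
            for f in range(num_bins):
                index = t if axis == "time" else f
                if start <= index < start + width:
                    out[t][f] = 0.0
    return out


rng = random.Random(0)
spec = [[1.0] * 8 for _ in range(20)]  # 20 frames x 8 frequency bins
# Frequency masking first, then time masking.
masked = mask_along_axis(spec, "freq", (0, 2), num_mask=2, rng=rng)
masked = mask_along_axis(masked, "time", (0, 5), num_mask=2, rng=rng)
```

Masks may overlap, so the number of zeroed bins or frames can be smaller than the sum of the sampled widths.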
WARNING
When using CUDA mode, the time warping may not be reproducible due to torch.nn.functional.interpolate.