espnet2.asr.encoder.avhubert_encoder.time_masking
espnet2.asr.encoder.avhubert_encoder.time_masking(xs_pad, min_T=5, max_T=20)
Mask contiguous frames of audio or video inputs with random lengths.
This function applies random masking to contiguous frames in the input tensor xs_pad, simulating occlusion in audio or video data. The length of the mask is randomly chosen from the range [min_T, max_T].
- Parameters:
- xs_pad (torch.Tensor) – The input tensor of shape (B, D, L), where B is the batch size, D is the number of features, and L is the sequence length.
- min_T (int, optional) – The minimum length of the mask. Default is 5.
- max_T (int, optional) – The maximum length of the mask. Default is 20.
- Returns: The masked input tensor of the same shape as xs_pad.
- Return type: torch.Tensor
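The behaviour described above can be sketched as follows. This is an illustrative re-implementation, not the ESPnet source: the name `time_masking_sketch` is hypothetical, and it assumes that time is the last axis (matching the documented `(B, D, L)` shape), that masked frames are set to zero, and that the mask width is clipped to the sequence length. The actual function may differ on any of these points.

```python
import random

import torch


def time_masking_sketch(xs_pad, min_T=5, max_T=20):
    """Illustrative sketch of contiguous time masking; not the ESPnet source."""
    length = xs_pad.size(-1)  # assumption: time is the last axis, as in (B, D, L)
    mask = torch.ones_like(xs_pad)
    for b in range(xs_pad.size(0)):
        # Draw a width in [min_T, max_T], clipped so it fits in the sequence.
        width = min(random.randint(min_T, max_T), length)
        # Draw a start so that the span [start, start + width) stays in bounds.
        start = random.randint(0, length - width)
        # Assumption: masked frames are zeroed across all features.
        mask[b, ..., start:start + width] = 0.0
    return xs_pad * mask


xs_pad = torch.randn(2, 10, 100)
masked = time_masking_sketch(xs_pad, min_T=3, max_T=10)
```

Because the mask is multiplied into the input, the output keeps the input's shape, dtype, and device.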
Examples
>>> import torch
>>> from espnet2.asr.encoder.avhubert_encoder import time_masking
>>> xs_pad = torch.randn(2, 10, 100)  # batch of 2, 10 features, length 100
>>> masked_output = time_masking(xs_pad, min_T=3, max_T=10)
>>> masked_output.shape  # same shape as the input
torch.Size([2, 10, 100])
NOTE
The masking is applied independently for each batch element.
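One way to see the per-element independence is to compare an input with its masked version and recover the masked span for each batch element. The helper `masked_spans` below is hypothetical, not part of ESPnet. It assumes the documented `(B, D, L)` layout and that a masked frame differs from the input in every feature. The masked tensor `y` is built by hand here so the example does not depend on random draws.

```python
import torch


def masked_spans(original, masked):
    """Return (start, width) of the changed frames for each batch element."""
    # A frame counts as masked when every feature in it differs from the input.
    changed = (original != masked).all(dim=1)  # shape (B, L)
    spans = []
    for row in changed:
        idx = torch.nonzero(row).flatten()
        if idx.numel() == 0:
            spans.append((None, 0))  # nothing was masked in this element
        else:
            spans.append((int(idx[0]), int(idx.numel())))
    return spans


# Deterministic, non-zero input so that zeroed frames always differ from it.
x = torch.arange(1.0, 241.0).reshape(2, 4, 30)
y = x.clone()
y[0, :, 5:12] = 0.0   # element 0: 7 frames starting at index 5
y[1, :, 20:23] = 0.0  # element 1: 3 frames starting at index 20
spans = masked_spans(x, y)  # [(5, 7), (20, 3)]
```

Passing a real `time_masking` output in place of `y` would, under the same assumptions, typically show a different start and width for each batch element.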
- Raises:
- ValueError – If min_T or max_T is less than 1, or if min_T is greater than max_T.
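The conditions listed above can be written as a small pre-check. `validate_mask_range` and `is_valid_mask_range` are hypothetical helpers for illustration only. They mirror the documented conditions and say nothing about how, or whether, the library performs this check internally.

```python
def validate_mask_range(min_T, max_T):
    """Hypothetical pre-check mirroring the documented ValueError conditions."""
    if min_T < 1 or max_T < 1:
        raise ValueError(f"min_T and max_T must be >= 1, got {min_T} and {max_T}")
    if min_T > max_T:
        raise ValueError(f"min_T ({min_T}) must not exceed max_T ({max_T})")


def is_valid_mask_range(min_T, max_T):
    """Return True when the range passes the pre-check, False otherwise."""
    try:
        validate_mask_range(min_T, max_T)
    except ValueError:
        return False
    return True


results = [
    is_valid_mask_range(5, 20),  # the documented defaults
    is_valid_mask_range(0, 20),  # min_T below 1
    is_valid_mask_range(10, 3),  # min_T greater than max_T
]
```

Running a check like this before calling `time_masking` surfaces a bad range with a clear message, independent of the library's own behaviour.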