espnet2.asr.encoder.avhubert_encoder.time_masking

espnet2.asr.encoder.avhubert_encoder.time_masking(xs_pad, min_T=5, max_T=20)

Mask contiguous frames of audio or video inputs with random lengths.

This function applies random masking to contiguous frames in the input tensor xs_pad, simulating occlusion in audio or video data. The length of the mask is randomly chosen from the range [min_T, max_T].

Parameters:
- xs_pad (torch.Tensor) – The input tensor of shape (B, D, L), where B is the batch size, D is the number of features, and L is the sequence length.
- min_T (int, optional) – The minimum length of the mask. Default is 5.
- max_T (int, optional) – The maximum length of the mask. Default is 20.
Returns: The masked input tensor of the same shape as xs_pad.
Return type: torch.Tensor

Examples

>>> import torch
>>> from espnet2.asr.encoder.avhubert_encoder import time_masking
>>> xs_pad = torch.randn(2, 10, 100)  # batch of 2, 10 features, length 100
>>> masked_output = time_masking(xs_pad, min_T=3, max_T=10)
>>> masked_output.shape  # the output shape matches the input
torch.Size([2, 10, 100])
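
To make the sampling step concrete, here is a minimal, standard-library-only sketch of how one contiguous span might be chosen and masked for a single sequence. It is not the ESPnet implementation: the helper names sample_mask_span and mask_sequence are hypothetical, and clipping the width to the sequence length and filling masked frames with 0.0 are assumptions made for illustration.

```python
import random


def sample_mask_span(length, min_T=5, max_T=20, rng=random):
    """Pick (start, width) of one contiguous span in a sequence of `length` frames.

    Hypothetical helper for illustration only.
    """
    # Assumption: the width is drawn uniformly from the inclusive range
    # [min_T, max_T] and clipped so it never exceeds the sequence length.
    width = min(rng.randint(min_T, max_T), length)
    # The start is chosen so the whole span fits inside the sequence.
    start = rng.randint(0, length - width)
    return start, width


def mask_sequence(frames, min_T=5, max_T=20, rng=random):
    """Return a copy of `frames` with one contiguous span replaced by 0.0.

    Assumption: masked frames are filled with zeros.
    """
    start, width = sample_mask_span(len(frames), min_T, max_T, rng)
    masked = list(frames)
    masked[start:start + width] = [0.0] * width
    return masked


rng = random.Random(0)  # seeded only to make the demo repeatable
frames = [1.0] * 30
masked = mask_sequence(frames, min_T=3, max_T=10, rng=rng)
print(masked)
```

Calling mask_sequence once per batch element, each with its own draw from the random generator, would give every sequence in a batch a different span.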

NOTE

The masking is applied independently for each batch element.

Raises:
- ValueError – If min_T or max_T is less than 1, or if min_T is greater than max_T.
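
The conditions listed under Raises can be expressed as a small standalone check. The sketch below only mirrors the documented conditions; check_mask_bounds is a hypothetical name, and the error messages are illustrative rather than the ones ESPnet emits.

```python
def check_mask_bounds(min_T, max_T):
    """Validate mask-length bounds as described in the Raises section.

    Hypothetical helper; not part of espnet2.
    """
    if min_T < 1 or max_T < 1:
        raise ValueError(f"min_T and max_T must be >= 1, got {min_T} and {max_T}")
    if min_T > max_T:
        raise ValueError(f"min_T ({min_T}) must not exceed max_T ({max_T})")


check_mask_bounds(5, 20)  # the default bounds pass silently

try:
    check_mask_bounds(10, 3)  # min_T greater than max_T
except ValueError as err:
    print(f"rejected: {err}")
```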