espnet2.gan_tts.jets.alignments.average_by_duration

espnet2.gan_tts.jets.alignments.average_by_duration(ds, xs, text_lengths, feats_lengths)

Average frame-level features into token-level according to durations.

This function takes per-token durations (in frames) and a frame-level feature sequence, and computes, for each token, the mean of the frames assigned to that token. Consecutive durations partition the feature sequence: the first token covers the first ds[0] frames, the second token covers the next ds[1] frames, and so on. It is useful wherever frame-level features must be aggregated to token level, as in text-to-speech models.

Parameters:
- ds (Tensor) – Batched token duration (B, T_text).
- xs (Tensor) – Batched feature sequences to be averaged (B, T_feats).
- text_lengths (Tensor) – Text length tensor (B,).
- feats_lengths (Tensor) – Feature length tensor (B,).
Returns: Batched features averaged over each token's duration (B, T_text).
Return type: Tensor

Examples

>>> import torch
>>> from espnet2.gan_tts.jets.alignments import average_by_duration
>>> # Each row of ds sums to the matching entry of feats_lengths.
>>> ds = torch.tensor([[2.0, 3.0], [1.0, 4.0]])
>>> xs = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0],
...                    [2.0, 4.0, 6.0, 8.0, 10.0]])
>>> text_lengths = torch.tensor([2, 2])
>>> feats_lengths = torch.tensor([5, 5])
>>> result = average_by_duration(ds, xs, text_lengths, feats_lengths)
>>> print(result)
tensor([[1.5000, 4.0000],
        [2.0000, 7.0000]])
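
The duration-based averaging can also be illustrated without any dependencies. The following is a minimal sketch in plain Python, not the library's implementation: `average_by_duration_sketch` is a hypothetical name, nested lists stand in for tensors, and assigning 0.0 to a token with zero duration is an assumption made here for illustration.

```python
from itertools import accumulate


def average_by_duration_sketch(ds, xs, text_lengths, feats_lengths):
    """Illustrative only: mean of the frames covered by each token."""
    averaged = []
    for d_row, x_row, t_text, t_feats in zip(ds, xs, text_lengths, feats_lengths):
        d = [int(v) for v in d_row[:t_text]]  # drop duration padding
        x = x_row[:t_feats]                   # drop feature padding
        # Token n covers the frames x[bounds[n]:bounds[n + 1]].
        bounds = [0] + list(accumulate(d))
        row = [0.0] * len(d_row)              # padded positions stay 0.0
        for n, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
            span = x[start:end]
            # Assumption: a token with no frames is given the value 0.0.
            row[n] = sum(span) / len(span) if span else 0.0
        averaged.append(row)
    return averaged


ds = [[2, 3], [1, 4]]
xs = [[1.0, 2.0, 3.0, 4.0, 5.0],
      [2.0, 4.0, 6.0, 8.0, 10.0]]
result = average_by_duration_sketch(ds, xs, [2, 2], [5, 5])
# → [[1.5, 4.0], [2.0, 7.0]]
```

In the first row, the token with duration 2 averages frames 1.0 and 2.0, and the token with duration 3 averages frames 3.0, 4.0, and 5.0. Slicing by text_lengths and feats_lengths keeps padded positions out of the means.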