espnet2.gan_tts.jets.alignments.AlignmentModule
class espnet2.gan_tts.jets.alignments.AlignmentModule(adim, odim, cache_prior=True)
Bases: Module
Alignment Learning Framework proposed for parallel TTS models in:
https://arxiv.org/abs/2108.10447
This module learns the alignment between text and acoustic features and computes the log-probability attention matrix used for the alignment (duration) loss when training parallel Text-to-Speech (TTS) models.
cache_prior
Whether to cache beta-binomial prior.
- Type: bool
_cache
A cache to store precomputed prior values for efficiency.
- Type: dict
t_conv1
1D convolution layer for text features.
- Type: nn.Conv1d
t_conv2
1D convolution layer for text features.
- Type: nn.Conv1d
f_conv1
1D convolution layer for acoustic features.
- Type: nn.Conv1d
f_conv2
1D convolution layer for acoustic features.
- Type: nn.Conv1d
f_conv3
1D convolution layer for acoustic features.
- Type: nn.Conv1d
Parameters:
- adim (int) – Dimension of attention.
- odim (int) – Dimension of feats.
- cache_prior (bool) – Whether to cache beta-binomial prior.
####### Examples
>>> alignment_module = AlignmentModule(adim=256, odim=80)
>>> text = torch.randn(4, 10, 256) # Batch of 4, 10 time steps, adim=256
>>> feats = torch.randn(4, 20, 80) # Batch of 4, 20 time steps, odim=80
>>> text_lengths = torch.tensor([10, 10, 10, 10])
>>> feats_lengths = torch.tensor([20, 20, 20, 20])
>>> log_p_attn = alignment_module(text, feats, text_lengths, feats_lengths)
>>> print(log_p_attn.shape) # Output shape: (4, 20, 10)
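The `cache_prior` option refers to a beta-binomial prior over alignment positions, which biases the attention matrix toward the diagonal. Below is a minimal standalone sketch of how such a prior can be computed for one utterance, using only numpy and scipy; the helper name `betabinom_log_prior` and the scaling factor `w` are illustrative and not part of the module's actual API.

```python
import numpy as np
from scipy.stats import betabinom

def betabinom_log_prior(T_feats: int, T_text: int, w: float = 1.0) -> np.ndarray:
    """Beta-binomial alignment log-prior for a single utterance.

    Returns an array of shape (T_feats, T_text): the row for feature
    frame t is the log-PMF of BetaBinomial(n=T_text - 1, alpha, beta),
    which concentrates probability mass near the diagonal of the
    alignment matrix. `w` scales how sharply the prior peaks.
    """
    k = np.arange(T_text)            # text positions 0 .. T_text - 1
    rows = []
    for t in range(1, T_feats + 1):  # feature frames 1 .. T_feats
        alpha = w * t
        beta = w * (T_feats - t + 1)
        rows.append(betabinom.logpmf(k, T_text - 1, alpha, beta))
    return np.stack(rows)            # (T_feats, T_text)

prior = betabinom_log_prior(T_feats=20, T_text=10)
print(prior.shape)  # (20, 10)
```

Because the prior depends only on the length pair `(T_feats, T_text)`, caching it (as `cache_prior=True` does) avoids recomputing the same table for utterances of identical lengths.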
Initialize AlignmentModule.
- Parameters:
- adim (int) – Dimension of attention.
- odim (int) – Dimension of feats.
- cache_prior (bool) – Whether to cache beta-binomial prior.
forward(text, feats, text_lengths, feats_lengths, x_masks=None)
Calculate alignment loss.
This method computes the log probability of the attention matrix from the input text embeddings and acoustic features. Both inputs are passed through their 1D convolution layers, and a score is computed from the Euclidean distance between each acoustic frame and each text embedding. If a mask tensor is provided, padded text positions are masked out, and a beta-binomial prior is added to the log probabilities before the final result is returned.
- Parameters:
- text (Tensor) – Batched text embedding (B, T_text, adim).
- feats (Tensor) – Batched acoustic feature (B, T_feats, odim).
- text_lengths (Tensor) – Text length tensor (B,).
- feats_lengths (Tensor) – Feature length tensor (B,).
- x_masks (Tensor, optional) – Mask tensor (B, T_text). Defaults to None.
- Returns: Log probability of attention matrix (B, T_feats, T_text).
- Return type: Tensor
####### Examples
>>> text = torch.randn(2, 5, 256) # Batch of 2, T_text=5, adim=256
>>> feats = torch.randn(2, 10, 80) # Batch of 2, T_feats=10, odim=80
>>> text_lengths = torch.tensor([5, 3]) # Lengths of each text
>>> feats_lengths = torch.tensor([10, 8]) # Lengths of each feature
>>> model = AlignmentModule(adim=256, odim=80)
>>> log_p_attn = model(text, feats, text_lengths, feats_lengths)
>>> print(log_p_attn.shape) # Should print: torch.Size([2, 10, 5])
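The distance-based score described above can be sketched without torch as follows. This is a simplified illustration, assuming the text and acoustic features have already been projected to a shared dimension; the actual module applies its learned Conv1d layers (`t_conv1`/`t_conv2`, `f_conv1`–`f_conv3`) before this step, and the function name `log_attn_scores` is hypothetical.

```python
import numpy as np

def log_attn_scores(text_emb: np.ndarray, feats_emb: np.ndarray) -> np.ndarray:
    """Distance-based attention log-probabilities for one utterance.

    text_emb:  (T_text, adim)  projected text embeddings
    feats_emb: (T_feats, adim) projected acoustic features
    Returns (T_feats, T_text): the negative squared Euclidean
    distance between each frame/token pair, log-softmax-normalized
    over the text axis.
    """
    # diff[t, j, :] = feats_emb[t] - text_emb[j]
    diff = feats_emb[:, None, :] - text_emb[None, :, :]
    score = -np.sum(diff ** 2, axis=-1)          # (T_feats, T_text)
    # numerically stable log-softmax over text positions
    score -= score.max(axis=1, keepdims=True)
    return score - np.log(np.exp(score).sum(axis=1, keepdims=True))

log_p = log_attn_scores(np.random.randn(5, 4), np.random.randn(10, 4))
print(log_p.shape)  # (10, 5)
```

Each row of the result is a valid log-distribution over text positions for one acoustic frame, matching the `(B, T_feats, T_text)` layout of `log_p_attn` (here without the batch axis).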