espnet2.gan_tts.jets.alignments.AlignmentModule
class espnet2.gan_tts.jets.alignments.AlignmentModule(adim, odim, cache_prior=True)
Bases: Module
Alignment Learning Framework proposed for parallel TTS models in:
https://arxiv.org/abs/2108.10447
This module learns the alignment between text and acoustic features and computes the log-probability attention matrix used for the alignment (duration) loss when training parallel Text-to-Speech (TTS) models.
cache_prior
Whether to cache beta-binomial prior.
- Type: bool
_cache
A cache to store precomputed prior values for efficiency.
- Type: dict
t_conv1
1D convolution layer for text features.
- Type: nn.Conv1d
t_conv2
1D convolution layer for text features.
- Type: nn.Conv1d
f_conv1
1D convolution layer for acoustic features.
- Type: nn.Conv1d
f_conv2
1D convolution layer for acoustic features.
- Type: nn.Conv1d
f_conv3
1D convolution layer for acoustic features.
- Type: nn.Conv1d
Parameters:
- adim (int) – Dimension of attention.
- odim (int) – Dimension of feats.
- cache_prior (bool) – Whether to cache beta-binomial prior.
####### Examples
>>> alignment_module = AlignmentModule(adim=256, odim=80)
>>> text = torch.randn(4, 10, 256) # Batch of 4, 10 time steps, adim=256
>>> feats = torch.randn(4, 20, 80) # Batch of 4, 20 time steps, odim=80
>>> text_lengths = torch.tensor([10, 10, 10, 10])
>>> feats_lengths = torch.tensor([20, 20, 20, 20])
>>> log_p_attn = alignment_module(text, feats, text_lengths, feats_lengths)
>>> print(log_p_attn.shape) # Output shape: (4, 20, 10)
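The `cache_prior` option refers to a beta-binomial prior over alignment positions, which biases the attention matrix toward the diagonal. Below is a minimal standalone sketch of how such a prior can be computed for one utterance, using only numpy and scipy; the helper name `betabinom_log_prior` and the scaling factor `w` are illustrative and not part of the module's actual API.

```python
import numpy as np
from scipy.stats import betabinom

def betabinom_log_prior(T_feats: int, T_text: int, w: float = 1.0) -> np.ndarray:
    """Beta-binomial alignment log-prior for a single utterance.

    Returns an array of shape (T_feats, T_text): the row for feature
    frame t is the log-PMF of BetaBinomial(n=T_text - 1, alpha, beta),
    which concentrates probability mass near the diagonal of the
    alignment matrix. `w` scales how sharply the prior peaks.
    """
    k = np.arange(T_text)            # text positions 0 .. T_text - 1
    rows = []
    for t in range(1, T_feats + 1):  # feature frames 1 .. T_feats
        alpha = w * t
        beta = w * (T_feats - t + 1)
        rows.append(betabinom.logpmf(k, T_text - 1, alpha, beta))
    return np.stack(rows)            # (T_feats, T_text)

prior = betabinom_log_prior(T_feats=20, T_text=10)
print(prior.shape)  # (20, 10)
```

Because the prior depends only on the length pair `(T_feats, T_text)`, caching it (as `cache_prior=True` does) avoids recomputing the same table for utterances of identical lengths.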
Initialize AlignmentModule.
- Parameters:
- adim (int) – Dimension of attention.
- odim (int) – Dimension of feats.
- cache_prior (bool) – Whether to cache beta-binomial prior.
forward(text, feats, text_lengths, feats_lengths, x_masks=None)
Calculate alignment loss.
This method computes the log probability of the attention matrix from the input text embeddings and acoustic features. Both inputs are passed through their 1D convolution layers, and a score is computed from the Euclidean distance between each acoustic frame and each text embedding. If a mask tensor is provided, padded text positions are masked out, and a beta-binomial prior is added to the log probabilities before the final result is returned.
- Parameters:
- text (Tensor) – Batched text embedding (B, T_text, adim).
- feats (Tensor) – Batched acoustic feature (B, T_feats, odim).
- text_lengths (Tensor) – Text length tensor (B,).
- feats_lengths (Tensor) – Feature length tensor (B,).
- x_masks (Tensor, optional) – Mask tensor (B, T_text). Defaults to None.
- Returns: Log probability of attention matrix (B, T_feats, T_text).
- Return type: Tensor
####### Examples
>>> text = torch.randn(2, 5, 256) # Batch of 2, T_text=5, adim=256
>>> feats = torch.randn(2, 10, 80) # Batch of 2, T_feats=10, odim=80
>>> text_lengths = torch.tensor([5, 3]) # Lengths of each text
>>> feats_lengths = torch.tensor([10, 8]) # Lengths of each feature
>>> model = AlignmentModule(adim=256, odim=80)
>>> log_p_attn = model(text, feats, text_lengths, feats_lengths)
>>> print(log_p_attn.shape) # Should print: torch.Size([2, 10, 5])
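The distance-based score described above can be sketched without torch as follows. This is a simplified illustration, assuming the text and acoustic features have already been projected to a shared dimension; the actual module applies its learned Conv1d layers (`t_conv1`/`t_conv2`, `f_conv1`–`f_conv3`) before this step, and the function name `log_attn_scores` is hypothetical.

```python
import numpy as np

def log_attn_scores(text_emb: np.ndarray, feats_emb: np.ndarray) -> np.ndarray:
    """Distance-based attention log-probabilities for one utterance.

    text_emb:  (T_text, adim)  projected text embeddings
    feats_emb: (T_feats, adim) projected acoustic features
    Returns (T_feats, T_text): the negative squared Euclidean
    distance between each frame/token pair, log-softmax-normalized
    over the text axis.
    """
    # diff[t, j, :] = feats_emb[t] - text_emb[j]
    diff = feats_emb[:, None, :] - text_emb[None, :, :]
    score = -np.sum(diff ** 2, axis=-1)          # (T_feats, T_text)
    # numerically stable log-softmax over text positions
    score -= score.max(axis=1, keepdims=True)
    return score - np.log(np.exp(score).sum(axis=1, keepdims=True))

log_p = log_attn_scores(np.random.randn(5, 4), np.random.randn(10, 4))
print(log_p.shape)  # (10, 5)
```

Each row of the result is a valid log-distribution over text positions for one acoustic frame, matching the `(B, T_feats, T_text)` layout of `log_p_attn` (here without the batch axis).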