espnet2.s2st.synthesizer.translatotron2.GaussianUpsampling
class espnet2.s2st.synthesizer.translatotron2.GaussianUpsampling
Bases: Module
Gaussian Upsample.
This module implements Gaussian upsampling for Non-Attentive Tacotron. The implementation is adapted from the ExpressiveTacotron project.
References:
- Non-Attentive Tacotron: https://arxiv.org/abs/2010.04301
- ExpressiveTacotron: https://github.com/BridgetteSong/ExpressiveTacotron/
mask_score
A constant used to mask out irrelevant weights during the softmax operation.
- Type: float
forward(encoder_outputs, durations, vars, input_lengths=None)
Performs Gaussian upsampling on the provided encoder outputs.
- Parameters:
- encoder_outputs (torch.Tensor) – The encoder outputs of shape [batch_size, hidden_length, dim].
- durations (torch.Tensor) – The phoneme durations of shape [batch_size, hidden_length].
- vars (torch.Tensor) – The phoneme attended ranges of shape [batch_size, hidden_length].
- input_lengths (torch.Tensor, optional) – The lengths of the inputs of shape [batch_size]. Defaults to None.
- Returns: The upsampled encoder outputs of shape [batch_size, frame_length, dim].
- Return type: torch.Tensor
######### Examples
>>> gaussian_upsample = GaussianUpsampling()
>>> encoder_outputs = torch.randn(2, 5, 256) # Example tensor
>>> durations = torch.tensor([[1, 2, 1, 1, 1], [2, 1, 2, 1, 1]])
>>> vars = torch.tensor([[0.1, 0.2, 0.1, 0.1, 0.1], [0.2, 0.1, 0.2, 0.1, 0.1]])
>>> output = gaussian_upsample(encoder_outputs, durations, vars)
>>> output.shape  # frame_length = max summed duration in the batch
torch.Size([2, 7, 256])
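For intuition, the computation can be sketched as follows: each output frame t attends to every phoneme i with a Gaussian weight centered at c_i = cumsum(d)_i - d_i / 2 and spread by the predicted range, and each upsampled frame is the weighted sum of encoder states. The sketch below is an illustration of that idea, not the ESPnet source; it assumes vars holds per-phoneme standard deviations, normalizes the Gaussian scores with a softmax, and omits the mask_score handling of padded positions.

```python
import torch


def gaussian_upsample_sketch(encoder_outputs, durations, vars, eps=1e-8):
    """Minimal sketch of Gaussian upsampling (illustration, not the ESPnet source)."""
    # Total frames per utterance; pad to the batch maximum.
    frame_length = int(durations.sum(dim=1).max())
    t = torch.arange(frame_length, device=encoder_outputs.device).float() + 0.5  # [T]

    # Gaussian centers: midpoint of each phoneme's segment on the frame axis,
    # c_i = cumsum(d)_i - d_i / 2.
    centers = torch.cumsum(durations, dim=1).float() - 0.5 * durations.float()  # [B, L]

    # Squared-distance scores for every (frame, phoneme) pair, scaled by the
    # predicted range, then normalized over phonemes.
    scores = -((t.view(1, -1, 1) - centers.unsqueeze(1)) ** 2) / (
        2.0 * vars.unsqueeze(1) ** 2 + eps
    )  # [B, T, L]
    weights = torch.softmax(scores, dim=2)

    # Upsampled frames are weighted sums of encoder states.
    return weights @ encoder_outputs  # [B, T, dim]
```

With the durations from the example above (row sums 6 and 7), this sketch also yields a [2, 7, 256] output.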
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(encoder_outputs, durations, vars, input_lengths=None)
Gaussian Upsample.
References:
- Non-Attentive Tacotron: https://arxiv.org/abs/2010.04301
- ExpressiveTacotron: https://github.com/BridgetteSong/ExpressiveTacotron/
mask_score
A constant used for masking scores.
Type: float
Parameters:
- encoder_outputs (torch.Tensor) – Encoder outputs with shape [batch_size, hidden_length, dim].
- durations (torch.Tensor) – Phoneme durations with shape [batch_size, hidden_length].
- vars (torch.Tensor) – Phoneme attended ranges with shape [batch_size, hidden_length].
- input_lengths (torch.Tensor, optional) – Lengths of input sequences with shape [batch_size]. Defaults to None.
Returns: Upsampled encoder outputs with shape [batch_size, frame_length, dim].
Return type: torch.Tensor
######### Examples
>>> model = GaussianUpsampling()
>>> encoder_outputs = torch.rand(2, 5, 128) # Example tensor
>>> durations = torch.tensor([[1, 2, 1, 3, 1], [1, 1, 1, 1, 1]])
>>> vars = torch.rand(2, 5)
>>> output = model(encoder_outputs, durations, vars)
>>> print(output.shape) # Output shape will be [2, frame_length, 128]
NOTE
The input_lengths argument is optional and can be used to apply masking to the upsampling process.
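As a usage sketch, padded phoneme positions can be excluded by passing input_lengths; per the attribute description above, positions past each length are masked via the mask_score constant before the softmax, so they receive effectively zero weight. The batch below is hypothetical:

```python
import torch

model = GaussianUpsampling()
encoder_outputs = torch.rand(2, 5, 128)
# Second utterance is padded: only its first three phoneme positions are real.
durations = torch.tensor([[1, 2, 1, 3, 1], [2, 1, 2, 0, 0]])
vars = torch.rand(2, 5)
input_lengths = torch.tensor([5, 3])

# Masked positions are excluded from the softmax, so the upsampled frames
# attend only to valid phonemes.
output = model(encoder_outputs, durations, vars, input_lengths=input_lengths)
```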
get_mask_from_lengths(lengths, max_len=None)
Generate a mask from lengths.
This method creates a boolean mask array that indicates which positions in the sequence should be considered valid based on the provided lengths. The mask has a shape of (batch_size, max_len), where max_len is the maximum length specified or the maximum length found in the lengths array.
- Parameters:
- lengths (np.ndarray) – An array of shape (batch_size,) containing the valid lengths for each sequence in the batch.
- max_len (Optional[int]) – The maximum length for the mask. If not provided, it will be set to the maximum value in lengths.
- Returns: A boolean array of shape (batch_size, max_len), where each row corresponds to a sequence and contains True for valid positions and False for padding positions.
- Return type: np.ndarray
######### Examples
>>> lengths = np.array([3, 5, 2])
>>> mask = get_mask_from_lengths(lengths)
>>> print(mask)
[[ True  True  True False False]
 [ True  True  True  True  True]
 [ True  True False False False]]
>>> mask_with_max_len = get_mask_from_lengths(lengths, max_len=5)
>>> print(mask_with_max_len)
[[ True True True False False]
[ True True True True True]
[ True True False False False]]
NOTE
The returned mask can be used to filter out padding values in sequence data during model training or inference.
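A minimal sketch consistent with the documented signature (an illustration, not necessarily the ESPnet source) builds the mask by broadcasting a position ramp against each length:

```python
import numpy as np


def get_mask_from_lengths_sketch(lengths, max_len=None):
    # Default to the longest sequence in the batch.
    if max_len is None:
        max_len = int(lengths.max())
    # Position p of row b is True iff p < lengths[b].
    ids = np.arange(max_len)                 # [max_len]
    return ids[None, :] < lengths[:, None]   # [batch_size, max_len], bool
```

Called with lengths = np.array([3, 5, 2]), this reproduces the mask shown in the examples above.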