espnet2.s2st.synthesizer.translatotron2.GaussianUpsampling
class espnet2.s2st.synthesizer.translatotron2.GaussianUpsampling
Bases: Module
Gaussian Upsample.
This module implements Gaussian upsampling for Non-Attentive Tacotron. The implementation is adapted from the ExpressiveTacotron project.
References:
- Non-Attentive Tacotron: https://arxiv.org/abs/2010.04301
- ExpressiveTacotron: https://github.com/BridgetteSong/ExpressiveTacotron/
mask_score
A constant used to mask out irrelevant weights during the softmax operation.
- Type: float
forward(encoder_outputs, durations, vars, input_lengths=None)
Performs Gaussian upsampling on the provided encoder outputs.
- Parameters:
- encoder_outputs (torch.Tensor) – The encoder outputs of shape [batch_size, hidden_length, dim].
- durations (torch.Tensor) – The phoneme durations of shape [batch_size, hidden_length].
- vars (torch.Tensor) – The phoneme attended ranges of shape [batch_size, hidden_length].
- input_lengths (torch.Tensor, optional) – The lengths of the inputs of shape [batch_size]. Defaults to None.
- Returns: The upsampled encoder outputs of shape [batch_size, frame_length, dim].
- Return type: torch.Tensor
######### Examples
>>> gaussian_upsample = GaussianUpsampling()
>>> encoder_outputs = torch.randn(2, 5, 256) # Example tensor
>>> durations = torch.tensor([[1, 2, 1, 1, 1], [2, 1, 2, 1, 1]])
>>> vars = torch.tensor([[0.1, 0.2, 0.1, 0.1, 0.1], [0.2, 0.1, 0.2, 0.1, 0.1]])
>>> output = gaussian_upsample(encoder_outputs, durations, vars)
>>> output.shape  # frame_length = max summed duration in the batch
torch.Size([2, 7, 256])
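For intuition, the computation can be sketched as follows: each output frame t attends to every phoneme i with a Gaussian weight centered at c_i = cumsum(d)_i - d_i / 2 and spread by the predicted range, and each upsampled frame is the weighted sum of encoder states. The sketch below is an illustration of that idea, not the ESPnet source; it assumes vars holds per-phoneme standard deviations, normalizes the Gaussian scores with a softmax, and omits the mask_score handling of padded positions.

```python
import torch


def gaussian_upsample_sketch(encoder_outputs, durations, vars, eps=1e-8):
    """Minimal sketch of Gaussian upsampling (illustration, not the ESPnet source)."""
    # Total frames per utterance; pad to the batch maximum.
    frame_length = int(durations.sum(dim=1).max())
    t = torch.arange(frame_length, device=encoder_outputs.device).float() + 0.5  # [T]

    # Gaussian centers: midpoint of each phoneme's segment on the frame axis,
    # c_i = cumsum(d)_i - d_i / 2.
    centers = torch.cumsum(durations, dim=1).float() - 0.5 * durations.float()  # [B, L]

    # Squared-distance scores for every (frame, phoneme) pair, scaled by the
    # predicted range, then normalized over phonemes.
    scores = -((t.view(1, -1, 1) - centers.unsqueeze(1)) ** 2) / (
        2.0 * vars.unsqueeze(1) ** 2 + eps
    )  # [B, T, L]
    weights = torch.softmax(scores, dim=2)

    # Upsampled frames are weighted sums of encoder states.
    return weights @ encoder_outputs  # [B, T, dim]
```

With the durations from the example above (row sums 6 and 7), this sketch also yields a [2, 7, 256] output.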
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(encoder_outputs, durations, vars, input_lengths=None)
Gaussian Upsample.
References:
- Non-Attentive Tacotron: https://arxiv.org/abs/2010.04301
- ExpressiveTacotron: https://github.com/BridgetteSong/ExpressiveTacotron/
mask_score
A constant used for masking scores.
Type: float
Parameters:
- encoder_outputs (torch.Tensor) – Encoder outputs with shape [batch_size, hidden_length, dim].
- durations (torch.Tensor) – Phoneme durations with shape [batch_size, hidden_length].
- vars (torch.Tensor) – Phoneme attended ranges with shape [batch_size, hidden_length].
- input_lengths (torch.Tensor, optional) – Lengths of input sequences with shape [batch_size]. Defaults to None.
Returns: Upsampled encoder outputs with shape [batch_size, frame_length, dim].
Return type: torch.Tensor
######### Examples
>>> model = GaussianUpsampling()
>>> encoder_outputs = torch.rand(2, 5, 128) # Example tensor
>>> durations = torch.tensor([[1, 2, 1, 3, 1], [1, 1, 1, 1, 1]])
>>> vars = torch.rand(2, 5)
>>> output = model(encoder_outputs, durations, vars)
>>> print(output.shape) # Output shape will be [2, frame_length, 128]
NOTE
The input_lengths argument is optional and can be used to apply masking to the upsampling process.
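As a usage sketch, padded phoneme positions can be excluded by passing input_lengths; per the attribute description above, positions past each length are masked via the mask_score constant before the softmax, so they receive effectively zero weight. The batch below is hypothetical:

```python
import torch

model = GaussianUpsampling()
encoder_outputs = torch.rand(2, 5, 128)
# Second utterance is padded: only its first three phoneme positions are real.
durations = torch.tensor([[1, 2, 1, 3, 1], [2, 1, 2, 0, 0]])
vars = torch.rand(2, 5)
input_lengths = torch.tensor([5, 3])

# Masked positions are excluded from the softmax, so the upsampled frames
# attend only to valid phonemes.
output = model(encoder_outputs, durations, vars, input_lengths=input_lengths)
```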
get_mask_from_lengths(lengths, max_len=None)
Generate a mask from lengths.
This method creates a boolean mask array that indicates which positions in the sequence should be considered valid based on the provided lengths. The mask has a shape of (batch_size, max_len), where max_len is the maximum length specified or the maximum length found in the lengths array.
- Parameters:
- lengths (np.ndarray) – An array of shape (batch_size,) containing the valid lengths for each sequence in the batch.
- max_len (Optional[int]) – The maximum length for the mask. If not provided, it will be set to the maximum value in lengths.
- Returns: A boolean array of shape (batch_size, max_len), where each row corresponds to a sequence and contains True for valid positions and False for padding positions.
- Return type: np.ndarray
######### Examples
>>> lengths = np.array([3, 5, 2])
>>> mask = get_mask_from_lengths(lengths)
>>> print(mask)
[[ True  True  True False False]
 [ True  True  True  True  True]
 [ True  True False False False]]
>>> mask_with_max_len = get_mask_from_lengths(lengths, max_len=5)
>>> print(mask_with_max_len)
[[ True True True False False]
[ True True True True True]
[ True True False False False]]
NOTE
The returned mask can be used to filter out padding values in sequence data during model training or inference.
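A minimal sketch consistent with the documented signature (an illustration, not necessarily the ESPnet source) builds the mask by broadcasting a position ramp against each length:

```python
import numpy as np


def get_mask_from_lengths_sketch(lengths, max_len=None):
    # Default to the longest sequence in the batch.
    if max_len is None:
        max_len = int(lengths.max())
    # Position p of row b is True iff p < lengths[b].
    ids = np.arange(max_len)                 # [max_len]
    return ids[None, :] < lengths[:, None]   # [batch_size, max_len], bool
```

Called with lengths = np.array([3, 5, 2]), this reproduces the mask shown in the examples above.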