espnet2.s2st.synthesizer.translatotron2.DurationPredictor
class espnet2.s2st.synthesizer.translatotron2.DurationPredictor(cfg)
Bases: Module
Non-Attentive Tacotron (NAT) Duration Predictor module.
This module predicts the duration of phonemes based on the encoder outputs. It utilizes a bidirectional LSTM to model the temporal dependencies in the input sequences, followed by a linear layer to output the duration predictions.
lstm
Bidirectional LSTM layer for duration prediction.
- Type: nn.LSTM
proj
Linear layer to project LSTM outputs to duration values.
- Type: LinearNorm
relu
ReLU activation function.
- Type: nn.ReLU
Parameters: cfg –
Configuration object containing parameters for the duration predictor. Expected attributes in cfg include:
- units (int): Number of input features for the LSTM.
- duration_lstm_dim (int): Dimension of the LSTM output features.
Returns: Duration predictions of shape [batch_size, hidden_length].
Return type: Tensor
Raises: ValueError – If the input lengths are not compatible with the encoder outputs.
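To make the configuration concrete, here is a minimal sketch of how the constructor can be wired from the cfg attributes above. It assumes duration_lstm_dim is split evenly across the two LSTM directions and substitutes a plain nn.Linear for ESPnet's LinearNorm; treat it as an illustration, not the exact implementation.

```python
import torch.nn as nn

class DurationPredictorSketch(nn.Module):
    """Illustrative sketch; the real class reads these sizes from cfg."""

    def __init__(self, cfg):
        super().__init__()
        # Bidirectional LSTM over encoder frames; each direction
        # contributes half of duration_lstm_dim (assumed split).
        self.lstm = nn.LSTM(
            cfg.units,
            cfg.duration_lstm_dim // 2,
            num_layers=1,
            batch_first=True,
            bidirectional=True,
        )
        # Project each frame's LSTM state to a single duration value
        # (stand-in for ESPnet's LinearNorm).
        self.proj = nn.Linear(cfg.duration_lstm_dim, 1)
        # ReLU keeps predicted durations non-negative.
        self.relu = nn.ReLU()
```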
####### Examples
>>> import torch
>>> from types import SimpleNamespace
>>> cfg = SimpleNamespace(units=256, duration_lstm_dim=512)  # hypothetical config values
>>> predictor = DurationPredictor(cfg)
>>> encoder_outputs = torch.randn(16, 50, cfg.units)  # [batch, hidden_length, units]
>>> durations = predictor(encoder_outputs)
>>> print(durations.shape)  # torch.Size([16, 50])
forward(encoder_outputs, input_lengths=None)
Forward Duration Predictor.
This method processes the encoder outputs through the bidirectional LSTM to predict a duration for each phoneme. It optionally accepts input lengths so that padded positions in the encoder outputs are excluded from the computation, as sketched below.
- Parameters:
- encoder_outputs (torch.Tensor) – A tensor of shape [batch_size, hidden_length, encoder_lstm_dim] representing the outputs from the encoder.
- input_lengths (torch.Tensor, optional) – A tensor of shape [batch_size] that specifies the actual length of each sequence in the batch, used to handle padding. Defaults to None.
- Returns: A tensor of shape [batch_size, hidden_length] that contains the predicted durations for each phoneme.
- Return type: torch.Tensor
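The role of input_lengths is easiest to see in code. Below is a minimal sketch of the forward pass assuming the usual pack_padded_sequence / pad_packed_sequence pattern for skipping padded frames; the packing details are assumptions, not taken from this page.

```python
import torch.nn as nn

def forward(self, encoder_outputs, input_lengths=None):
    if input_lengths is not None:
        # Pack the batch so the LSTM skips padded frames (assumed strategy).
        encoder_outputs = nn.utils.rnn.pack_padded_sequence(
            encoder_outputs, input_lengths.cpu(),
            batch_first=True, enforce_sorted=False,
        )
    outputs, _ = self.lstm(encoder_outputs)
    if input_lengths is not None:
        # Restore the padded [batch_size, hidden_length, ...] layout.
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
    # [B, T, duration_lstm_dim] -> [B, T, 1] -> [B, T]; ReLU keeps durations >= 0.
    return self.relu(self.proj(outputs)).squeeze(-1)
```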
####### Examples
>>> from types import SimpleNamespace
>>> cfg = SimpleNamespace(units=256, duration_lstm_dim=512)  # hypothetical config values
>>> duration_predictor = DurationPredictor(cfg)
>>> encoder_outputs = torch.randn(16, 50, cfg.units)  # [batch, hidden_length, units]
>>> input_lengths = torch.tensor([50] * 16)  # no padding in this batch
>>> durations = duration_predictor(encoder_outputs, input_lengths)
>>> print(durations.shape)  # torch.Size([16, 50])