espnet2.s2st.synthesizer.translatotron.Translatotron
class espnet2.s2st.synthesizer.translatotron.Translatotron(idim: int, odim: int, embed_dim: int = 512, atype: str = 'multihead', adim: int = 512, aheads: int = 4, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 4, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 32, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: str | None = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 2, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'concat', dropout_rate: float = 0.5, zoneout_rate: float = 0.1)
Bases: AbsSynthesizer
Translatotron synthesizer-related modules for speech-to-speech translation.
This module is part of the spectrogram prediction network in Translatotron, described in Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model, which converts the sequence of encoder hidden states into the sequence of Mel-filterbank features.
idim
Dimension of the inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
atype
Type of attention mechanism.
- Type: str
cumulate_att_w
Whether to cumulate previous attention weight.
- Type: bool
reduction_factor
Reduction factor for outputs.
- Type: int
output_activation_fn
Activation function for the output.
- Type: callable, optional
padding_idx
Index used for padding in input sequences.
- Type: int
spks
Number of speakers.
- Type: Optional[int]
langs
Number of languages.
- Type: Optional[int]
spk_embed_dim
Dimension of speaker embeddings.
- Type: Optional[int]
sid_emb
Embedding layer for speaker IDs.
- Type: torch.nn.Embedding, optional
lid_emb
Embedding layer for language IDs.
- Type: torch.nn.Embedding, optional
projection
Linear projection for speaker embeddings.
- Type: torch.nn.Linear, optional
Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of the token embedding (default=512).
- atype (str) – Type of attention (default=”multihead”).
- adim (int) – Number of dimensions of MLP in attention (default=512).
- aheads (int) – Number of attention heads (default=4).
- aconv_chans (int) – Number of attention convolution filter channels (default=32).
- aconv_filts (int) – Size of attention convolution filter (default=15).
- cumulate_att_w (bool) – Whether to cumulate previous attention weight (default=True).
- dlayers (int) – Number of decoder LSTM layers (default=4).
- dunits (int) – Number of decoder LSTM units (default=1024).
- prenet_layers (int) – Number of prenet layers (default=2).
- prenet_units (int) – Number of prenet units (default=32).
- postnet_layers (int) – Number of postnet layers (default=5).
- postnet_chans (int) – Number of postnet filter channels (default=512).
- postnet_filts (int) – Size of postnet filter (default=5).
- output_activation (Optional[str]) – Name of activation function for outputs (default=None).
- use_batch_norm (bool) – Whether to use batch normalization (default=True).
- use_concate (bool) – Whether to concatenate encoder outputs with decoder LSTM outputs (default=True).
- use_residual (bool) – Whether to use residual connections (default=False).
- reduction_factor (int) – Reduction factor (default=2).
- spks (Optional[int]) – Number of speakers (default=None).
- langs (Optional[int]) – Number of languages (default=None).
- spk_embed_dim (Optional[int]) – Speaker embedding dimension (default=None).
- spk_embed_integration_type (str) – How to integrate speaker embedding (default=”concat”).
- dropout_rate (float) – Dropout rate (default=0.5).
- zoneout_rate (float) – Zoneout rate (default=0.1).
####### Examples
>>> import torch
>>> from espnet2.s2st.synthesizer.translatotron import Translatotron
>>> model = Translatotron(idim=80, odim=80)
>>> enc_outputs = torch.randn(10, 100, 80)  # example encoder outputs (B, T, idim)
>>> enc_outputs_lengths = torch.randint(1, 100, (10,))
>>> feats = torch.randn(10, 50, 80)  # example target features (B, T_feats, odim)
>>> feats_lengths = torch.randint(1, 50, (10,))
>>> outputs = model(enc_outputs, enc_outputs_lengths, feats, feats_lengths)
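A hedged sketch of a multi-speaker, multilingual configuration, reusing the tensors above; the spks, langs, and spk_embed_dim values here are illustrative assumptions, not defaults:
>>> model = Translatotron(
...     idim=80, odim=80, spks=4, langs=2,
...     spk_embed_dim=192, spk_embed_integration_type="concat",
... )
>>> sids = torch.randint(0, 4, (10, 1))  # speaker IDs (B, 1)
>>> lids = torch.randint(0, 2, (10, 1))  # language IDs (B, 1)
>>> spembs = torch.randn(10, 192)  # speaker embeddings (B, spk_embed_dim)
>>> outputs = model(enc_outputs, enc_outputs_lengths, feats, feats_lengths,
...                 spembs=spembs, sids=sids, lids=lids)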
- Raises:
- ValueError – If an unsupported activation function is provided.
- NotImplementedError – If an unsupported attention type is provided.
Initialize Translatotron module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- adim (int) – Number of dimensions of the MLP in attention.
- atype (str) – Type of attention.
- aconv_chans (int) – Number of attention conv filter channels.
- aconv_filts (int) – Size of the attention conv filter.
- embed_dim (int) – Dimension of the token embedding.
- dlayers (int) – Number of decoder LSTM layers.
- dunits (int) – Number of decoder LSTM units.
- prenet_layers (int) – Number of prenet layers.
- prenet_units (int) – Number of prenet units.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Size of the postnet filter.
- postnet_chans (int) – Number of postnet filter channels.
- output_activation (str) – Name of activation function for outputs.
- cumulate_att_w (bool) – Whether to cumulate previous attention weight.
- use_batch_norm (bool) – Whether to use batch normalization.
- use_concate (bool) – Whether to concat enc outputs w/ dec lstm outputs.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that sids will be provided as input and use the sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that lids will be provided as input and use the lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- dropout_rate (float) – Dropout rate.
- zoneout_rate (float) – Zoneout rate.
forward(enc_outputs: Tensor, enc_outputs_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None) → Tuple[Tensor, Tensor]
Calculate forward propagation.
This method performs the forward pass through the Translatotron synthesizer, processing the encoder outputs and generating the corresponding target features. It also computes the attention weights and stop labels.
- Parameters:
- enc_outputs (Tensor) – Batch of padded encoder output sequences (B, T, idim).
- enc_outputs_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- Returns: A tuple containing:
  - after_outs: Output features after postnet refinement.
  - before_outs: Output features before postnet refinement.
  - logits: Logits for stop prediction.
  - att_ws: Attention weights.
  - ys: Ground truth features.
  - labels: Labels for stop prediction.
  - olens: Lengths of the output sequences.
- Return type: Tuple[torch.Tensor, ...]
NOTE
The method assumes that input tensors are properly padded and that their lengths are provided. The maximum lengths of the inputs are used to slice the tensors for processing.
####### Examples
>>> model = Translatotron(idim=80, odim=80)
>>> enc_outputs = torch.randn(2, 10, 80)  # batch of 2 encoder output sequences
>>> enc_outputs_lengths = torch.tensor([10, 8])
>>> feats = torch.randn(2, 20, 80)  # batch of 2 target feature sequences
>>> feats_lengths = torch.tensor([20, 18])
>>> after_outs, before_outs, logits, att_ws, ys, labels, olens = model(
...     enc_outputs, enc_outputs_lengths, feats, feats_lengths)
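The returned tensors can be plugged into a spectrogram and stop-token loss. A minimal, hedged sketch using plain unmasked losses for illustration only; this is not the exact loss ESPnet uses during training:
>>> import torch.nn.functional as F
>>> l1 = F.l1_loss(after_outs, ys) + F.l1_loss(before_outs, ys)
>>> bce = F.binary_cross_entropy_with_logits(logits, labels)
>>> loss = l1 + bce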
- Raises: ValueError – If the provided activation function name is invalid.
inference(enc_outputs: Tensor, feats: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequence of encoder outputs.
- Parameters:
- enc_outputs (Tensor) – Encoder output sequence (N, idim).
- feats (Optional[Tensor]) – Feature sequence to extract style (N, odim).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- threshold (float) – Threshold in inference.
- minlenratio (float) – Minimum length ratio in inference.
- maxlenratio (float) – Maximum length ratio in inference.
- use_att_constraint (bool) – Whether to apply attention constraint.
- backward_window (int) – Backward window in attention constraint.
- forward_window (int) – Forward window in attention constraint.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - prob (Tensor): Output sequence of stop probabilities (T_feats,).
  - att_w (Tensor): Attention weights (T_feats, T).
- Return type: Dict[str, Tensor]
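####### Examples
A minimal inference sketch, assuming a single utterance of 80-dimensional encoder outputs; with an untrained model the generated features are meaningless, so this only illustrates the call and the returned dict:
>>> import torch
>>> from espnet2.s2st.synthesizer.translatotron import Translatotron
>>> model = Translatotron(idim=80, odim=80)
>>> enc_outputs = torch.randn(100, 80)  # single utterance (N, idim)
>>> out = model.inference(enc_outputs, threshold=0.5, maxlenratio=10.0)
>>> feat_gen, prob, att_w = out["feat_gen"], out["prob"], out["att_w"]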