espnet2.s2st.synthesizer.translatotron.Translatotron
class espnet2.s2st.synthesizer.translatotron.Translatotron(idim: int, odim: int, embed_dim: int = 512, atype: str = 'multihead', adim: int = 512, aheads: int = 4, aconv_chans: int = 32, aconv_filts: int = 15, cumulate_att_w: bool = True, dlayers: int = 4, dunits: int = 1024, prenet_layers: int = 2, prenet_units: int = 32, postnet_layers: int = 5, postnet_chans: int = 512, postnet_filts: int = 5, output_activation: str | None = None, use_batch_norm: bool = True, use_concate: bool = True, use_residual: bool = False, reduction_factor: int = 2, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'concat', dropout_rate: float = 0.5, zoneout_rate: float = 0.1)
Bases: AbsSynthesizer
Translatotron synthesizer-related modules for speech-to-speech translation.
This module is part of the spectrogram prediction network in Translatotron, described in Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model, which converts the sequence of encoder hidden states into the sequence of Mel-filterbank features.
idim
Dimension of the inputs.
- Type: int
odim
Dimension of the outputs.
- Type: int
atype
Type of attention mechanism.
- Type: str
cumulate_att_w
Whether to cumulate previous attention weight.
- Type: bool
reduction_factor
Reduction factor for outputs.
- Type: int
output_activation_fn
Activation function for the output.
- Type: callable, optional
padding_idx
Index used for padding in input sequences.
- Type: int
spks
Number of speakers.
- Type: Optional[int]
langs
Number of languages.
- Type: Optional[int]
spk_embed_dim
Dimension of speaker embeddings.
- Type: Optional[int]
sid_emb
Embedding layer for speaker IDs.
- Type: torch.nn.Embedding, optional
lid_emb
Embedding layer for language IDs.
- Type: torch.nn.Embedding, optional
projection
Linear projection for speaker embeddings.
- Type: torch.nn.Linear, optional
Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- embed_dim (int) – Dimension of the token embedding (default=512).
- atype (str) – Type of attention (default=”multihead”).
- adim (int) – Number of dimensions of MLP in attention (default=512).
- aheads (int) – Number of attention heads (default=4).
- aconv_chans (int) – Number of attention convolution filter channels (default=32).
- aconv_filts (int) – Size of attention convolution filter (default=15).
- cumulate_att_w (bool) – Whether to cumulate previous attention weight (default=True).
- dlayers (int) – Number of decoder LSTM layers (default=4).
- dunits (int) – Number of decoder LSTM units (default=1024).
- prenet_layers (int) – Number of prenet layers (default=2).
- prenet_units (int) – Number of prenet units (default=32).
- postnet_layers (int) – Number of postnet layers (default=5).
- postnet_chans (int) – Number of postnet filter channels (default=512).
- postnet_filts (int) – Size of postnet filter (default=5).
- output_activation (Optional[str]) – Name of activation function for outputs (default=None).
- use_batch_norm (bool) – Whether to use batch normalization (default=True).
- use_concate (bool) – Whether to concatenate encoder outputs with decoder LSTM outputs (default=True).
- use_residual (bool) – Whether to use residual connections (default=False).
- reduction_factor (int) – Reduction factor (default=2).
- spks (Optional[int]) – Number of speakers (default=None).
- langs (Optional[int]) – Number of languages (default=None).
- spk_embed_dim (Optional[int]) – Speaker embedding dimension (default=None).
- spk_embed_integration_type (str) – How to integrate speaker embedding (default=”concat”).
- dropout_rate (float) – Dropout rate (default=0.5).
- zoneout_rate (float) – Zoneout rate (default=0.1).
####### Examples
>>> import torch
>>> from espnet2.s2st.synthesizer.translatotron import Translatotron
>>> model = Translatotron(idim=80, odim=80)
>>> enc_outputs = torch.randn(10, 100, 80)  # example encoder outputs (B, T, idim)
>>> enc_outputs_lengths = torch.randint(1, 100, (10,))
>>> feats = torch.randn(10, 50, 80)  # example target features (B, T_feats, odim)
>>> feats_lengths = torch.randint(1, 50, (10,))
>>> outputs = model(enc_outputs, enc_outputs_lengths, feats, feats_lengths)
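A hedged sketch of a multi-speaker, multilingual configuration, reusing the tensors above; the spks, langs, and spk_embed_dim values here are illustrative assumptions, not defaults:
>>> model = Translatotron(
...     idim=80, odim=80, spks=4, langs=2,
...     spk_embed_dim=192, spk_embed_integration_type="concat",
... )
>>> sids = torch.randint(0, 4, (10, 1))  # speaker IDs (B, 1)
>>> lids = torch.randint(0, 2, (10, 1))  # language IDs (B, 1)
>>> spembs = torch.randn(10, 192)  # speaker embeddings (B, spk_embed_dim)
>>> outputs = model(enc_outputs, enc_outputs_lengths, feats, feats_lengths,
...                 spembs=spembs, sids=sids, lids=lids)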
- Raises:
- ValueError – If an unsupported activation function is provided.
- NotImplementedError – If an unsupported attention type is provided.
Initialize Translatotron module.
- Parameters:
- idim (int) – Dimension of the inputs.
- odim (int) – Dimension of the outputs.
- adim (int) – Number of dimensions of the MLP in attention.
- atype (str) – Type of attention.
- aconv_chans (int) – Number of attention conv filter channels.
- aconv_filts (int) – Size of the attention conv filter.
- embed_dim (int) – Dimension of the token embedding.
- dlayers (int) – Number of decoder LSTM layers.
- dunits (int) – Number of decoder LSTM units.
- prenet_layers (int) – Number of prenet layers.
- prenet_units (int) – Number of prenet units.
- postnet_layers (int) – Number of postnet layers.
- postnet_filts (int) – Size of the postnet filter.
- postnet_chans (int) – Number of postnet filter channels.
- output_activation (str) – Name of activation function for outputs.
- cumulate_att_w (bool) – Whether to cumulate previous attention weight.
- use_batch_norm (bool) – Whether to use batch normalization.
- use_concate (bool) – Whether to concat enc outputs w/ dec lstm outputs.
- reduction_factor (int) – Reduction factor.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that sids will be provided as input and use the sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that lids will be provided as input and use the lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
- dropout_rate (float) – Dropout rate.
- zoneout_rate (float) – Zoneout rate.
forward(enc_outputs: Tensor, enc_outputs_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None) → Tuple[Tensor, Tensor]
Calculate forward propagation.
This method performs the forward pass through the Translatotron synthesizer, processing the encoder outputs and generating the corresponding target features. It also computes the attention weights and stop labels.
- Parameters:
- enc_outputs (Tensor) – Batch of padded encoder output sequences (B, T, idim).
- enc_outputs_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- Returns: A tuple containing:
  - after_outs: Output features after postnet refinement.
  - before_outs: Output features before postnet refinement.
  - logits: Logits for stop prediction.
  - att_ws: Attention weights.
  - ys: Ground truth features.
  - labels: Labels for stop prediction.
  - olens: Lengths of the output sequences.
- Return type: Tuple[torch.Tensor, ...]
NOTE
The method assumes that input tensors are properly padded and that their lengths are provided. The maximum lengths of the inputs are used to slice the tensors for processing.
####### Examples
>>> model = Translatotron(idim=80, odim=80)
>>> enc_outputs = torch.randn(2, 10, 80)  # batch of 2 encoder output sequences
>>> enc_outputs_lengths = torch.tensor([10, 8])
>>> feats = torch.randn(2, 20, 80)  # batch of 2 target feature sequences
>>> feats_lengths = torch.tensor([20, 18])
>>> after_outs, before_outs, logits, att_ws, ys, labels, olens = model(
...     enc_outputs, enc_outputs_lengths, feats, feats_lengths)
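The returned tensors can be plugged into a spectrogram and stop-token loss. A minimal, hedged sketch using plain unmasked losses for illustration only; this is not the exact loss ESPnet uses during training:
>>> import torch.nn.functional as F
>>> l1 = F.l1_loss(after_outs, ys) + F.l1_loss(before_outs, ys)
>>> bce = F.binary_cross_entropy_with_logits(logits, labels)
>>> loss = l1 + bce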
- Raises: ValueError – If the provided activation function name is invalid.
inference(enc_outputs: Tensor, feats: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Generate the sequence of features given the sequence of encoder outputs.
- Parameters:
- enc_outputs (Tensor) – Encoder output sequence (N, idim).
- feats (Optional[Tensor]) – Feature sequence to extract style (N, odim).
- spembs (Optional[Tensor]) – Speaker embedding (spk_embed_dim,).
- sids (Optional[Tensor]) – Speaker ID (1,).
- lids (Optional[Tensor]) – Language ID (1,).
- threshold (float) – Threshold in inference.
- minlenratio (float) – Minimum length ratio in inference.
- maxlenratio (float) – Maximum length ratio in inference.
- use_att_constraint (bool) – Whether to apply attention constraint.
- backward_window (int) – Backward window in attention constraint.
- forward_window (int) – Forward window in attention constraint.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
- Returns: Output dict including the following items:
  - feat_gen (Tensor): Output sequence of features (T_feats, odim).
  - prob (Tensor): Output sequence of stop probabilities (T_feats,).
  - att_w (Tensor): Attention weights (T_feats, T).
- Return type: Dict[str, Tensor]
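####### Examples
A minimal inference sketch, assuming a single utterance of 80-dimensional encoder outputs; with an untrained model the generated features are meaningless, so this only illustrates the call and the returned dict:
>>> import torch
>>> from espnet2.s2st.synthesizer.translatotron import Translatotron
>>> model = Translatotron(idim=80, odim=80)
>>> enc_outputs = torch.randn(100, 80)  # single utterance (N, idim)
>>> out = model.inference(enc_outputs, threshold=0.5, maxlenratio=10.0)
>>> feat_gen, prob, att_w = out["feat_gen"], out["prob"], out["att_w"]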