espnet2.s2st.synthesizer.translatotron2.Translatotron2

class espnet2.s2st.synthesizer.translatotron2.Translatotron2(idim: int, odim: int, synthesizer_type: str = 'rnn', layers: int = 2, units: int = 1024, prenet_layers: int = 2, prenet_units: int = 128, prenet_dropout_rate: float = 0.5, postnet_layers: int = 5, postnet_chans: int = 512, postnet_dropout_rate: float = 0.5, adim: int = 384, aheads: int = 4, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, duration_predictor_layers: int = 2, duration_predictor_type: str = 'rnn', duration_predictor_units: int = 128, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False)

Bases: AbsSynthesizer

Translatotron2 module.

This module implements the synthesizer of Translatotron 2, described in
"Translatotron 2: High-quality direct speech-to-speech translation with
voice preservation".

idim

Input dimension for the model.

Type: int

odim

Output dimension for the model.

Type: int

synthesizer_type

Type of synthesizer (default: “rnn”).

Type: str

layers

Number of layers in the synthesizer (default: 2).

Type: int

units

Number of units in each layer (default: 1024).

Type: int

prenet_layers

Number of layers in the prenet (default: 2).

Type: int

prenet_units

Number of units in the prenet (default: 128).

Type: int

prenet_dropout_rate

Dropout rate for the prenet (default: 0.5).

Type: float

postnet_layers

Number of layers in the postnet (default: 5).

Type: int

postnet_chans

Number of channels in the postnet (default: 512).

Type: int

postnet_dropout_rate

Dropout rate for the postnet (default: 0.5).

Type: float

adim

Dimension of the attention mechanism (default: 384).

Type: int

aheads

Number of attention heads (default: 4).

Type: int

conformer_rel_pos_type

Type of relative positional encoding (default: “legacy”).

Type: str

conformer_pos_enc_layer_type

Layer type for positional encoding (default: “rel_pos”).

Type: str

conformer_self_attn_layer_type

Layer type for self-attention (default: “rel_selfattn”).

Type: str

conformer_activation_type

Activation function type (default: “swish”).

Type: str

use_macaron_style_in_conformer

Whether to use Macaron style in conformer (default: True).

Type: bool

use_cnn_in_conformer

Whether to use CNN in conformer (default: True).

Type: bool

zero_triu

Whether to zero out the upper triangular part of the attention matrix (default: False).

Type: bool

conformer_enc_kernel_size

Kernel size for the conformer encoder (default: 7).

Type: int

conformer_dec_kernel_size

Kernel size for the conformer decoder (default: 31).

Type: int

duration_predictor_layers

Number of layers in the duration predictor (default: 2).

Type: int

duration_predictor_type

Type of duration predictor (default: “rnn”).

Type: str

duration_predictor_units

Number of units in the duration predictor (default: 128).

Type: int

spks

Number of speakers (default: None).

Type: Optional[int]

langs

Number of languages (default: None).

Type: Optional[int]

spk_embed_dim

Dimension of speaker embedding (default: None).

Type: Optional[int]

spk_embed_integration_type

Type of speaker embedding integration (default: “add”).

Type: str

init_type

Initialization type for the model (default: “xavier_uniform”).

Type: str

init_enc_alpha

Initialization alpha for encoder (default: 1.0).

Type: float

init_dec_alpha

Initialization alpha for decoder (default: 1.0).

Type: float

use_masking

Whether to use masking during training (default: False).

Type: bool

use_weighted_masking

Whether to use weighted masking during training (default: False).

Type: bool
Parameters:
- idim (int) – Input dimension for the model.
- odim (int) – Output dimension for the model.
- synthesizer_type (str, optional) – Type of synthesizer (default: “rnn”).
- layers (int, optional) – Number of layers in the synthesizer (default: 2).
- units (int, optional) – Number of units in each layer (default: 1024).
- prenet_layers (int, optional) – Number of layers in the prenet (default: 2).
- prenet_units (int, optional) – Number of units in the prenet (default: 128).
- prenet_dropout_rate (float, optional) – Dropout rate for the prenet (default: 0.5).
- postnet_layers (int, optional) – Number of layers in the postnet (default: 5).
- postnet_chans (int, optional) – Number of channels in the postnet (default: 512).
- postnet_dropout_rate (float, optional) – Dropout rate for the postnet (default: 0.5).
- adim (int, optional) – Dimension of the attention mechanism (default: 384).
- aheads (int, optional) – Number of attention heads (default: 4).
- conformer_rel_pos_type (str, optional) – Type of relative positional encoding (default: “legacy”).
- conformer_pos_enc_layer_type (str, optional) – Layer type for positional encoding (default: “rel_pos”).
- conformer_self_attn_layer_type (str, optional) – Layer type for self-attention (default: “rel_selfattn”).
- conformer_activation_type (str, optional) – Activation function type (default: “swish”).
- use_macaron_style_in_conformer (bool, optional) – Whether to use Macaron style in conformer (default: True).
- use_cnn_in_conformer (bool, optional) – Whether to use CNN in conformer (default: True).
- zero_triu (bool, optional) – Whether to zero out the upper triangular part of the attention matrix (default: False).
- conformer_enc_kernel_size (int, optional) – Kernel size for the conformer encoder (default: 7).
- conformer_dec_kernel_size (int, optional) – Kernel size for the conformer decoder (default: 31).
- duration_predictor_layers (int, optional) – Number of layers in the duration predictor (default: 2).
- duration_predictor_type (str, optional) – Type of duration predictor (default: “rnn”).
- duration_predictor_units (int, optional) – Number of units in the duration predictor (default: 128).
- spks (Optional[int], optional) – Number of speakers (default: None).
- langs (Optional[int], optional) – Number of languages (default: None).
- spk_embed_dim (Optional[int], optional) – Dimension of speaker embedding (default: None).
- spk_embed_integration_type (str, optional) – Type of speaker embedding integration (default: “add”).
- init_type (str, optional) – Initialization type for the model (default: “xavier_uniform”).
- init_enc_alpha (float, optional) – Initialization alpha for encoder (default: 1.0).
- init_dec_alpha (float, optional) – Initialization alpha for decoder (default: 1.0).
- use_masking (bool, optional) – Whether to use masking during training (default: False).
- use_weighted_masking (bool, optional) – Whether to use weighted masking during training (default: False).
Returns: None
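
The parameter list does not say how `use_masking` and `use_weighted_masking` differ. The following is a minimal sketch, assuming they follow the convention common in spectrogram-prediction losses: plain masking drops padded frames from the average, while weighted masking also normalizes per utterance so that short and long utterances contribute equally. The function names `masked_l1` and `weighted_masked_l1` are hypothetical and not part of ESPnet, and plain Python lists stand in for tensors; whether this class computes its loss this way is not confirmed here.

```python
from typing import List


def masked_l1(pred: List[List[float]], target: List[List[float]],
              lengths: List[int]) -> float:
    """Mean absolute error over all non-padded frames in the batch."""
    total, count = 0.0, 0
    for p_seq, t_seq, n in zip(pred, target, lengths):
        # Only the first n frames are real; the rest is padding.
        for p, t in zip(p_seq[:n], t_seq[:n]):
            total += abs(p - t)
            count += 1
    return total / count


def weighted_masked_l1(pred: List[List[float]], target: List[List[float]],
                       lengths: List[int]) -> float:
    """Per-utterance mean over valid frames, then averaged over the batch."""
    per_utt = []
    for p_seq, t_seq, n in zip(pred, target, lengths):
        err = sum(abs(p - t) for p, t in zip(p_seq[:n], t_seq[:n]))
        per_utt.append(err / n)
    return sum(per_utt) / len(per_utt)


# Two utterances padded to 4 frames; their true lengths are 4 and 1.
pred = [[1.0, 1.0, 1.0, 1.0], [3.0, 0.0, 0.0, 0.0]]
target = [[0.0, 0.0, 0.0, 0.0], [0.0, 9.0, 9.0, 9.0]]
lengths = [4, 1]

plain = masked_l1(pred, target, lengths)              # (4 * 1 + 3) / 5
weighted = weighted_masked_l1(pred, target, lengths)  # (4 / 4 + 3 / 1) / 2
print(plain, weighted)
```

In this toy batch the padded frames of the second utterance (where the target is 9.0) never enter either loss. The weighted variant gives the one-frame utterance the same influence as the four-frame one, which is why its value is higher.
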

Examples

>>> from espnet2.s2st.synthesizer.translatotron2 import Translatotron2
>>> model = Translatotron2(idim=80, odim=80)
>>> print(model)
Translatotron2(...)
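
With this many keyword arguments, it can help to keep the hyperparameters in one typed object and override only what differs from the defaults. The sketch below is one possible way to do that. `Translatotron2Config` and `to_kwargs` are hypothetical helpers, not part of ESPnet; the field names and defaults are copied from the documented signature. The constructor call is left commented out so the block runs without ESPnet installed.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Translatotron2Config:  # hypothetical helper, not an ESPnet class
    """Documented constructor arguments and their defaults."""

    idim: int
    odim: int
    synthesizer_type: str = "rnn"
    layers: int = 2
    units: int = 1024
    prenet_layers: int = 2
    prenet_units: int = 128
    prenet_dropout_rate: float = 0.5
    postnet_layers: int = 5
    postnet_chans: int = 512
    postnet_dropout_rate: float = 0.5
    adim: int = 384
    aheads: int = 4
    conformer_rel_pos_type: str = "legacy"
    conformer_pos_enc_layer_type: str = "rel_pos"
    conformer_self_attn_layer_type: str = "rel_selfattn"
    conformer_activation_type: str = "swish"
    use_macaron_style_in_conformer: bool = True
    use_cnn_in_conformer: bool = True
    zero_triu: bool = False
    conformer_enc_kernel_size: int = 7
    conformer_dec_kernel_size: int = 31
    duration_predictor_layers: int = 2
    duration_predictor_type: str = "rnn"
    duration_predictor_units: int = 128
    spks: Optional[int] = None
    langs: Optional[int] = None
    spk_embed_dim: Optional[int] = None
    spk_embed_integration_type: str = "add"
    init_type: str = "xavier_uniform"
    init_enc_alpha: float = 1.0
    init_dec_alpha: float = 1.0
    use_masking: bool = False
    use_weighted_masking: bool = False

    def to_kwargs(self) -> dict:
        """Return a plain dict that could be unpacked into the constructor."""
        return asdict(self)


# Override only what differs from the documented defaults.
cfg = Translatotron2Config(idim=80, odim=80, aheads=8, spks=10)
kwargs = cfg.to_kwargs()

# With ESPnet installed, the dict could be passed on like this:
# from espnet2.s2st.synthesizer.translatotron2 import Translatotron2
# model = Translatotron2(**kwargs)
print(kwargs["adim"], kwargs["aheads"], kwargs["spks"])
```

Keeping the defaults in a dataclass makes an experiment's deviations from the documented values explicit, and the same object can be serialized alongside a checkpoint.
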

NOTE

This class is part of the ESPnet2 speech-to-speech translation (S2ST) package, espnet2.s2st.

Initialize internal Module state, shared by both nn.Module and ScriptModule.