espnet2.s2st.synthesizer.translatotron2.Translatotron2
class espnet2.s2st.synthesizer.translatotron2.Translatotron2(idim: int, odim: int, synthesizer_type: str = 'rnn', layers: int = 2, units: int = 1024, prenet_layers: int = 2, prenet_units: int = 128, prenet_dropout_rate: float = 0.5, postnet_layers: int = 5, postnet_chans: int = 512, postnet_dropout_rate: float = 0.5, adim: int = 384, aheads: int = 4, conformer_rel_pos_type: str = 'legacy', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, zero_triu: bool = False, conformer_enc_kernel_size: int = 7, conformer_dec_kernel_size: int = 31, duration_predictor_layers: int = 2, duration_predictor_type: str = 'rnn', duration_predictor_units: int = 128, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'add', init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_masking: bool = False, use_weighted_masking: bool = False)
Bases: AbsSynthesizer
Translatotron2 module.
This is the synthesizer module of Translatotron 2, described in “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation”.
idim
Input dimension for the model.
- Type: int
odim
Output dimension for the model.
- Type: int
synthesizer_type
Type of synthesizer (default: “rnn”).
- Type: str
layers
Number of layers in the synthesizer (default: 2).
- Type: int
units
Number of units in each layer (default: 1024).
- Type: int
prenet_layers
Number of layers in the prenet (default: 2).
- Type: int
prenet_units
Number of units in the prenet (default: 128).
- Type: int
prenet_dropout_rate
Dropout rate for the prenet (default: 0.5).
- Type: float
postnet_layers
Number of layers in the postnet (default: 5).
- Type: int
postnet_chans
Number of channels in the postnet (default: 512).
- Type: int
postnet_dropout_rate
Dropout rate for the postnet (default: 0.5).
- Type: float
adim
Dimension of the attention mechanism (default: 384).
- Type: int
aheads
Number of attention heads (default: 4).
- Type: int
conformer_rel_pos_type
Type of relative positional encoding (default: “legacy”).
- Type: str
conformer_pos_enc_layer_type
Layer type for positional encoding (default: “rel_pos”).
- Type: str
conformer_self_attn_layer_type
Layer type for self-attention (default: “rel_selfattn”).
- Type: str
conformer_activation_type
Activation function type (default: “swish”).
- Type: str
use_macaron_style_in_conformer
Whether to use Macaron style in conformer (default: True).
- Type: bool
use_cnn_in_conformer
Whether to use CNN in conformer (default: True).
- Type: bool
zero_triu
Whether to zero out the upper triangular part of the attention matrix (default: False).
- Type: bool
conformer_enc_kernel_size
Kernel size for the conformer encoder (default: 7).
- Type: int
conformer_dec_kernel_size
Kernel size for the conformer decoder (default: 31).
- Type: int
duration_predictor_layers
Number of layers in the duration predictor (default: 2).
- Type: int
duration_predictor_type
Type of duration predictor (default: “rnn”).
- Type: str
duration_predictor_units
Number of units in the duration predictor (default: 128).
- Type: int
spks
Number of speakers (default: None).
- Type: Optional[int]
langs
Number of languages (default: None).
- Type: Optional[int]
spk_embed_dim
Dimension of speaker embedding (default: None).
- Type: Optional[int]
spk_embed_integration_type
Type of speaker embedding integration (default: “add”).
- Type: str
init_type
Initialization type for the model (default: “xavier_uniform”).
- Type: str
init_enc_alpha
Initialization alpha for encoder (default: 1.0).
- Type: float
init_dec_alpha
Initialization alpha for decoder (default: 1.0).
- Type: float
use_masking
Whether to use masking during training (default: False).
- Type: bool
use_weighted_masking
Whether to use weighted masking during training (default: False).
- Type: bool
Parameters:
- idim (int) – Input dimension for the model.
- odim (int) – Output dimension for the model.
- synthesizer_type (str, optional) – Type of synthesizer (default: “rnn”).
- layers (int, optional) – Number of layers in the synthesizer (default: 2).
- units (int, optional) – Number of units in each layer (default: 1024).
- prenet_layers (int, optional) – Number of layers in the prenet (default: 2).
- prenet_units (int, optional) – Number of units in the prenet (default: 128).
- prenet_dropout_rate (float, optional) – Dropout rate for the prenet (default: 0.5).
- postnet_layers (int, optional) – Number of layers in the postnet (default: 5).
- postnet_chans (int, optional) – Number of channels in the postnet (default: 512).
- postnet_dropout_rate (float, optional) – Dropout rate for the postnet (default: 0.5).
- adim (int, optional) – Dimension of the attention mechanism (default: 384).
- aheads (int, optional) – Number of attention heads (default: 4).
- conformer_rel_pos_type (str, optional) – Type of relative positional encoding (default: “legacy”).
- conformer_pos_enc_layer_type (str, optional) – Layer type for positional encoding (default: “rel_pos”).
- conformer_self_attn_layer_type (str, optional) – Layer type for self-attention (default: “rel_selfattn”).
- conformer_activation_type (str, optional) – Activation function type (default: “swish”).
- use_macaron_style_in_conformer (bool, optional) – Whether to use Macaron style in conformer (default: True).
- use_cnn_in_conformer (bool, optional) – Whether to use CNN in conformer (default: True).
- zero_triu (bool, optional) – Whether to zero out the upper triangular part of the attention matrix (default: False).
- conformer_enc_kernel_size (int, optional) – Kernel size for the conformer encoder (default: 7).
- conformer_dec_kernel_size (int, optional) – Kernel size for the conformer decoder (default: 31).
- duration_predictor_layers (int, optional) – Number of layers in the duration predictor (default: 2).
- duration_predictor_type (str, optional) – Type of duration predictor (default: “rnn”).
- duration_predictor_units (int, optional) – Number of units in the duration predictor (default: 128).
- spks (Optional[int], optional) – Number of speakers (default: None).
- langs (Optional[int], optional) – Number of languages (default: None).
- spk_embed_dim (Optional[int], optional) – Dimension of speaker embedding (default: None).
- spk_embed_integration_type (str, optional) – Type of speaker embedding integration (default: “add”).
- init_type (str, optional) – Initialization type for the model (default: “xavier_uniform”).
- init_enc_alpha (float, optional) – Initialization alpha for encoder (default: 1.0).
- init_dec_alpha (float, optional) – Initialization alpha for decoder (default: 1.0).
- use_masking (bool, optional) – Whether to use masking during training (default: False).
- use_weighted_masking (bool, optional) – Whether to use weighted masking during training (default: False).
Returns: None
Examples
>>> from espnet2.s2st.synthesizer.translatotron2 import Translatotron2
>>> model = Translatotron2(idim=80, odim=80)
>>> print(model)
Translatotron2(...)
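Because the constructor takes many interdependent keyword arguments, it can help to validate them before building the model. The following is a minimal sketch, not part of ESPnet: `check_translatotron2_kwargs` is a hypothetical helper, and the divisibility and odd-kernel checks are assumptions based on how multi-head attention and conformer convolution layers are commonly implemented, not on documented behavior of this class. The defaults are copied from the signature above.

```python
# Hypothetical helper (not an ESPnet API): sanity-check keyword arguments
# before passing them on as Translatotron2(**kwargs).


def check_translatotron2_kwargs(kwargs: dict) -> dict:
    """Merge kwargs with a subset of the documented defaults and check them."""
    # Defaults copied from the documented constructor signature (subset only).
    defaults = {
        "adim": 384,
        "aheads": 4,
        "prenet_dropout_rate": 0.5,
        "postnet_dropout_rate": 0.5,
        "conformer_enc_kernel_size": 7,
        "conformer_dec_kernel_size": 31,
    }
    merged = {**defaults, **kwargs}

    # idim and odim have no defaults in the signature, so they must be given.
    for name in ("idim", "odim"):
        if name not in merged:
            raise ValueError(f"{name} is required")

    # Assumption: the attention dimension is split evenly across heads.
    if merged["adim"] % merged["aheads"] != 0:
        raise ValueError("adim must be divisible by aheads")

    # Dropout rates are probabilities, so they must lie in [0, 1].
    for name in ("prenet_dropout_rate", "postnet_dropout_rate"):
        if not 0.0 <= merged[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")

    # Assumption: conformer convolution kernels are odd (symmetric padding).
    for name in ("conformer_enc_kernel_size", "conformer_dec_kernel_size"):
        if merged[name] % 2 == 0:
            raise ValueError(f"{name} must be odd")

    return merged


kwargs = check_translatotron2_kwargs({"idim": 80, "odim": 80, "aheads": 8})
per_head_dim = kwargs["adim"] // kwargs["aheads"]
print(per_head_dim)  # 384 // 8 = 48
```

The returned dictionary could then be passed to the constructor as `Translatotron2(**kwargs)` in an environment where ESPnet is installed.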
NOTE
This class is part of the ESPnet2 speech-to-speech translation (s2st) package.