espnet2.s2st.synthesizer.unity_synthesizer.UnitYSynthesizer
class espnet2.s2st.synthesizer.unity_synthesizer.UnitYSynthesizer(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet2.legacy.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'concat')
Bases: AbsSynthesizer
UnitY Synthesizer related modules for speech-to-speech translation.
This is the discrete unit prediction network described in
`Direct speech-to-speech translation with discrete units`_, which converts a sequence of encoder hidden states into a sequence of discrete units (derived from SSL models).
Transformer decoder for the discrete unit module.
- Parameters:
- vocab_size – output dimension (number of discrete units)
- encoder_output_size – dimension of attention
- attention_heads – number of heads in multi-head attention
- linear_units – number of units in the position-wise feed-forward layer
- num_blocks – number of decoder blocks
- dropout_rate – dropout rate
- self_attention_dropout_rate – dropout rate for self-attention
- input_layer – input layer type
- use_output_layer – whether to use an output layer
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
- normalize_before – whether to apply layer_norm before the first block
- concat_after – whether to concatenate the attention layer's input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x)
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that sids will be provided as input and use a speaker ID embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that lids will be provided as input and use a language ID embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
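A minimal instantiation sketch. The sizes below (1000 discrete units from an SSL quantizer, 512-dimensional encoder outputs) are hypothetical; the remaining arguments keep the defaults shown in the signature above.

```python
from espnet2.s2st.synthesizer.unity_synthesizer import UnitYSynthesizer

# Hypothetical dimensions: 1000 discrete units, 512-dim encoder hidden states.
synthesizer = UnitYSynthesizer(
    vocab_size=1000,          # number of discrete units (output dim)
    encoder_output_size=512,  # dimension of the encoder hidden states
    attention_heads=4,
    linear_units=2048,
    num_blocks=6,
    dropout_rate=0.1,
)
```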
forward(enc_outputs: Tensor, enc_outputs_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, return_last_hidden: bool = False, return_all_hiddens: bool = False) → Tuple[Tensor, Tensor]
Calculate forward propagation.
- Parameters:
- enc_outputs (Tensor) – Batch of padded encoder output sequences (B, T, idim).
- enc_outputs_lengths (LongTensor) – Batch of lengths of each input (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- Returns: Tuple of decoder hidden states (hs) and their lengths (hlens).
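A hypothetical forward pass with random inputs, reusing the synthesizer from the sketch above. Shapes follow the parameter list; because the default input_layer is 'embed', the targets are passed here as discrete unit ids (LongTensor), which is an assumption and may differ from the actual data pipeline.

```python
import torch

batch, src_len, tgt_len = 2, 50, 30
enc_outputs = torch.randn(batch, src_len, 512)    # encoder hidden states (B, T, idim)
enc_outputs_lengths = torch.tensor([50, 42])      # valid source lengths (B,)
feats = torch.randint(0, 1000, (batch, tgt_len))  # padded target discrete units (B, T_feats); assumed to be unit ids
feats_lengths = torch.tensor([30, 25])            # valid target lengths (B,)

hs, hlens = synthesizer(
    enc_outputs, enc_outputs_lengths, feats, feats_lengths
)
print(hs.shape, hlens)  # decoder hidden states and their lengths
```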
