espnet2.s2st.synthesizer.unity_synthesizer.UnitYSynthesizer
class espnet2.s2st.synthesizer.unity_synthesizer.UnitYSynthesizer(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, layer_drop_rate: float = 0.0, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, spk_embed_integration_type: str = 'concat')
Bases: AbsSynthesizer
UnitY Synthesizer related modules for speech-to-speech translation.
This module implements the discrete unit prediction network described in
`Direct speech-to-speech translation with discrete units`_, which converts a sequence of
encoder hidden states into a sequence of discrete units (derived from SSL models).
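For intuition only (this is not part of the class API): the “discrete units” are indices into a unit vocabulary, so a unit prediction network ultimately maps per-frame scores over that vocabulary to unit IDs. A minimal, hypothetical illustration in plain PyTorch:

import torch

unit_logits = torch.randn(32, 100, 5000)  # (batch, frames, unit vocabulary); made-up values
unit_ids = unit_logits.argmax(dim=-1)     # (batch, frames) predicted discrete unit IDs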
spks
Number of speakers. If set to > 1, assume that the speaker IDs will be provided as input and use a speaker embedding layer.
- Type: Optional[int]
langs
Number of languages. If set to > 1, assume that the language IDs will be provided as input and use a language embedding layer.
- Type: Optional[int]
spk_embed_dim
Speaker embedding dimension. If set to > 0, assume that speaker embeddings will be provided as input.
- Type: Optional[int]
sid_emb
Embedding layer for speaker IDs.
- Type: torch.nn.Embedding
lid_emb
Embedding layer for language IDs.
- Type: torch.nn.Embedding
decoder
Transformer decoder for discrete unit module.
- Type: TransformerDecoder
Parameters:
- vocab_size (int) – Output dimension.
- encoder_output_size (int) – Dimension of attention.
- attention_heads (int, optional) – Number of heads in multi-head attention. Defaults to 4.
- linear_units (int, optional) – Number of units in position-wise feed-forward. Defaults to 2048.
- num_blocks (int, optional) – Number of decoder blocks. Defaults to 6.
- dropout_rate (float, optional) – Dropout rate. Defaults to 0.1.
- positional_dropout_rate (float, optional) – Dropout rate for positional encoding. Defaults to 0.1.
- self_attention_dropout_rate (float, optional) – Dropout rate for self-attention. Defaults to 0.0.
- src_attention_dropout_rate (float, optional) – Dropout rate for source attention. Defaults to 0.0.
- input_layer (str, optional) – Input layer type. Defaults to “embed”.
- use_output_layer (bool, optional) – Whether to use an output layer. Defaults to True.
- pos_enc_class (type, optional) – PositionalEncoding or ScaledPositionalEncoding.
- normalize_before (bool, optional) – Whether to use layer norm before the first block. Defaults to True.
- concat_after (bool, optional) – Whether to concatenate the attention layer’s input and output. Defaults to False.
- layer_drop_rate (float, optional) – Layer drop rate. Defaults to 0.0.
- spks (Optional[int], optional) – Number of speakers. Defaults to None.
- langs (Optional[int], optional) – Number of languages. Defaults to None.
- spk_embed_dim (Optional[int], optional) – Speaker embedding dimension. Defaults to None.
- spk_embed_integration_type (str, optional) – How to integrate speaker embedding. Defaults to “concat”.
####### Examples
Initialize the synthesizer:

import torch

from espnet2.s2st.synthesizer.unity_synthesizer import UnitYSynthesizer

synthesizer = UnitYSynthesizer(
    vocab_size=5000, encoder_output_size=256, spks=2, langs=3,
    spk_embed_dim=64, spk_embed_integration_type="add",
)
Forward pass:

enc_outputs = torch.randn(32, 100, 256)             # (batch_size, max_time, enc_dim)
enc_outputs_lengths = torch.randint(1, 101, (32,))  # (batch_size,)
feats = torch.randn(32, 50, 256)                    # (batch_size, max_time, feat_dim)
feats_lengths = torch.randint(1, 51, (32,))         # (batch_size,)
hs, hlens = synthesizer.forward(enc_outputs, enc_outputs_lengths, feats, feats_lengths)
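Since the synthesizer above was constructed with spks=2, langs=3, and spk_embed_dim=64, the forward call can also receive the optional conditioning tensors. The following continuation of the example is a sketch that follows the shapes documented for forward below:

sids = torch.randint(0, 2, (32, 1))   # (batch_size, 1) speaker IDs
lids = torch.randint(0, 3, (32, 1))   # (batch_size, 1) language IDs
spembs = torch.randn(32, 64)          # (batch_size, spk_embed_dim)

hs, hlens = synthesizer.forward(
    enc_outputs, enc_outputs_lengths, feats, feats_lengths,
    spembs=spembs, sids=sids, lids=lids,
)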
NOTE
The integration of speaker embeddings can be done through either concatenation or addition based on the specified integration type.
- Raises:
- ValueError – If the specified speaker embedding integration type is not supported.
- NotImplementedError – If the integration type is not “add” or “concat”.
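For intuition, the "add" and "concat" strategies are commonly implemented along the following lines. This is a minimal sketch of the general pattern, not necessarily the exact code of this class; the helper name integrate_spk_embed and the projection argument are illustrative assumptions (projection maps spk_embed_dim to the hidden size for "add", and hidden size + spk_embed_dim back to the hidden size for "concat"):

import torch
import torch.nn.functional as F

def integrate_spk_embed(hs, spembs, projection, integration_type="concat"):
    # hs: (B, T, adim) hidden states, spembs: (B, spk_embed_dim) speaker embeddings.
    if integration_type == "add":
        # Project the (normalized) embedding to the hidden size and add it to every frame.
        spembs = projection(F.normalize(spembs))
        hs = hs + spembs.unsqueeze(1)
    elif integration_type == "concat":
        # Expand along time, concatenate on the feature axis, then project back to adim.
        spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)
        hs = projection(torch.cat([hs, spembs], dim=-1))
    else:
        raise NotImplementedError("support only add or concat.")
    return hs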
Transformer decoder for discrete unit module.
- Parameters:
- vocab_size – output dimension
- encoder_output_size – dimension of attention
- attention_heads – the number of heads in multi-head attention
- linear_units – the number of units in position-wise feed-forward layers
- num_blocks – the number of decoder blocks
- dropout_rate – dropout rate
- self_attention_dropout_rate – dropout rate for self-attention
- input_layer – input layer type
- use_output_layer – whether to use output layer
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
- normalize_before – whether to use layer_norm before the first block
- concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x).
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use the sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use the lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- spk_embed_integration_type (str) – How to integrate speaker embedding.
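As a rough sketch of how such ID-based conditioning is typically wired up (the hidden size and the way the embeddings are added to the hidden states are assumptions for illustration, not taken from this class):

import torch

adim = 256                             # assumed hidden (attention) dimension
sid_emb = torch.nn.Embedding(2, adim)  # spks=2: one embedding vector per speaker ID
lid_emb = torch.nn.Embedding(3, adim)  # langs=3: one embedding vector per language ID

hs = torch.randn(4, 100, adim)         # (B, T, adim) hidden states
sids = torch.randint(0, 2, (4, 1))     # (B, 1) speaker IDs
lids = torch.randint(0, 3, (4, 1))     # (B, 1) language IDs

# Add the ID embeddings to every frame of the hidden-state sequence.
hs = hs + sid_emb(sids.view(-1)).unsqueeze(1)
hs = hs + lid_emb(lids.view(-1)).unsqueeze(1)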
forward(enc_outputs: Tensor, enc_outputs_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, return_last_hidden: bool = False, return_all_hiddens: bool = False) → Tuple[Tensor, Tensor]
Calculate forward propagation.
This method performs the forward pass of the UnitYSynthesizer, processing the encoder outputs, target features, and optional speaker and language embeddings to generate hidden states and their lengths.
- Parameters:
- enc_outputs (torch.Tensor) – Batch of padded encoder hidden states (B, T, idim).
- enc_outputs_lengths (torch.Tensor) – Batch of lengths of each input batch (B,).
- feats (torch.Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (torch.Tensor) – Batch of the lengths of each target (B,).
- spembs (Optional[torch.Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[torch.Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[torch.Tensor]) – Batch of language IDs (B, 1).
- return_last_hidden (bool , optional) – Whether to return the last hidden state. Defaults to False.
- return_all_hiddens (bool , optional) – Whether to return all hidden states. Defaults to False.
- Returns: A tuple containing:
  - hs (torch.Tensor): Hidden states.
  - hlens (torch.Tensor): Lengths of the hidden states.
- Return type: Tuple[torch.Tensor, torch.Tensor]
####### Examples
Example of calling the forward method:

hs, hlens = synthesizer.forward(
    enc_outputs, enc_outputs_lengths, feats, feats_lengths,
    spembs=spembs, sids=sids, lids=lids,
)
NOTE
Ensure that the dimensions of the inputs match the expected sizes as defined in the method arguments.
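If hs is padded to the longest sequence in the batch, as the returned hlens suggests, the lengths can be turned into a padding mask before any downstream pooling or loss computation. A minimal pure-PyTorch sketch, assuming hs and hlens come from the forward call above:

import torch

B, T, _ = hs.shape
mask = torch.arange(T, device=hs.device).unsqueeze(0) < hlens.unsqueeze(1)  # (B, T); True on valid frames
hs = hs * mask.unsqueeze(-1)  # zero out padded frames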