espnet2.tts.gst.style_encoder.StyleEncoder
class espnet2.tts.gst.style_encoder.StyleEncoder(idim: int = 80, gst_tokens: int = 10, gst_token_dim: int = 256, gst_heads: int = 4, conv_layers: int = 6, conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), conv_kernel_size: int = 3, conv_stride: int = 2, gru_layers: int = 1, gru_units: int = 128)
Bases: Module
Style encoder of GST-Tacotron.
This module implements the style encoder introduced in Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
ref_enc
Reference encoder module for feature extraction.
- Type: ReferenceEncoder
stl
Style token layer for generating style embeddings.
- Type: StyleTokenLayer
Parameters:
- idim (int, optional) – Dimension of the input mel-spectrogram. Default is 80.
- gst_tokens (int, optional) – The number of GST embeddings. Default is 10.
- gst_token_dim (int, optional) – Dimension of each GST embedding. Default is 256.
- gst_heads (int, optional) – The number of heads in GST multi-head attention. Default is 4.
- conv_layers (int, optional) – The number of conv layers in the reference encoder. Default is 6.
- conv_chans_list (Sequence[int], optional) – List of the numbers of channels of conv layers in the reference encoder. Default is (32, 32, 64, 64, 128, 128).
- conv_kernel_size (int, optional) – Kernel size of conv layers in the reference encoder. Default is 3.
- conv_stride (int, optional) – Stride of conv layers in the reference encoder. Default is 2.
- gru_layers (int, optional) – The number of GRU layers in the reference encoder. Default is 1.
- gru_units (int, optional) – The number of GRU units in the reference encoder. Default is 128.
Examples
Initialize the StyleEncoder with default parameters and run a forward pass with dummy input:

>>> import torch
>>> style_encoder = StyleEncoder()
>>> dummy_input = torch.randn(4, 100, 80)  # (B, Lmax, idim)
>>> style_embeddings = style_encoder(dummy_input)
>>> style_embeddings.shape  # (B, gst_token_dim)
torch.Size([4, 256])
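A non-default configuration can be sketched as follows; the token count, embedding dimension, and head count here are illustrative choices, and gst_token_dim is typically chosen divisible by gst_heads so the multi-head attention splits evenly:

>>> style_encoder = StyleEncoder(
...     idim=80,
...     gst_tokens=16,     # illustrative hyperparameters
...     gst_token_dim=128,
...     gst_heads=8,
... )
>>> mels = torch.randn(2, 200, 80)  # (B, Lmax, idim)
>>> style_encoder(mels).shape
torch.Size([2, 128])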
Initialize the global style encoder module.
forward(speech: Tensor) → Tensor
Calculate forward propagation.
This method computes the forward pass of the style encoder by taking the input speech features and producing style token embeddings.
- Parameters: speech (Tensor) – Batch of padded target features with shape (B, Lmax, idim), where B is the batch size, Lmax is the maximum sequence length, and idim is the input feature dimension (the number of mel bins).
- Returns: Style token embeddings with shape (B, token_dim), where token_dim is the dimension of the generated style tokens (gst_token_dim).
- Return type: Tensor
Examples
>>> import torch
>>> style_encoder = StyleEncoder()
>>> input_speech = torch.randn(16, 100, 80)  # (B, Lmax, idim)
>>> output_style_embs = style_encoder(input_speech)
>>> output_style_embs.shape
torch.Size([16, 256])
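As a downstream usage sketch (not part of this module's API), the utterance-level style embedding is commonly broadcast along the time axis of the text-encoder states before decoding; the encoder-state tensor below is a hypothetical stand-in:

>>> ref_mels = torch.randn(8, 120, 80)    # (B, Lmax, idim) reference mels
>>> enc_states = torch.randn(8, 50, 256)  # hypothetical text-encoder states (B, Ttext, 256)
>>> style_embs = style_encoder(ref_mels)  # (B, gst_token_dim)
>>> conditioned = enc_states + style_embs.unsqueeze(1)  # broadcast over time
>>> conditioned.shape
torch.Size([8, 50, 256])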