espnet2.tts.gst.style_encoder.StyleEncoder
class espnet2.tts.gst.style_encoder.StyleEncoder(idim: int = 80, gst_tokens: int = 10, gst_token_dim: int = 256, gst_heads: int = 4, conv_layers: int = 6, conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), conv_kernel_size: int = 3, conv_stride: int = 2, gru_layers: int = 1, gru_units: int = 128)
Bases: Module
Style encoder of GST-Tacotron.
This module implements the style encoder introduced in Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
ref_enc
Reference encoder module for feature extraction.
- Type: ReferenceEncoder
stl
Style token layer for generating style embeddings.
- Type: StyleTokenLayer
Parameters:
- idim (int, optional) – Dimension of the input mel-spectrogram. Default is 80.
- gst_tokens (int, optional) – The number of GST embeddings. Default is 10.
- gst_token_dim (int, optional) – Dimension of each GST embedding. Default is 256.
- gst_heads (int, optional) – The number of heads in GST multi-head attention. Default is 4.
- conv_layers (int, optional) – The number of conv layers in the reference encoder. Default is 6.
- conv_chans_list (Sequence[int], optional) – List of the numbers of channels of conv layers in the reference encoder. Default is (32, 32, 64, 64, 128, 128).
- conv_kernel_size (int, optional) – Kernel size of conv layers in the reference encoder. Default is 3.
- conv_stride (int, optional) – Stride of conv layers in the reference encoder. Default is 2.
- gru_layers (int, optional) – The number of GRU layers in the reference encoder. Default is 1.
- gru_units (int, optional) – The number of GRU units in the reference encoder. Default is 128.
Examples
Initialize the StyleEncoder with default parameters and run a forward pass with dummy input:

>>> import torch
>>> style_encoder = StyleEncoder()
>>> dummy_input = torch.randn(4, 100, 80)  # (B, Lmax, idim)
>>> style_embeddings = style_encoder(dummy_input)
>>> style_embeddings.shape  # (B, gst_token_dim)
torch.Size([4, 256])
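A non-default configuration can be sketched as follows; the token count, embedding dimension, and head count here are illustrative choices, and gst_token_dim is typically chosen divisible by gst_heads so the multi-head attention splits evenly:

>>> style_encoder = StyleEncoder(
...     idim=80,
...     gst_tokens=16,     # illustrative hyperparameters
...     gst_token_dim=128,
...     gst_heads=8,
... )
>>> mels = torch.randn(2, 200, 80)  # (B, Lmax, idim)
>>> style_encoder(mels).shape
torch.Size([2, 128])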
Initialize the global style encoder module.
forward(speech: Tensor) → Tensor
Calculate forward propagation.
This method computes the forward pass of the style encoder by taking the input speech features and producing style token embeddings.
- Parameters: speech (Tensor) – Batch of padded target features with shape (B, Lmax, idim), where B is the batch size, Lmax is the maximum sequence length, and idim is the input feature dimension (the number of mel bins).
- Returns: Style token embeddings with shape (B, token_dim), where token_dim is the dimension of the generated style tokens (gst_token_dim).
- Return type: Tensor
Examples
>>> import torch
>>> style_encoder = StyleEncoder()
>>> input_speech = torch.randn(16, 100, 80)  # (B, Lmax, idim)
>>> output_style_embs = style_encoder(input_speech)
>>> output_style_embs.shape
torch.Size([16, 256])
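As a downstream usage sketch (not part of this module's API), the utterance-level style embedding is commonly broadcast along the time axis of the text-encoder states before decoding; the encoder-state tensor below is a hypothetical stand-in:

>>> ref_mels = torch.randn(8, 120, 80)    # (B, Lmax, idim) reference mels
>>> enc_states = torch.randn(8, 50, 256)  # hypothetical text-encoder states (B, Ttext, 256)
>>> style_embs = style_encoder(ref_mels)  # (B, gst_token_dim)
>>> conditioned = enc_states + style_embs.unsqueeze(1)  # broadcast over time
>>> conditioned.shape
torch.Size([8, 50, 256])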