espnet2.tts.gst.style_encoder.StyleTokenLayer
class espnet2.tts.gst.style_encoder.StyleTokenLayer(ref_embed_dim: int = 128, gst_tokens: int = 10, gst_token_dim: int = 256, gst_heads: int = 4, dropout_rate: float = 0.0)
Bases: Module
Style token layer module.
This module is a style token layer introduced in Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
- Parameters:
- ref_embed_dim (int, optional) – Dimension of the input reference embedding.
- gst_tokens (int, optional) – The number of GST embeddings.
- gst_token_dim (int, optional) – Dimension of each GST embedding.
- gst_heads (int, optional) – The number of heads in GST multi-head attention.
- dropout_rate (float, optional) – Dropout rate in multi-head attention.
- Returns: Style token embeddings (B, gst_token_dim).
- Return type: Tensor
Examples
>>> import torch
>>> from espnet2.tts.gst.style_encoder import StyleTokenLayer
>>> layer = StyleTokenLayer(ref_embed_dim=128, gst_tokens=10,
...                         gst_token_dim=256, gst_heads=4)
>>> ref_embs = torch.randn(32, 128)  # Batch of reference embeddings (B, ref_embed_dim)
>>> style_embs = layer(ref_embs)
>>> style_embs.shape
torch.Size([32, 256])
Initialize style token layer module.
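To make the computation concrete, here is a minimal, self-contained sketch of what a style token layer does: a bank of learned GST embeddings is attended over, with the reference embedding acting as the attention query. This is an illustrative re-implementation, not ESPnet's code; the class name TinyStyleTokenLayer, the query projection, and the use of torch.nn.MultiheadAttention are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


class TinyStyleTokenLayer(nn.Module):
    """Hypothetical sketch of a GST-style token layer (illustration only)."""

    def __init__(self, ref_embed_dim=128, gst_tokens=10,
                 gst_token_dim=256, gst_heads=4, dropout_rate=0.0):
        super().__init__()
        # Bank of learned style tokens; each token lives in a per-head subspace.
        self.gst_embs = nn.Parameter(
            torch.randn(gst_tokens, gst_token_dim // gst_heads))
        # Project the reference embedding to the attention model dimension
        # (assumption for the sketch; ESPnet's internals may differ).
        self.query_proj = nn.Linear(ref_embed_dim, gst_token_dim)
        self.mha = nn.MultiheadAttention(
            embed_dim=gst_token_dim,
            num_heads=gst_heads,
            kdim=gst_token_dim // gst_heads,
            vdim=gst_token_dim // gst_heads,
            dropout=dropout_rate,
            batch_first=True,
        )

    def forward(self, ref_embs: torch.Tensor) -> torch.Tensor:
        # ref_embs: (B, ref_embed_dim) -> single-step query (B, 1, gst_token_dim)
        query = self.query_proj(ref_embs).unsqueeze(1)
        # Keys/values are the tanh-squashed style tokens, shared across the batch.
        tokens = torch.tanh(self.gst_embs).unsqueeze(0).expand(ref_embs.size(0), -1, -1)
        style_embs, _ = self.mha(query, tokens, tokens)
        return style_embs.squeeze(1)  # (B, gst_token_dim)


layer = TinyStyleTokenLayer()
print(layer(torch.randn(32, 128)).shape)  # torch.Size([32, 256])
```

The key point mirrored here is that the reference embedding never selects a single token; the attention weights form a soft combination of all tokens, which is what makes the learned style space continuous.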
forward(ref_embs: Tensor) → Tensor
Calculate forward propagation.
This method computes the forward pass of the StyleTokenLayer module. It attends over the learned style tokens using the reference embedding as the query and returns the resulting style token embeddings.
- Parameters: ref_embs (Tensor) – Batch of reference embeddings with shape (B, ref_embed_dim), where B is the batch size and ref_embed_dim is the dimension of the reference embedding.
- Returns: Style token embeddings with shape (B, gst_token_dim), where gst_token_dim is the dimension of each style token embedding.
- Return type: Tensor
Examples
>>> layer = StyleTokenLayer(ref_embed_dim=128)
>>> ref_embs = torch.randn(32, 128)  # Batch of reference embeddings
>>> style_embs = layer(ref_embs)
>>> print(style_embs.shape)
torch.Size([32, 256])
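In a typical GST-based TTS pipeline, the returned style embedding conditions the text encoder output; one common pattern is to broadcast it along the time axis and add it to the encoder states. The tensor text_hidden and its shape below are assumptions made for illustration, not part of this API.

```python
import torch
from espnet2.tts.gst.style_encoder import StyleTokenLayer

layer = StyleTokenLayer(ref_embed_dim=128, gst_tokens=10,
                        gst_token_dim=256, gst_heads=4)
ref_embs = torch.randn(32, 128)       # reference embeddings (B, ref_embed_dim)
style_embs = layer(ref_embs)          # (B, 256)

# Hypothetical text encoder output (B, Tmax, adim); adim matches gst_token_dim here.
text_hidden = torch.randn(32, 50, 256)
conditioned = text_hidden + style_embs.unsqueeze(1)  # broadcast over time and add
print(conditioned.shape)              # torch.Size([32, 50, 256])
```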