espnet2.tts.gst.style_encoder.StyleTokenLayer
class espnet2.tts.gst.style_encoder.StyleTokenLayer(ref_embed_dim: int = 128, gst_tokens: int = 10, gst_token_dim: int = 256, gst_heads: int = 4, dropout_rate: float = 0.0)
Bases: Module
Style token layer module.
This module is a style token layer introduced in Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.
- Parameters:
- ref_embed_dim (int, optional) – Dimension of the input reference embedding.
- gst_tokens (int, optional) – The number of GST embeddings.
- gst_token_dim (int, optional) – Dimension of each GST embedding.
- gst_heads (int, optional) – The number of heads in GST multi-head attention.
- dropout_rate (float, optional) – Dropout rate in multi-head attention.
- Returns: Style token embeddings (B, gst_token_dim).
- Return type: Tensor
Examples
>>> import torch
>>> from espnet2.tts.gst.style_encoder import StyleTokenLayer
>>> layer = StyleTokenLayer(ref_embed_dim=128, gst_tokens=10,
...                         gst_token_dim=256, gst_heads=4)
>>> ref_embs = torch.randn(32, 128)  # Batch of reference embeddings (B, ref_embed_dim)
>>> style_embs = layer(ref_embs)
>>> style_embs.shape
torch.Size([32, 256])
Initialize style token layer module.
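To make the computation concrete, here is a minimal, self-contained sketch of what a style token layer does: a bank of learned GST embeddings is attended over, with the reference embedding acting as the attention query. This is an illustrative re-implementation, not ESPnet's code; the class name TinyStyleTokenLayer, the query projection, and the use of torch.nn.MultiheadAttention are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


class TinyStyleTokenLayer(nn.Module):
    """Hypothetical sketch of a GST-style token layer (illustration only)."""

    def __init__(self, ref_embed_dim=128, gst_tokens=10,
                 gst_token_dim=256, gst_heads=4, dropout_rate=0.0):
        super().__init__()
        # Bank of learned style tokens; each token lives in a per-head subspace.
        self.gst_embs = nn.Parameter(
            torch.randn(gst_tokens, gst_token_dim // gst_heads))
        # Project the reference embedding to the attention model dimension
        # (assumption for the sketch; ESPnet's internals may differ).
        self.query_proj = nn.Linear(ref_embed_dim, gst_token_dim)
        self.mha = nn.MultiheadAttention(
            embed_dim=gst_token_dim,
            num_heads=gst_heads,
            kdim=gst_token_dim // gst_heads,
            vdim=gst_token_dim // gst_heads,
            dropout=dropout_rate,
            batch_first=True,
        )

    def forward(self, ref_embs: torch.Tensor) -> torch.Tensor:
        # ref_embs: (B, ref_embed_dim) -> single-step query (B, 1, gst_token_dim)
        query = self.query_proj(ref_embs).unsqueeze(1)
        # Keys/values are the tanh-squashed style tokens, shared across the batch.
        tokens = torch.tanh(self.gst_embs).unsqueeze(0).expand(ref_embs.size(0), -1, -1)
        style_embs, _ = self.mha(query, tokens, tokens)
        return style_embs.squeeze(1)  # (B, gst_token_dim)


layer = TinyStyleTokenLayer()
print(layer(torch.randn(32, 128)).shape)  # torch.Size([32, 256])
```

The key point mirrored here is that the reference embedding never selects a single token; the attention weights form a soft combination of all tokens, which is what makes the learned style space continuous.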
forward(ref_embs: Tensor) → Tensor
Calculate forward propagation.
This method computes the forward pass of the StyleTokenLayer module. It attends over the learned style tokens using the reference embedding as the query and returns the resulting style token embeddings.
- Parameters: ref_embs (Tensor) – Batch of reference embeddings with shape (B, ref_embed_dim), where B is the batch size and ref_embed_dim is the dimension of the reference embedding.
- Returns: Style token embeddings with shape (B, gst_token_dim), where gst_token_dim is the dimension of each style token embedding.
- Return type: Tensor
Examples
>>> layer = StyleTokenLayer(ref_embed_dim=128)
>>> ref_embs = torch.randn(32, 128)  # Batch of reference embeddings
>>> style_embs = layer(ref_embs)
>>> print(style_embs.shape)
torch.Size([32, 256])
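In a typical GST-based TTS pipeline, the returned style embedding conditions the text encoder output; one common pattern is to broadcast it along the time axis and add it to the encoder states. The tensor text_hidden and its shape below are assumptions made for illustration, not part of this API.

```python
import torch
from espnet2.tts.gst.style_encoder import StyleTokenLayer

layer = StyleTokenLayer(ref_embed_dim=128, gst_tokens=10,
                        gst_token_dim=256, gst_heads=4)
ref_embs = torch.randn(32, 128)       # reference embeddings (B, ref_embed_dim)
style_embs = layer(ref_embs)          # (B, 256)

# Hypothetical text encoder output (B, Tmax, adim); adim matches gst_token_dim here.
text_hidden = torch.randn(32, 50, 256)
conditioned = text_hidden + style_embs.unsqueeze(1)  # broadcast over time and add
print(conditioned.shape)              # torch.Size([32, 50, 256])
```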