espnet2.mt.frontend.embedding.CodecEmbedding
class espnet2.mt.frontend.embedding.CodecEmbedding(input_size, hf_model_tag: str = 'espnet/amuse_encodec_16k', token_bias: int = 2, token_per_frame: int = 8, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, positional_dropout_rate: float = 0.1)
Bases: AbsFrontend
Use the codec dequantization process and the input embeddings.
This class implements a codec embedding layer that relies on a pre-trained codec model. It applies the codec's dequantization process to the quantized input tokens, turning them into continuous features suitable for downstream processing.
quantizer
The quantizer from the pre-trained codec model.
codebook_size
The size of the codebook used in the codec.
codebook_dim
The dimensionality of the codebook.
token_bias
The index of the first codec code.
token_per_frame
The number of tokens per frame in the input.
vocab_size
The size of the input vocabulary.
pos
Positional encoding layer.
ln
Layer normalization layer.
decoder
Decoder from the pre-trained codec model.
- Parameters:
- input_size – Size of the input vocabulary.
- hf_model_tag – HuggingFace model tag for ESPnet codec models.
- token_bias – The index of the first codec code.
- token_per_frame – Number of tokens per frame in the input.
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding class.
- positional_dropout_rate – Dropout rate after adding positional encoding.
- Raises: AssertionError – If input dimensions or lengths are invalid.
######### Examples
>>> codec_embedding = CodecEmbedding(input_size=512)
>>> input_tensor = torch.randint(2, 512, (8, 64))  # (batch_size, seq_len); codec codes start at token_bias
>>> input_lengths = torch.full((8,), 64)  # all lengths divisible by token_per_frame
>>> output, lengths = codec_embedding(input_tensor, input_lengths)
NOTE
The input tensor must have dimensions of (B, T) where B is the batch size and T is the total number of tokens. Additionally, the length of input tensors must be divisible by token_per_frame.
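The length constraint and the role of token_bias can be illustrated with a small standalone sketch (plain Python, no ESPnet dependency; the helper name and the grouping logic are illustrative assumptions, not the class's actual implementation):

```python
def tokens_to_frames(tokens, token_per_frame=8, token_bias=2):
    """Group a flat codec-token sequence into frames and strip the bias.

    tokens: a length-T sequence of vocabulary indices; T must be divisible
    by token_per_frame (this mirrors the assertion in CodecEmbedding).
    Returns a list of frames, each holding token_per_frame codec codes.
    """
    T = len(tokens)
    assert T % token_per_frame == 0, "length must be divisible by token_per_frame"
    codes = [t - token_bias for t in tokens]  # map vocabulary ids to codec codes
    return [codes[i:i + token_per_frame] for i in range(0, T, token_per_frame)]

# 16 tokens with the default settings -> 2 frames of 8 codes each
print(tokens_to_frames(list(range(2, 18))))
```

A 17-token input would trip the assertion, just as an input length not divisible by token_per_frame does in the real frontend.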
Initialize.
- Parameters:
- hf_model_tag – HuggingFace model tag for ESPnet codec models
- token_bias – the index of the first codec code
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
- positional_dropout_rate – dropout rate after adding positional encoding
forward(input: Tensor, input_lengths: Tensor)
Use the codec dequantization process and the input embeddings.
This method dequantizes the input codec tokens with the pre-trained codec model, then applies layer normalization and positional encoding to produce the output features.
hf_model_tag
HuggingFace model tag for ESPnet codec models.
- Type: str
token_bias
The index of the first codec code.
- Type: int
token_per_frame
Number of tokens per frame in the input.
- Type: int
vocab_size
Size of the input vocabulary.
- Type: int
codebook_size
Size of the codec’s codebook.
- Type: int
codebook_dim
Dimension of the codec’s codebook.
- Type: int
pos
Positional encoding layer.
- Type: torch.nn.Module
ln
Layer normalization.
- Type: torch.nn.LayerNorm
decoder
Decoder from the codec model.
- Type: torch.nn.Module
- Parameters:
- input_size – Size of the input vocabulary.
- hf_model_tag (str) – HuggingFace model tag for ESPnet codec models.
- token_bias (int) – The index of the first codec code.
- token_per_frame (int) – Number of tokens per frame in the input.
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding.
- positional_dropout_rate (float) – Dropout rate after adding positional encoding.
- Returns: A tuple of the dequantized feature tensor and the corresponding output lengths.
- Raises: AssertionError – If the input tensor’s dimensions or values are invalid.
######### Examples
>>> codec_embedding = CodecEmbedding(input_size=400)
>>> input_tensor = torch.arange(2, 18).reshape(2, 8)  # token_per_frame tokens per sequence, starting at token_bias
>>> input_lengths = torch.tensor([8, 8])
>>> output, output_lengths = codec_embedding(input_tensor, input_lengths)
NOTE
The class uses an external model for codec inference and requires that the model be pre-trained and available through the specified HuggingFace model tag.
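For intuition about the dequantization step itself: residual vector quantization, as used by EnCodec-style codecs, reconstructs each frame by summing one codebook entry per quantizer stage. A minimal plain-Python sketch, with made-up codebooks and dimensions (not the actual quantizer loaded from the HuggingFace model):

```python
def dequantize_frame(frame_codes, codebooks):
    """Sum one codebook vector per residual-VQ stage to rebuild a frame.

    frame_codes: one code index per quantizer stage (token_per_frame of them).
    codebooks: codebooks[stage][code] is a vector of codebook_dim floats.
    """
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for stage, code in enumerate(frame_codes):
        # Each stage's entry refines the residual left by earlier stages.
        out = [o + v for o, v in zip(out, codebooks[stage][code])]
    return out

# Two stages with codebook_dim = 2 (toy values for illustration only)
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],    # stage 0
    [[0.5, 0.5], [0.25, 0.25]],  # stage 1
]
print(dequantize_frame([0, 1], codebooks))  # [1.25, 0.25]
```

In CodecEmbedding the analogous lookup-and-sum is performed by the pre-trained quantizer, after which the result is normalized and given positional encodings.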
output_size() → int
Return the size of the output feature dimension D, i.e., the embedding dimension.
This method provides the dimensionality of the output features generated by the embedding layer. This is particularly useful for understanding the size of the data that will be passed to subsequent layers in the neural network.
- Returns: The dimensionality of the output features, which is equal to the embedding dimension.
- Return type: int
######### Examples
>>> embedding = CodecEmbedding(input_size=400)
>>> output_dim = embedding.output_size()
>>> print(output_dim)  # the embedding dimension of the underlying codec model