espnet2.mt.frontend.embedding.CodecEmbedding
class espnet2.mt.frontend.embedding.CodecEmbedding(input_size, hf_model_tag: str = 'espnet/amuse_encodec_16k', token_bias: int = 2, token_per_frame: int = 8, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, positional_dropout_rate: float = 0.1)
Bases: AbsFrontend
Use the codec dequantization process and the input embeddings.
This class implements a codec embedding layer that relies on a pre-trained codec model. It applies the codec's dequantization process to the quantized input tokens, turning them into continuous features suitable for downstream processing.
quantizer
The quantizer from the pre-trained codec model.
codebook_size
The size of the codebook used in the codec.
codebook_dim
The dimensionality of the codebook.
token_bias
The index of the first codec code.
token_per_frame
The number of tokens per frame in the input.
vocab_size
The size of the input vocabulary.
pos
Positional encoding layer.
ln
Layer normalization layer.
decoder
Decoder from the pre-trained codec model.
- Parameters:
- input_size – Size of the input vocabulary.
- hf_model_tag – HuggingFace model tag for ESPnet codec models.
- token_bias – The index of the first codec code.
- token_per_frame – Number of tokens per frame in the input.
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding class.
- positional_dropout_rate – Dropout rate after adding positional encoding.
- Raises: AssertionError – If input dimensions or lengths are invalid.
######### Examples
>>> codec_embedding = CodecEmbedding(input_size=512)
>>> input_tensor = torch.randint(2, 512, (8, 64))  # (batch_size, seq_len); codec codes start at token_bias
>>> input_lengths = torch.full((8,), 64)  # all lengths divisible by token_per_frame
>>> output, lengths = codec_embedding(input_tensor, input_lengths)
NOTE
The input tensor must have dimensions of (B, T) where B is the batch size and T is the total number of tokens. Additionally, the length of input tensors must be divisible by token_per_frame.
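The length constraint and the role of token_bias can be illustrated with a small standalone sketch (plain Python, no ESPnet dependency; the helper name and the grouping logic are illustrative assumptions, not the class's actual implementation):

```python
def tokens_to_frames(tokens, token_per_frame=8, token_bias=2):
    """Group a flat codec-token sequence into frames and strip the bias.

    tokens: a length-T sequence of vocabulary indices; T must be divisible
    by token_per_frame (this mirrors the assertion in CodecEmbedding).
    Returns a list of frames, each holding token_per_frame codec codes.
    """
    T = len(tokens)
    assert T % token_per_frame == 0, "length must be divisible by token_per_frame"
    codes = [t - token_bias for t in tokens]  # map vocabulary ids to codec codes
    return [codes[i:i + token_per_frame] for i in range(0, T, token_per_frame)]

# 16 tokens with the default settings -> 2 frames of 8 codes each
print(tokens_to_frames(list(range(2, 18))))
```

A 17-token input would trip the assertion, just as an input length not divisible by token_per_frame does in the real frontend.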
Initialize.
- Parameters:
- hf_model_tag – HuggingFace model tag for ESPnet codec models
- token_bias – the index of the first codec code
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
- positional_dropout_rate – dropout rate after adding positional encoding
forward(input: Tensor, input_lengths: Tensor)
Use the codec dequantization process and the input embeddings.
This method dequantizes the input codec tokens with the pre-trained codec model, then applies layer normalization and positional encoding to produce the output features.
hf_model_tag
HuggingFace model tag for ESPnet codec models.
- Type: str
token_bias
The index of the first codec code.
- Type: int
token_per_frame
Number of tokens per frame in the input.
- Type: int
vocab_size
Size of the input vocabulary.
- Type: int
codebook_size
Size of the codec’s codebook.
- Type: int
codebook_dim
Dimension of the codec’s codebook.
- Type: int
pos
Positional encoding layer.
- Type: torch.nn.Module
ln
Layer normalization.
- Type: torch.nn.LayerNorm
decoder
Decoder from the codec model.
- Type: torch.nn.Module
- Parameters:
- input_size – Size of the input vocabulary.
- hf_model_tag (str) – HuggingFace model tag for ESPnet codec models.
- token_bias (int) – The index of the first codec code.
- token_per_frame (int) – Number of tokens per frame in the input.
- pos_enc_class – PositionalEncoding or ScaledPositionalEncoding.
- positional_dropout_rate (float) – Dropout rate after adding positional encoding.
- Returns: A tuple of the dequantized feature tensor and the corresponding output lengths.
- Raises: AssertionError – If the input tensor’s dimensions or values are invalid.
######### Examples
>>> codec_embedding = CodecEmbedding(input_size=400)
>>> input_tensor = torch.arange(2, 18).reshape(2, 8)  # token_per_frame tokens per sequence, starting at token_bias
>>> input_lengths = torch.tensor([8, 8])
>>> output, output_lengths = codec_embedding(input_tensor, input_lengths)
NOTE
The class uses an external model for codec inference and requires that the model be pre-trained and available through the specified HuggingFace model tag.
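For intuition about the dequantization step itself: residual vector quantization, as used by EnCodec-style codecs, reconstructs each frame by summing one codebook entry per quantizer stage. A minimal plain-Python sketch, with made-up codebooks and dimensions (not the actual quantizer loaded from the HuggingFace model):

```python
def dequantize_frame(frame_codes, codebooks):
    """Sum one codebook vector per residual-VQ stage to rebuild a frame.

    frame_codes: one code index per quantizer stage (token_per_frame of them).
    codebooks: codebooks[stage][code] is a vector of codebook_dim floats.
    """
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for stage, code in enumerate(frame_codes):
        # Each stage's entry refines the residual left by earlier stages.
        out = [o + v for o, v in zip(out, codebooks[stage][code])]
    return out

# Two stages with codebook_dim = 2 (toy values for illustration only)
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],    # stage 0
    [[0.5, 0.5], [0.25, 0.25]],  # stage 1
]
print(dequantize_frame([0, 1], codebooks))  # [1.25, 0.25]
```

In CodecEmbedding the analogous lookup-and-sum is performed by the pre-trained quantizer, after which the result is normalized and given positional encodings.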
output_size() → int
Return the size of the output feature dimension D, i.e., the embedding dimension.
This method provides the dimensionality of the output features generated by the embedding layer. This is particularly useful for understanding the size of the data that will be passed to subsequent layers in the neural network.
- Returns: The dimensionality of the output features, which is equal to the embedding dimension.
- Return type: int
######### Examples
>>> embedding = CodecEmbedding(input_size=400)
>>> output_dim = embedding.output_size()
>>> print(output_dim)  # the embedding dimension of the underlying codec model