espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer
class espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer(codec_choice: str, codec_fs: int, device: str = 'cpu', dump_audio: bool = False, checkpoint_path: str | None = None, config_path: str | None = None, max_token_per_frame: int = 32)
Bases: AbsTokenizer
CodecTokenizer is a tokenizer implementation for various audio codecs.
This class provides methods for encoding and decoding audio waveforms using different codec implementations. It supports both discrete and continuous tokenization, allowing for flexible audio processing in speech language models.
Use cases:
- Use encode and decode for discrete (de)tokenization.
- Use encode_continuous and decode_continuous for continuous (de)tokenization.
- Use forward and detokenize for discrete (de)tokenization in a flattened sequence style, which is more convenient for speechlm tasks (see the sketch after this list).
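As a minimal sketch of the flattened-sequence workflow (the checkpoint and config paths below are placeholders, and the tensor sizes are illustrative assumptions):

```python
import torch

from espnet2.speechlm.tokenizer.codec_tokenizer import CodecTokenizer

# Placeholder paths; substitute a real ESPnet codec checkpoint and config.
codec = CodecTokenizer(
    codec_choice="ESPnet",
    codec_fs=16000,
    device="cpu",
    checkpoint_path="path/to/checkpoint.pth",
    config_path="path/to/config.yaml",
)

wavs = torch.randn(2, 1, 16000)   # [B, 1, n_sample]
codes, _ = codec.forward(wavs)    # [B, T * n_codebook], flattened codes
audio = codec.detokenize(codes)   # [B, n_sample'], resynthesized audio
```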
codec_choice
The chosen codec implementation.
- Type: str
device
The device for model computation (e.g., “cpu” or “cuda”).
- Type: str
dump_audio
Flag to indicate whether to dump the audio during processing.
- Type: bool
n_codebook
The number of codec codebooks.
- Type: int
size_codebook
The dimension of codebooks.
- Type: int
sample_rate
The sample rate the model was trained on.
- Type: int
subsample
The subsample rate, a.k.a., frame shift.
- Type: int
Parameters:
- codec_choice (str) – The codec implementation to use. Options include “ESPnet”, “DAC”, “EnCodec”, and “inhouse”.
- codec_fs (int) – The sample rate for the codec.
- device (str , optional) – The device to run the model on. Defaults to “cpu”.
- dump_audio (bool , optional) – Whether to dump the audio during processing. Defaults to False.
- checkpoint_path (str , optional) – Path to the model checkpoint file. Defaults to None.
- config_path (str , optional) – Path to the model configuration file. Defaults to None.
- max_token_per_frame (int , optional) – Maximum number of tokens per frame. Defaults to 32.
Raises:
- ValueError – If an unsupported codec choice is provided.
- ImportError – If the required codec library is not installed.
Examples
To initialize the CodecTokenizer and encode/decode audio waveforms:
```python
import torch

device = "cuda:0"
codec = CodecTokenizer(
    codec_choice="ESPnet",
    codec_fs=16000,
    device=device,
    dump_audio=True,
    checkpoint_path="path/to/checkpoint.pth",
    config_path="path/to/config.yaml",
)

# Encode audio
waveform = torch.randn(1, 1, 16000)  # Example waveform
codes = codec.encode(waveform)

# Decode audio
reconstructed_waveform = codec.decode(codes)
```
NOTE
The encode and decode methods are designed to work with audio tensors in specific shapes. Ensure the input tensors are formatted correctly.
Codec Tokenizer initialization
Each codec implementation should assign all of the following attributes:
- self.n_codebook (int): the number of codec codebooks.
- self.size_codebook (int): the dimension of the codebooks.
- self.sample_rate (int): the sample rate the model was trained on.
- self.subsample (int): the subsample rate, a.k.a. frame shift.
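As a rough illustration of how these attributes relate to the tensor shapes used by the methods below (all values here are hypothetical, chosen only for the arithmetic):

```python
# Hypothetical attribute values for illustration only.
n_codebook = 8        # number of codec codebooks
size_codebook = 1024  # entries per codebook
sample_rate = 16000   # Hz
subsample = 320       # samples per frame (frame shift)

n_sample = 16000                   # one second of audio
n_frames = n_sample // subsample   # T = 50 frames
flat_len = n_frames * n_codebook   # flattened code length = 400
print(n_frames, flat_len)
```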
decode(codes)
Recover the waveform from the codes.
- Parameters: codes (torch.Tensor) – Int tensor in shape [B, T, n_codebook].
- Returns: float tensor in shape [B, n_sample].
- Return type: waveform (torch.Tensor)
- Raises: NotImplementedError – If the codec_choice is not supported.
Examples
>>> tokenizer = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> codes = torch.randint(0, 256, (2, 10, 8)) # Example codes
>>> waveform = tokenizer.decode(codes)
>>> print(waveform.shape) # Output shape: [2, n_sample]
decode_continuous(z)
Recover the waveform from the continuous representations of codec.
This method takes continuous representations (also known as latent variables) produced by the codec and reconstructs the audio waveform. It is particularly useful for processing audio data in a continuous form rather than discrete tokens.
- Parameters: z (torch.Tensor) – Float tensor in shape [B, T, D], where B is the batch size, T is the number of time steps, and D is the dimension of the codec continuous representations.
- Returns: Float tensor in shape [B, n_sample], representing the reconstructed audio waveform.
- Return type: waveform (torch.Tensor)
- Raises: NotImplementedError – If the codec choice is not supported.
Examples
>>> # Assuming 'codec' is an instance of CodecTokenizer
>>> z = torch.randn(2, 100, 512) # Example continuous representations
>>> waveform = codec.decode_continuous(z)
>>> print(waveform.shape)
torch.Size([2, n_sample]) # n_sample will depend on the codec used
detokenize(codes, n_codebook=None)
Convert flattened codec codes back into resynthesized audio.
- Parameters:
- codes (torch.Tensor) – int tensor in shape [B, T * n_codebook], or [T * n_codebook]. The flattened codec codes to be converted back into audio.
- n_codebook (int , optional) – The number of codebooks used for encoding. If not provided, the default number of codebooks from the instance will be used.
- Returns: Float tensor in shape [B, n_sample] or [n_sample]. The resynthesized audio waveform from the provided codec codes.
- Return type: waveform (torch.Tensor)
- Raises: AssertionError – If the total number of tokens is not divisible by the number of codebooks.
Examples
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> flatten_codes = torch.randint(0, 256, (1, 32)) # Example codes
>>> audio_waveform = codec.detokenize(flatten_codes)
>>> print(audio_waveform.shape)
torch.Size([1, n_sample])
encode(wavs)
Convert audio waveforms into codec codes.
- Parameters: wavs (torch.Tensor) – A float tensor of shape [B, 1, n_sample], where B is the batch size and n_sample is the number of audio samples.
- Returns: An integer tensor of shape [B, T, n_codebook], where T is the number of time frames produced by the encoding.
- Return type: torch.Tensor
- Raises: AssertionError – If the input tensor does not have 3 dimensions or if the second dimension is not equal to 1.
Examples
>>> import torch
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> wavs = torch.randn(2, 1, 32000) # Example batch of audio
>>> codes = codec.encode(wavs)
>>> print(codes.shape)
torch.Size([2, T, n_codebook])
encode_continuous(wavs)
Convert audio waveforms into continuous codec encoding results.
This method processes the input audio waveforms and converts them into continuous codec representations. The shape of the input tensor should be [B, 1, n_sample], where B is the batch size, and n_sample is the number of samples in the audio waveform. The output tensor will have the shape [B, T, D], where T is the number of time frames and D is the dimensionality of the continuous representation.
- Parameters: wavs (torch.Tensor) – A float tensor of shape [B, 1, n_sample] representing the audio waveforms to be encoded.
- Returns: A float tensor of shape [B, T, D] containing the continuous codec encoding results.
- Return type: torch.Tensor
- Raises: NotImplementedError – If the codec choice is not supported.
Examples
>>> import torch
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> wavs = torch.randn(2, 1, 32000) # Example input
>>> continuous_encoding = codec.encode_continuous(wavs)
>>> print(continuous_encoding.shape)
torch.Size([2, T, D]) # T and D depend on the codec implementation
forward(wavs)
Convert audio waveforms into flatten codec codes and resynthesize the audio.
This method processes input audio waveforms to generate a flattened representation of codec codes while optionally resynthesizing the audio. It combines encoding and decoding in a single step, which is particularly useful for speech-related tasks.
- Parameters: wavs (torch.Tensor) – Float tensor in shape [B, 1, n_sample], where B is the batch size and n_sample is the number of audio samples.
- Returns:
- codes (torch.Tensor): Int tensor in shape [B, T * n_codebook], representing the flattened codec codes.
- resyn_audio (torch.Tensor or None): Float tensor in shape [B, n_samples] if self.dump_audio is True, containing the resynthesized audio waveforms; otherwise, it returns None.
- Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
Examples
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> wavs = torch.randn(2, 1, 16000) # Example input tensor
>>> codes, resyn_audio = codec.forward(wavs)
>>> print(codes.shape) # Should print shape [2, T * n_codebook]
>>> print(resyn_audio.shape) # Shape depends on the decoding
NOTE
The method modifies the input codes by adding a shift based on the number of codebooks and their sizes before flattening them.
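As a rough sketch of this shift-and-flatten scheme (the offsets and layout below are assumptions for illustration, not the exact implementation):

```python
import torch

# Toy sizes; real values come from the codec configuration.
B, T, n_codebook, size_codebook = 2, 4, 3, 256
codes = torch.randint(0, size_codebook, (B, T, n_codebook))

# Shift each codebook's entries into its own index range so the flattened
# stream records which codebook every token came from.
shift = torch.arange(n_codebook) * size_codebook  # [0, 256, 512]
flat = (codes + shift).view(B, T * n_codebook)    # [B, T * n_codebook]

# Undo the shift when unflattening.
restored = flat.view(B, T, n_codebook) - shift
assert torch.equal(restored, codes)
```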