espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer
class espnet2.speechlm.tokenizer.codec_tokenizer.CodecTokenizer(codec_choice: str, codec_fs: int, device: str = 'cpu', dump_audio: bool = False, checkpoint_path: str | None = None, config_path: str | None = None, max_token_per_frame: int = 32)
Bases: AbsTokenizer
CodecTokenizer is a tokenizer implementation for various audio codecs.
This class provides methods for encoding and decoding audio waveforms using different codec implementations. It supports both discrete and continuous tokenization, allowing for flexible audio processing in speech language models.
Use cases:
- Use encode and decode for discrete (de)tokenization.
- Use encode_continuous and decode_continuous for continuous (de)tokenization.
- Use forward and detokenize for discrete (de)tokenization in a flattened sequence style, which is more convenient for speechlm tasks (see the sketch after this list).
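As a minimal sketch of the flattened-sequence workflow (the checkpoint and config paths below are placeholders, and the tensor sizes are illustrative assumptions):

```python
import torch

from espnet2.speechlm.tokenizer.codec_tokenizer import CodecTokenizer

# Placeholder paths; substitute a real ESPnet codec checkpoint and config.
codec = CodecTokenizer(
    codec_choice="ESPnet",
    codec_fs=16000,
    device="cpu",
    checkpoint_path="path/to/checkpoint.pth",
    config_path="path/to/config.yaml",
)

wavs = torch.randn(2, 1, 16000)   # [B, 1, n_sample]
codes, _ = codec.forward(wavs)    # [B, T * n_codebook], flattened codes
audio = codec.detokenize(codes)   # [B, n_sample'], resynthesized audio
```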
codec_choice
The chosen codec implementation.
- Type: str
device
The device for model computation (e.g., “cpu” or “cuda”).
- Type: str
dump_audio
Flag to indicate whether to dump the audio during processing.
- Type: bool
n_codebook
The number of codec codebooks.
- Type: int
size_codebook
The dimension of codebooks.
- Type: int
sample_rate
The sample rate the model was trained on.
- Type: int
subsample
The subsample rate, a.k.a., frame shift.
- Type: int
Parameters:
- codec_choice (str) – The codec implementation to use. Options include “ESPnet”, “DAC”, “EnCodec”, and “inhouse”.
- codec_fs (int) – The sample rate for the codec.
- device (str , optional) – The device to run the model on. Defaults to “cpu”.
- dump_audio (bool , optional) – Whether to dump the audio during processing. Defaults to False.
- checkpoint_path (str , optional) – Path to the model checkpoint file. Defaults to None.
- config_path (str , optional) – Path to the model configuration file. Defaults to None.
- max_token_per_frame (int , optional) – Maximum number of tokens per frame. Defaults to 32.
Raises:
- ValueError – If an unsupported codec choice is provided.
- ImportError – If the required codec library is not installed.
Examples
To initialize the CodecTokenizer and encode/decode audio waveforms:
```python
import torch

device = "cuda:0"
codec = CodecTokenizer(
    codec_choice="ESPnet",
    codec_fs=16000,
    device=device,
    dump_audio=True,
    checkpoint_path="path/to/checkpoint.pth",
    config_path="path/to/config.yaml",
)

# Encode audio
waveform = torch.randn(1, 1, 16000)  # Example waveform
codes = codec.encode(waveform)

# Decode audio
reconstructed_waveform = codec.decode(codes)
```
NOTE
The encode and decode methods are designed to work with audio tensors in specific shapes. Ensure the input tensors are formatted correctly.
Codec Tokenizer initialization
Each codec implementation should assign all of the following attributes:
- self.n_codebook (int): the number of codec codebooks.
- self.size_codebook (int): the dimension of the codebooks.
- self.sample_rate (int): the sample rate the model was trained on.
- self.subsample (int): the subsample rate, a.k.a. frame shift.
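As a rough illustration of how these attributes relate to the tensor shapes used by the methods below (all values here are hypothetical, chosen only for the arithmetic):

```python
# Hypothetical attribute values for illustration only.
n_codebook = 8        # number of codec codebooks
size_codebook = 1024  # entries per codebook
sample_rate = 16000   # Hz
subsample = 320       # samples per frame (frame shift)

n_sample = 16000                   # one second of audio
n_frames = n_sample // subsample   # T = 50 frames
flat_len = n_frames * n_codebook   # flattened code length = 400
print(n_frames, flat_len)
```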
decode(codes)
Recover the waveform from the codes.
- Parameters: codes (torch.Tensor) – Int tensor in shape [B, T, n_codebook].
- Returns: float tensor in shape [B, n_sample].
- Return type: waveform (torch.Tensor)
- Raises: NotImplementedError – If the codec_choice is not supported.
Examples
>>> tokenizer = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> codes = torch.randint(0, 256, (2, 10, 8)) # Example codes
>>> waveform = tokenizer.decode(codes)
>>> print(waveform.shape) # Output shape: [2, n_sample]
decode_continuous(z)
Recover the waveform from the continuous representations of codec.
This method takes continuous representations (also known as latent variables) produced by the codec and reconstructs the audio waveform. It is particularly useful for processing audio data in a continuous form rather than discrete tokens.
- Parameters: z (torch.Tensor) – Float tensor in shape [B, T, D], where B is the batch size, T is the number of time steps, and D is the dimension of the codec continuous representations.
- Returns: Float tensor in shape [B, n_sample], representing the reconstructed audio waveform.
- Return type: waveform (torch.Tensor)
- Raises: NotImplementedError – If the codec choice is not supported.
Examples
>>> # Assuming 'codec' is an instance of CodecTokenizer
>>> z = torch.randn(2, 100, 512) # Example continuous representations
>>> waveform = codec.decode_continuous(z)
>>> print(waveform.shape)
torch.Size([2, n_sample]) # n_sample will depend on the codec used
detokenize(codes, n_codebook=None)
Convert flattened codec codes back into resynthesized audio.
- Parameters:
- codes (torch.Tensor) – int tensor in shape [B, T * n_codebook], or [T * n_codebook]. The flattened codec codes to be converted back into audio.
- n_codebook (int , optional) – The number of codebooks used for encoding. If not provided, the default number of codebooks from the instance will be used.
- Returns: Float tensor in shape [B, n_sample] or [n_sample]. The resynthesized audio waveform from the provided codec codes.
- Return type: waveform (torch.Tensor)
- Raises: AssertionError – If the total number of tokens is not divisible by the number of codebooks.
Examples
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> flatten_codes = torch.randint(0, 256, (1, 32)) # Example codes
>>> audio_waveform = codec.detokenize(flatten_codes)
>>> print(audio_waveform.shape)
torch.Size([1, n_sample])
encode(wavs)
Convert audio waveforms into codec codes.
- Parameters: wavs (torch.Tensor) – A float tensor of shape [B, 1, n_sample], where B is the batch size and n_sample is the number of audio samples.
- Returns: An integer tensor of shape [B, T, n_codebook], where T is the number of time frames produced by the encoding.
- Return type: torch.Tensor
- Raises: AssertionError – If the input tensor does not have 3 dimensions or if the second dimension is not equal to 1.
Examples
>>> import torch
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> wavs = torch.randn(2, 1, 32000) # Example batch of audio
>>> codes = codec.encode(wavs)
>>> print(codes.shape)
torch.Size([2, T, n_codebook])
encode_continuous(wavs)
Convert audio waveforms into continuous codec encoding results.
This method processes the input audio waveforms and converts them into continuous codec representations. The shape of the input tensor should be [B, 1, n_sample], where B is the batch size, and n_sample is the number of samples in the audio waveform. The output tensor will have the shape [B, T, D], where T is the number of time frames and D is the dimensionality of the continuous representation.
- Parameters: wavs (torch.Tensor) – A float tensor of shape [B, 1, n_sample] representing the audio waveforms to be encoded.
- Returns: A float tensor of shape [B, T, D] containing the continuous codec encoding results.
- Return type: torch.Tensor
- Raises: NotImplementedError – If the codec choice is not supported.
Examples
>>> import torch
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> wavs = torch.randn(2, 1, 32000) # Example input
>>> continuous_encoding = codec.encode_continuous(wavs)
>>> print(continuous_encoding.shape)
torch.Size([2, T, D]) # T and D depend on the codec implementation
forward(wavs)
Convert audio waveforms into flatten codec codes and resynthesize the audio.
This method processes input audio waveforms to generate a flattened representation of codec codes while optionally resynthesizing the audio. It combines encoding and decoding in a single step, which is particularly useful for speech-related tasks.
- Parameters: wavs (torch.Tensor) – Float tensor in shape [B, 1, n_sample], where B is the batch size and n_sample is the number of audio samples.
- Returns:
- codes (torch.Tensor): Int tensor in shape [B, T * n_codebook], representing the flattened codec codes.
- resyn_audio (torch.Tensor or None): Float tensor in shape [B, n_samples] if self.dump_audio is True, containing the resynthesized audio waveforms; otherwise, it returns None.
- Return type: Tuple[torch.Tensor, Optional[torch.Tensor]]
Examples
>>> codec = CodecTokenizer(codec_choice="ESPnet", codec_fs=16000)
>>> wavs = torch.randn(2, 1, 16000) # Example input tensor
>>> codes, resyn_audio = codec.forward(wavs)
>>> print(codes.shape) # Should print shape [2, T * n_codebook]
>>> print(resyn_audio.shape) # Shape depends on the decoding
NOTE
The method modifies the input codes by adding a shift based on the number of codebooks and their sizes before flattening them.
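As a rough sketch of this shift-and-flatten scheme (the offsets and layout below are assumptions for illustration, not the exact implementation):

```python
import torch

# Toy sizes; real values come from the codec configuration.
B, T, n_codebook, size_codebook = 2, 4, 3, 256
codes = torch.randint(0, size_codebook, (B, T, n_codebook))

# Shift each codebook's entries into its own index range so the flattened
# stream records which codebook every token came from.
shift = torch.arange(n_codebook) * size_codebook  # [0, 256, 512]
flat = (codes + shift).view(B, T * n_codebook)    # [B, T * n_codebook]

# Undo the shift when unflattening.
restored = flat.view(B, T, n_codebook) - shift
assert torch.equal(restored, codes)
```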