espnet2.gan_codec.hificodec.module.GroupResidualVectorQuantization
class espnet2.gan_codec.hificodec.module.GroupResidualVectorQuantization(quantizer_target_bandwidth, hidden_dim, quantizer_n_q, quantizer_bins, quantizer_decay, quantizer_kmeans_init, quantizer_kmeans_iters, quantizer_threshold_ema_dead_code, **kwargs)
Bases: Module
Group Residual Vector Quantization for audio codec.
This class implements a group residual vector quantization scheme for encoding and decoding audio signals. The input tensor is split into two halves, and each half is quantized independently by its own residual vector quantizer.
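As a rough illustration of the grouping idea, the sketch below splits a feature tensor into two halves, quantizes each half independently, and concatenates the results. The `nearest_codeword` helper, the random codebooks, the single-stage quantizer, and the choice of the channel axis for the split are all assumptions made for this example; they stand in for the two residual vector quantizers and are not taken from the ESPnet implementation.

```python
import torch


def nearest_codeword(x, codebook):
    """Toy single-stage quantizer (hypothetical helper, not ESPnet code).

    x: (B, D, T) features, codebook: (K, D) codewords.
    Returns the quantized tensor (B, D, T) and the chosen indices (B, T).
    """
    frames = x.transpose(1, 2)  # (B, T, D)
    # Squared Euclidean distance from every frame to every codeword: (B, T, K)
    dist = (frames.unsqueeze(2) - codebook).pow(2).sum(dim=-1)
    idx = dist.argmin(dim=-1)  # (B, T)
    quantized = codebook[idx].transpose(1, 2)  # (B, D, T)
    return quantized, idx


torch.manual_seed(0)
B, D, T, K = 2, 8, 5, 16  # small illustrative sizes
x = torch.randn(B, D, T)

# Assumption: the split is along the channel axis, D // 2 channels per group.
x0, x1 = torch.chunk(x, 2, dim=1)

# One independent codebook per group.
codebook0 = torch.randn(K, D // 2)
codebook1 = torch.randn(K, D // 2)
q0, idx0 = nearest_codeword(x0, codebook0)
q1, idx1 = nearest_codeword(x1, codebook1)

# Re-assemble the quantized halves and keep the codes of both groups.
quantized = torch.cat([q0, q1], dim=1)  # (B, D, T)
codes = torch.stack([idx0, idx1], dim=0)  # (2, B, T)
print(quantized.shape, codes.shape)
```

Because the two groups never share a codebook, each quantizer only has to model half of the channels.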
quantizer1
First residual vector quantizer instance.
quantizer0
Second residual vector quantizer instance.
l1_quantization_loss
L1 loss function for quantization loss calculation.
- Type: torch.nn.L1Loss
l2_quantization_loss
L2 loss function for quantization loss calculation.
- Type: torch.nn.MSELoss
target_bandwidths
Target bandwidths for quantization.
- Type: List[float]
Parameters:
- quantizer_target_bandwidth (List[float]) – List of target bandwidths.
- hidden_dim (int) – Dimension of the hidden states.
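- quantizer_n_q (int) – Number of residual quantizers (codebooks) in each residual vector quantizer.
- quantizer_bins (int) – Codebook size, i.e. the number of entries per codebook.
- quantizer_decay (float) – Decay factor for the exponential moving average update of the codebooks.
- quantizer_kmeans_init (bool) – Whether to initialize the codebooks with k-means.
- quantizer_kmeans_iters (int) – Number of k-means iterations used for initialization.
- quantizer_threshold_ema_dead_code (float) – Threshold on the exponential moving average of code usage below which a code is treated as dead and replaced.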
Returns: A named tuple containing the quantized tensor, codes, bandwidth used, and penalty (if any).
Return type: QuantizedResult
########### Examples
>>> import torch
>>> from espnet2.gan_codec.hificodec.module import GroupResidualVectorQuantization
>>> # Example usage of GroupResidualVectorQuantization
>>> quantizer = GroupResidualVectorQuantization(
...     quantizer_target_bandwidth=[64.0], hidden_dim=512,
...     quantizer_n_q=256, quantizer_bins=256, quantizer_decay=0.99,
...     quantizer_kmeans_init=True, quantizer_kmeans_iters=10,
...     quantizer_threshold_ema_dead_code=0.1,
... )
>>> # Encoding
>>> input_tensor = torch.randn(8, 512, 16000)  # Batch of 8 inputs, 512 channels, 16000 steps
>>> encoded = quantizer.encode(input_tensor, frame_rate=16000)
>>> # Decoding
>>> decoded = quantizer.decode(encoded)
######## NOTE The forward method requires input tensor xin with shape [B, T, D], where B is batch size, T is the sequence length, and D is the feature dimension.
- Raises: ValueError – If the input tensor dimensions are not as expected.
Initialize internal Module state, shared by both nn.Module and ScriptModule.
decode(code: Tensor)
HiFi-Codec decoding.
This method takes neural codec representations and converts them back into resynthesized audio signals. The input codes are split into two parts, which are processed by two separate quantizers to reconstruct the original audio waveform.
- Parameters: code (torch.Tensor) – Neural codecs in shape (B, N), where B is the batch size and N is the number of codes.
- Returns: Resynthesized audio of shape (B, T, D), where T is the length of the audio signal and D is the number of channels.
- Return type: torch.Tensor
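The decoding step described above can be pictured as follows: the codes are split into one part per quantizer, each part is turned back into features by summing the codewords selected at every residual stage, and the two halves are concatenated. This is only a sketch under stated assumptions: the `rvq_decode` helper, the random codebooks, and the `(2 * n_q, B, T)` code layout are hypothetical and chosen for illustration, not read from the ESPnet source.

```python
import torch
import torch.nn.functional as F


def rvq_decode(codes, codebooks):
    """Sum the codewords selected at each residual stage (hypothetical helper).

    codes: (n_q, B, T) integer indices, codebooks: (n_q, K, D).
    Returns features of shape (B, D, T).
    """
    out = 0
    for stage_codes, codebook in zip(codes, codebooks):
        # Embedding lookup requires integer indices: (B, T) -> (B, T, D)
        out = out + F.embedding(stage_codes, codebook)
    return out.transpose(1, 2)


torch.manual_seed(0)
n_q, K, D, B, T = 3, 16, 4, 2, 5  # small illustrative sizes
books0 = torch.randn(n_q, K, D)  # codebooks of the first group
books1 = torch.randn(n_q, K, D)  # codebooks of the second group

# Assumed layout: the codes of both groups stacked along the first axis.
codes = torch.randint(0, K, (2 * n_q, B, T))
codes0, codes1 = torch.chunk(codes, 2, dim=0)

# Decode each group and concatenate along the channel axis.
features = torch.cat(
    [rvq_decode(codes0, books0), rvq_decode(codes1, books1)], dim=1
)  # (B, 2 * D, T)
print(features.shape)
```

The lookup only works with integer indices, which is why codes should come from an encoder rather than from a floating-point random tensor.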
########### Examples
>>> import torch
>>> codec = GroupResidualVectorQuantization([128.0], 512, 256, 256, 0.99, True, 10, 0.1)
>>> code = torch.randint(0, 256, (4, 256))  # Example input: integer code indices
>>> audio = codec.decode(code)
>>> print(audio.shape) # Output shape will be (4, T, D)
######## NOTE code is split into two parts, one per quantizer, so it must contain the codes of both quantizers in the layout that encode returns.
encode(xin: Tensor, frame_rate: int, target_bw: float | None = None)
Encode the input tensor using HiFi-Codec.
This method performs encoding on the input tensor xin by splitting it into two parts, quantizing each part, and concatenating the resulting codes. The quantization is performed based on the specified frame rate and target bandwidth.
- Parameters:
- xin (torch.Tensor) – Input tensor of shape (B, 1, T) where B is the batch size and T is the sequence length.
- frame_rate (int) – Frame rate to be used during encoding.
- target_bw (Optional[float]) – Target bandwidth for quantization. If None, the last value from self.target_bandwidths is used.
- Returns: Concatenated neural codes from the quantization of both parts of the input tensor.
- Return type: torch.Tensor
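To make the interplay between frame_rate and target_bw more concrete, the sketch below follows the convention commonly used by EnCodec-style residual vector quantizers, where each codebook contributes log2(bins) bits per frame and the target bandwidth decides how many codebooks are used. Whether this class applies exactly this formula is an assumption, and both function names are hypothetical.

```python
import math


def bandwidth_per_quantizer(bins: int, frame_rate: int) -> float:
    """Bandwidth of a single codebook in kb/s (log2(bins) bits per frame)."""
    return math.log2(bins) * frame_rate / 1000


def num_quantizers_for_bandwidth(bins, frame_rate, n_q, target_bw=None):
    """Number of codebooks that fit into target_bw (kb/s).

    Falls back to all n_q codebooks when no positive target is given.
    The cap at n_q is a choice made for this sketch.
    """
    if not target_bw or target_bw <= 0.0:
        return n_q
    bw_per_q = bandwidth_per_quantizer(bins, frame_rate)
    return min(n_q, int(max(1, math.floor(target_bw / bw_per_q))))


# 1024-entry codebooks at 75 frames per second: 10 bits * 75 = 0.75 kb/s each.
print(bandwidth_per_quantizer(1024, 75))  # 0.75
print(num_quantizers_for_bandwidth(1024, 75, n_q=32, target_bw=6.0))  # 8
print(num_quantizers_for_bandwidth(1024, 75, n_q=32))  # 32
```

Under this convention a lower target bandwidth simply drops the later residual stages, trading reconstruction detail for bitrate.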
########### Examples
>>> encoder = GroupResidualVectorQuantization(...)
>>> input_tensor = torch.randn(4, 1, 1024) # Batch of 4, 1 channel
>>> encoded_codes = encoder.encode(input_tensor, frame_rate=16000)
######## NOTE The input tensor is expected to have a shape of (B, 1, T) and will be split into two equal parts for processing.
forward(xin: Tensor, sample_rate: int, bandwidth: float | None = None) → QuantizedResult
Forward pass for the GroupResidualVectorQuantization model.
This method takes an input tensor, applies quantization using two residual vector quantizers, and computes the associated quantization losses. The input tensor is expected to have shape (B, T, D) where:
- B: Batch size
- T: Number of time steps
- D: Number of dimensions (features)
- Parameters:
- xin (torch.Tensor) – Input tensor of shape (B, T, D).
- sample_rate (int) – Sample rate of the input signal.
- bandwidth (Optional[float]) – Desired bandwidth for quantization. If not provided, the default target bandwidth is used.
- Returns: A named tuple containing the following fields:
  - quantized (torch.Tensor): The quantized output tensor.
  - codes (torch.Tensor): The codes generated by the quantizer.
  - bandwidth (torch.Tensor): The bandwidth in kb/s used per batch item.
  - penalty (Optional[torch.Tensor]): Optional penalty for quantization.
- Return type: QuantizedResult
########### Examples
>>> model = GroupResidualVectorQuantization(...)
>>> input_tensor = torch.randn(4, 1024, 512) # Example input
>>> result = model(input_tensor, sample_rate=22050)  # Calls forward via nn.Module.__call__
>>> print(result.quantized.shape) # Output shape of quantized tensor
######## NOTE The quantization loss combines L1 and L2 losses computed for both quantizers.
- Raises: ValueError – If the input tensor shape is not compatible.
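One way a combined L1 and L2 quantization loss over two groups could be assembled is sketched below, using the same loss modules that the class exposes as attributes. The equal weighting of the terms, the plain sum over the two groups, and the noisy stand-ins for the quantizer outputs are assumptions made for illustration; the actual weighting in ESPnet may differ.

```python
import torch

l1_quantization_loss = torch.nn.L1Loss(reduction="mean")
l2_quantization_loss = torch.nn.MSELoss(reduction="mean")


def group_loss(x, q):
    """L1 plus L2 distance between the input half x and its quantized version q."""
    return l1_quantization_loss(q, x) + l2_quantization_loss(q, x)


torch.manual_seed(0)
# Two input halves and hypothetical stand-ins for their quantized versions.
x0, x1 = torch.randn(2, 4, 5), torch.randn(2, 4, 5)
q0 = x0 + 0.1 * torch.randn_like(x0)
q1 = x1 + 0.1 * torch.randn_like(x1)

# Assumption: both groups contribute with equal weight.
loss = group_loss(x0, q0) + group_loss(x1, q1)
print(loss.item())
```

The L1 term penalizes all deviations linearly, while the L2 term puts more weight on large deviations.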