espnet2.gan_codec.hificodec.hificodec.HiFiCodecGenerator
class espnet2.gan_codec.hificodec.hificodec.HiFiCodecGenerator(sample_rate: int = 16000, hidden_dim: int = 128, resblock_num: str = '1', resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilation_sizes: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], upsample_rates: List[int] = [8, 5, 4, 2], upsample_kernel_sizes: List[int] = [16, 11, 8, 4], upsample_initial_channel: int = 512, quantizer_n_q: int = 8, quantizer_bins: int = 1024, quantizer_decay: float = 0.99, quantizer_kmeans_init: bool = True, quantizer_kmeans_iters: int = 50, quantizer_threshold_ema_dead_code: int = 2, quantizer_target_bandwidth: List[float] = [7.5, 15])
Bases: Module
HiFiCodec generator module.
This class implements the generator for the HiFiCodec model, which processes audio waveforms through encoding and decoding mechanisms. The generator uses a combination of an encoder, a quantizer, and a decoder to achieve high-fidelity audio synthesis.
encoder
The encoder module that extracts features from the input audio.
- Type: Encoder
quantizer
The quantization module that compresses the encoded features.
decoder
The decoder module that reconstructs audio from quantized features.
- Type: Generator
target_bandwidths
List of target bandwidths for quantization.
- Type: List[float]
sample_rate
The sample rate of the audio.
- Type: int
frame_rate
The frame rate derived from the sample rate and upsample rates.
- Type: int
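As a rough illustration of the frame_rate attribute, the sketch below assumes it equals the sample rate divided by the product of upsample_rates, rounded up. The rounding and the variable names are assumptions made for illustration and are not taken from the implementation.

```python
import math

# Default constructor values from the signature above.
sample_rate = 16000
upsample_rates = [8, 5, 4, 2]

# Total upsampling factor: one code frame expands to this many samples.
hop_length = math.prod(upsample_rates)  # 8 * 5 * 4 * 2 = 320

# Assumed definition: code frames per second of audio, rounded up.
frame_rate = math.ceil(sample_rate / hop_length)
print(hop_length, frame_rate)  # 320 50
```

Under these assumptions, the default configuration produces 50 code frames per second of 16 kHz audio.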
Parameters:
- sample_rate (int) – Sample rate of the input audio. Default is 16000.
- hidden_dim (int) – Dimensionality of hidden layers. Default is 128.
- resblock_num (str) – Residual block variant, given as a string rather than a count. Default is "1".
- resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks. Default is [3, 7, 11].
- resblock_dilation_sizes (List[List[int]]) – List of dilation sizes for residual blocks. Default is [[1, 3, 5], [1, 3, 5], [1, 3, 5]].
- upsample_rates (List[int]) – List of upsample rates. Default is [8, 5, 4, 2].
- upsample_kernel_sizes (List[int]) – List of kernel sizes for upsampling. Default is [16, 11, 8, 4].
- upsample_initial_channel (int) – Number of initial channels for the upsampling layer. Default is 512.
- quantizer_n_q (int) – Number of quantizers. Default is 8.
- quantizer_bins (int) – Number of quantization bins. Default is 1024.
- quantizer_decay (float) – Exponential moving average decay for the codebook updates. Default is 0.99.
- quantizer_kmeans_init (bool) – Whether to initialize the codebooks with k-means. Default is True.
- quantizer_kmeans_iters (int) – Number of k-means iterations used for that initialization. Default is 50.
- quantizer_threshold_ema_dead_code (int) – Usage threshold (EMA cluster size) below which a code is treated as dead and replaced. Default is 2.
- quantizer_target_bandwidth (List[float]) – List of target bandwidths for quantization. Default is [7.5, 15].
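To make the interaction between quantizer_bins, quantizer_n_q, and quantizer_target_bandwidth more concrete, here is a minimal sketch of the rule used by EnCodec-style residual quantizers, where a bitrate budget is converted into a number of active codebooks. It assumes the target bandwidths are expressed in kbps, a frame rate of 50 frames per second, and that this quantizer follows the same rule; quantizers_for_bandwidth is a hypothetical helper, not part of ESPnet.

```python
import math


def quantizers_for_bandwidth(bandwidth_kbps, bins, frame_rate, n_q):
    """Hypothetical helper: how many codebooks fit into a bitrate budget."""
    # Each codebook emits one index per frame, costing log2(bins) bits.
    kbps_per_codebook = math.log2(bins) * frame_rate / 1000
    wanted = max(1, math.floor(bandwidth_kbps / kbps_per_codebook))
    # Never use more codebooks than the quantizer actually has.
    return min(wanted, n_q)


# Defaults from the signature above, with an assumed frame rate of 50.
for bandwidth in [7.5, 15]:
    print(bandwidth, quantizers_for_bandwidth(bandwidth, 1024, 50, 8))
```

Under these assumptions each codebook costs 0.5 kbps, so both default bandwidths would saturate at all 8 quantizers, while a smaller budget such as 1.5 kbps would use only 3.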
########### Examples
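>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> generator = HiFiCodecGenerator()
>>> input_audio = torch.randn(1, 1, 16000)  # (B, 1, T)
>>> # Enable the dual decoder so the fourth output is populated.
>>> output = generator(input_audio, use_dual_decoder=True)
>>> print(output[0].shape)  # Resynthesized audio shape
>>> print(output[1].shape)  # Commitment loss shape
>>> print(output[2].shape)  # Quantization loss shape
>>> print(output[3].shape)  # Resynthesized audio from encoder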
Initialize HiFiCodec Generator.
decode(codes: Tensor)
Run decoding.
This method takes input codes generated by the encoder and produces a waveform. It is an essential step in the HiFiCodec pipeline, converting compressed representations back into audio signals.
- Parameters:codes (Tensor) – Input codes (T_code, N_stream). These are the codes produced by encode(), which represent the compressed audio data.
- Returns: Generated waveform (T_wav,). This is the output audio signal reconstructed from the input codes.
- Return type: Tensor
########### Examples
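>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> codec = HiFiCodecGenerator()
>>> audio = torch.randn(1, 1, 16000)  # (B, 1, T), as in the class example
>>> codes = codec.encode(audio)  # Codebook indices, not random floats
>>> waveform = codec.decode(codes)
>>> print(waveform.shape)  # Output shape will depend on the model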
####### NOTE The input codes should be of the shape (T_code, N_stream) where T_code is the length of the code sequence and N_stream is the number of streams used in the codec.
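The relationship between T_code and T_wav can be sketched with plain arithmetic. The example below assumes that each decoder stage multiplies the sequence length by its entry in upsample_rates and ignores any padding or trimming the real layers may apply; the starting length of 10 frames is a hypothetical value.

```python
# Default upsample rates from the constructor signature.
upsample_rates = [8, 5, 4, 2]

# Hypothetical number of code frames fed to decode().
length = 10

# Track the sequence length after each assumed upsampling stage.
lengths = [length]
for rate in upsample_rates:
    length *= rate
    lengths.append(length)

print(lengths)  # [10, 80, 400, 1600, 3200]
```

With the default rates, 10 code frames would therefore correspond to roughly 3200 waveform samples, or 0.2 seconds at 16 kHz.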
encode(x: Tensor, target_bw: float | None = None)
Run encoding.
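- Parameters:
- x (Tensor) – Input audio waveform, in the (B, 1, T_wav) layout used in the class example.
- target_bw (float | None) – Optional target bandwidth for the quantizer. Default is None.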
- Returns: Generated codes (T_code, N_stream).
- Return type: Tensor
########### Examples
>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> model = HiFiCodecGenerator()
>>> input_audio = torch.randn(1, 1, 16000)  # (B, 1, T_wav), as in the class example
>>> encoded_codes = model.encode(input_audio)
>>> print(encoded_codes.shape)  # Output shape: (T_code, N_stream)
####### NOTE The input tensor should follow the (B, 1, T_wav) layout used in the class example, where B is the batch size and T_wav is the number of audio samples.
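To give a sense of how compact the codes are, the following back-of-the-envelope sketch compares one second of 16-bit PCM audio with its encoded form. It assumes N_stream equals quantizer_n_q (8), that each code needs log2(quantizer_bins) bits, a frame rate of 50 frames per second, and 16-bit input samples; none of these figures is confirmed by the implementation.

```python
import math

sample_rate = 16000   # default from the constructor signature
frame_rate = 50       # assumed: 16000 / (8 * 5 * 4 * 2)
n_stream = 8          # assumed equal to quantizer_n_q
bins = 1024           # default quantizer_bins

# Bits needed for one second of raw 16-bit PCM audio.
pcm_bits = sample_rate * 16

# Bits needed for one second of codes: one index per stream per frame.
code_bits = frame_rate * n_stream * math.log2(bins)

ratio = pcm_bits / code_bits
print(pcm_bits, code_bits, ratio)  # 256000 4000.0 64.0
```

Under these assumptions the codes are about 64 times smaller than the raw waveform.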
forward(x: Tensor, use_dual_decoder: bool = False)
Perform generator forward.
This method runs the input waveform through the encoder, the quantizer, and the decoder, and returns the resynthesized audio together with the quantizer losses.
- Parameters:
- x (Tensor) – Audio waveform tensor of shape (B, 1, T_wav).
- use_dual_decoder (bool) – Whether to also resynthesize audio directly from the encoder output. Default is False.
- Returns:
- Resynthesized audio (Tensor).
- Commitment loss (Tensor).
- Quantization loss (Tensor).
- Resynthesized audio from the encoder (Tensor), requested with use_dual_decoder.
- Return type: Tuple
########### Examples
>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> model = HiFiCodecGenerator()
>>> audio_input = torch.randn(8, 1, 16000)  # Batch of 8, 1 second of audio at 16 kHz
>>> resyn, commit_loss, quant_loss, resyn_real = model(audio_input, use_dual_decoder=True)
>>> print(resyn.shape)  # Resynthesized audio shape
####### NOTE This forward pass covers only the generator path (encoder, quantizer, decoder). Per the signature above, it takes no forward_generator flag and does not dispatch to a discriminator.