espnet2.gan_codec.hificodec.hificodec.HiFiCodecGenerator

class espnet2.gan_codec.hificodec.hificodec.HiFiCodecGenerator(sample_rate: int = 16000, hidden_dim: int = 128, resblock_num: str = '1', resblock_kernel_sizes: List[int] = [3, 7, 11], resblock_dilation_sizes: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], upsample_rates: List[int] = [8, 5, 4, 2], upsample_kernel_sizes: List[int] = [16, 11, 8, 4], upsample_initial_channel: int = 512, quantizer_n_q: int = 8, quantizer_bins: int = 1024, quantizer_decay: float = 0.99, quantizer_kmeans_init: bool = True, quantizer_kmeans_iters: int = 50, quantizer_threshold_ema_dead_code: int = 2, quantizer_target_bandwidth: List[float] = [7.5, 15])

Bases: Module

HiFiCodec generator module.

This class implements the generator for the HiFiCodec model. It encodes an audio waveform into features, quantizes those features, and decodes the quantized features back into a waveform.

encoder

The encoder module that extracts features from the input audio.

Type: Encoder

quantizer

The quantization module that compresses the encoded features.

Type: GroupResidualVectorQuantization

decoder

The decoder module that reconstructs audio from quantized features.

Type: Generator

target_bandwidths

List of target bandwidths for quantization.

Type: List[float]

sample_rate

The sample rate of the audio.

Type: int

frame_rate

The frame rate derived from the sample rate and upsample rates.

Type: int
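
The relation between sample_rate, upsample_rates, and frame_rate can be sketched as follows. This is a minimal illustration, assuming one frame per hop of prod(upsample_rates) samples and rounding up; the helper name frame_rate_from is hypothetical and not part of ESPnet.

```python
import math
from typing import Sequence


def frame_rate_from(sample_rate: int, upsample_rates: Sequence[int]) -> int:
    """Frames per second, assuming a hop of prod(upsample_rates) samples."""
    hop_length = math.prod(upsample_rates)  # samples covered by one frame
    return math.ceil(sample_rate / hop_length)  # rounding up is an assumption


# Defaults taken from the class signature: 16000 / (8 * 5 * 4 * 2) = 16000 / 320
default_frame_rate = frame_rate_from(16000, [8, 5, 4, 2])
print(default_frame_rate)  # 50
```

With the default arguments, one second of 16 kHz audio would therefore map to 50 frames.
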
Parameters:
- sample_rate (int) – Sample rate of the input audio. Default is 16000.
- hidden_dim (int) – Dimensionality of hidden layers. Default is 128.
- resblock_num (str) – Residual block variant, given as a string. Default is '1'.
- resblock_kernel_sizes (List[int]) – List of kernel sizes for residual blocks. Default is [3, 7, 11].
- resblock_dilation_sizes (List[List[int]]) – List of dilation sizes for residual blocks. Default is [[1, 3, 5], [1, 3, 5], [1, 3, 5]].
- upsample_rates (List[int]) – List of upsample rates. Default is [8, 5, 4, 2].
- upsample_kernel_sizes (List[int]) – List of kernel sizes for upsampling. Default is [16, 11, 8, 4].
- upsample_initial_channel (int) – Number of initial channels for the upsampling layer. Default is 512.
- quantizer_n_q (int) – Number of quantizers. Default is 8.
- quantizer_bins (int) – Number of quantization bins. Default is 1024.
- quantizer_decay (float) – Decay rate of the exponential moving average used to update the codebooks. Default is 0.99.
- quantizer_kmeans_init (bool) – Whether to initialize with k-means. Default is True.
- quantizer_kmeans_iters (int) – Number of iterations for k-means. Default is 50.
- quantizer_threshold_ema_dead_code (int) – Threshold on a code's exponential-moving-average usage below which the code is treated as dead. Default is 2.
- quantizer_target_bandwidth (List[float]) – List of target bandwidths for quantization. Default is [7.5, 15].
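
How the quantizer parameters relate to a bitrate can be sketched with plain arithmetic. The sketch below rests on several assumptions that are not stated in this page: each codebook index costs log2(quantizer_bins) bits per frame, the frame rate is 50 frames per second (16000 / (8 * 5 * 4 * 2)), target bandwidths are expressed in kbps, and the number of active quantizers is capped at quantizer_n_q. All function names are hypothetical.

```python
import math


def bits_per_code(bins: int) -> float:
    """Bits needed to store one index into a codebook with `bins` entries."""
    return math.log2(bins)


def nominal_bitrate_kbps(n_q: int, bins: int, frame_rate: int) -> float:
    """Bitrate if all n_q codebooks emit one index per frame (assumption)."""
    return n_q * bits_per_code(bins) * frame_rate / 1000


def quantizers_for_bandwidth(
    target_kbps: float, bins: int, frame_rate: int, n_q: int
) -> int:
    """Codebooks that fit in target_kbps, capped at n_q (assumed convention)."""
    per_quantizer_kbps = bits_per_code(bins) * frame_rate / 1000
    return max(1, min(n_q, math.floor(target_kbps / per_quantizer_kbps)))


# Defaults from the class signature; frame_rate=50 is an assumption.
print(bits_per_code(1024))                         # 10.0
print(nominal_bitrate_kbps(8, 1024, 50))           # 4.0
print(quantizers_for_bandwidth(1.5, 1024, 50, 8))  # 3
print(quantizers_for_bandwidth(7.5, 1024, 50, 8))  # 8
```

Under these assumptions each codebook contributes 0.5 kbps, so both default target bandwidths exceed the nominal 4 kbps and the cap at quantizer_n_q applies.
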

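Examples

>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> generator = HiFiCodecGenerator()
>>> input_audio = torch.randn(1, 1, 16000)  # (B, 1, T)
>>> # use_dual_decoder=True is passed on the assumption that output[3] is only produced by that path
>>> output = generator(input_audio, use_dual_decoder=True)
>>> print(output[0].shape)  # Resynthesized audio shape
>>> print(output[1].shape)  # Commitment loss shape
>>> print(output[2].shape)  # Quantization loss shape
>>> print(output[3].shape)  # Resynthesized audio from encoder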

Initialize HiFiCodec Generator.

decode(codes: Tensor)

Run decoding.

Takes codes produced by encode and reconstructs the corresponding waveform.

Parameters: codes (Tensor) – Input codes (T_code, N_stream), as produced by encode.
Returns: Generated waveform (T_wav,), reconstructed from the input codes.
Return type: Tensor

Examples

>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> codec = HiFiCodecGenerator()
>>> input_audio = torch.randn(1, 1, 16000)  # (B, 1, T), as in the class-level example
>>> input_codes = codec.encode(input_audio)  # codes are codebook indices, so obtain them from encode
>>> waveform = codec.decode(input_codes)
>>> print(waveform.shape)  # Output shape will depend on the model

NOTE: The input codes should be of the shape (T_code, N_stream), where T_code is the length of the code sequence and N_stream is the number of streams used in the codec.

encode(x: Tensor, target_bw: float | None = None)

Run encoding.

Parameters:
- x (Tensor) – Input audio, assumed to follow the (B, 1, T_wav) layout of the class-level example.
- target_bw (float | None) – Target bandwidth for quantization. Default is None.
Returns: Generated codes (T_code, N_stream).
Return type: Tensor

Examples

>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> model = HiFiCodecGenerator()
>>> input_audio = torch.randn(1, 1, 16000)  # (B, 1, T_wav)
>>> encoded_codes = model.encode(input_audio)
>>> print(encoded_codes.shape)  # Documented shape: (T_code, N_stream)

NOTE: The input tensor is assumed to have the shape (B, 1, T_wav), where B is the batch size and T_wav is the number of audio samples.
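
The relationship between encode and decode can be illustrated with a toy residual vector quantizer. This is a pure-Python sketch of plain residual quantization, not the grouped variant and not ESPnet's implementation; the function names, codebooks, and frames are all made up for illustration. Each stage quantizes the residual left by the previous stage, and decoding sums the selected codebook entries.

```python
from typing import List

Vector = List[float]


def nearest(codebook: List[Vector], v: Vector) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)),
    )


def rvq_encode(frames: List[Vector], codebooks: List[List[Vector]]) -> List[List[int]]:
    """Return one row of indices per frame: shape (T_code, N_stream)."""
    codes = []
    for frame in frames:
        residual = list(frame)
        frame_codes = []
        for codebook in codebooks:
            idx = nearest(codebook, residual)
            frame_codes.append(idx)
            # The next stage only sees what this stage failed to capture.
            residual = [r - c for r, c in zip(residual, codebook[idx])]
        codes.append(frame_codes)
    return codes


def rvq_decode(codes: List[List[int]], codebooks: List[List[Vector]]) -> List[Vector]:
    """Sum the selected entry of every stage to rebuild each frame."""
    dim = len(codebooks[0][0])
    frames = []
    for frame_codes in codes:
        vec = [0.0] * dim
        for codebook, idx in zip(codebooks, frame_codes):
            vec = [a + b for a, b in zip(vec, codebook[idx])]
        frames.append(vec)
    return frames


codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],  # fine stage
]
frames = [[1.2, 1.0], [-1.0, 1.3], [0.1, 0.0]]

codes = rvq_encode(frames, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes)   # [[1, 1], [2, 2], [0, 0]]
print(approx)  # [[1.25, 1.0], [-1.0, 1.25], [0.0, 0.0]]
```

Here the codes have three rows (T_code) and two columns (N_stream), and the decoded frames approximate the inputs. Because the codes are integer indices, random floating-point tensors are not valid decoder input.
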

forward(x: Tensor, use_dual_decoder: bool = False)

Perform generator forward.

This method encodes the input waveform, quantizes the encoded features, and decodes them back to audio. It returns the resynthesized audio together with the quantizer losses, in the order shown in the class-level example.

Parameters:
- x (Tensor) – Input audio waveform tensor of shape (B, 1, T), as in the class-level example.
- use_dual_decoder (bool) – Presumably controls whether audio is also resynthesized directly from the encoder output. Default is False.
Returns:
- Resynthesized audio (Tensor).
- Commitment loss (Tensor).
- Quantization loss (Tensor).
- Resynthesized audio from encoder.
Return type: Tuple

Examples

>>> import torch
>>> from espnet2.gan_codec.hificodec.hificodec import HiFiCodecGenerator
>>> model = HiFiCodecGenerator()
>>> audio_input = torch.randn(8, 1, 16000)  # Batch of 8, 1 second of audio at 16 kHz
>>> resyn_audio, commit_loss, quantization_loss, resyn_audio_real = model.forward(audio_input)
>>> print(resyn_audio.shape)  # Resynthesized audio shape

NOTE: This method's signature has no forward_generator flag. Dispatch between _forward_generator and _forward_discrminator is handled by the enclosing HiFiCodec model, not by the generator.