espnet2.gan_codec.dac.dac.DACGenerator
class espnet2.gan_codec.dac.dac.DACGenerator(sample_rate: int = 24000, hidden_dim: int = 128, codebook_dim: int = 8, encdec_channels: int = 1, encdec_n_filters: int = 32, encdec_n_residual_layers: int = 1, encdec_ratios: List[int] = [8, 5, 4, 2], encdec_activation: str = 'Snake', encdec_activation_params: Dict[str, Any] = {}, encdec_norm: str = 'weight_norm', encdec_norm_params: Dict[str, Any] = {}, encdec_kernel_size: int = 7, encdec_residual_kernel_size: int = 7, encdec_last_kernel_size: int = 7, encdec_dilation_base: int = 2, encdec_causal: bool = False, encdec_pad_mode: str = 'reflect', encdec_true_skip: bool = False, encdec_compress: int = 2, encdec_lstm: int = 2, decoder_trim_right_ratio: float = 1.0, decoder_final_activation: str | None = None, decoder_final_activation_params: dict | None = None, quantizer_n_q: int = 8, quantizer_bins: int = 1024, quantizer_decay: float = 0.99, quantizer_kmeans_init: bool = True, quantizer_kmeans_iters: int = 50, quantizer_threshold_ema_dead_code: int = 2, quantizer_target_bandwidth: List[float] = [7.5, 15], quantizer_dropout: bool = True)
Bases: Module
DAC generator module.
This module implements the generator for the DAC (Descript Audio Codec) model. It uses an encoder-decoder architecture with quantization to generate audio waveforms from input tensors. The generator is designed to be flexible, allowing various configurations of the encoder, decoder, and quantizer.
encoder
The encoder component of the DAC generator.
- Type: SEANetEncoder
quantizer
The quantizer for encoding.
target_bandwidths
List of target bandwidths for quantization.
- Type: List[float]
sample_rate
The sample rate of the audio.
- Type: int
frame_rate
The frame rate calculated from the sample rate and encoder-decoder ratios.
- Type: int
decoder
The decoder component of the DAC generator.
- Type: SEANetDecoder
l1_quantization_loss
L1 loss for quantization.
- Type: torch.nn.L1Loss
l2_quantization_loss
L2 loss for quantization.
- Type: torch.nn.MSELoss
Parameters:
- sample_rate (int) – The sample rate of the audio (default: 24000).
- hidden_dim (int) – Dimension of hidden layers (default: 128).
- codebook_dim (int) – Dimension of the codebook for quantization (default: 8).
- encdec_channels (int) – Number of channels for encoder/decoder (default: 1).
- encdec_n_filters (int) – Number of filters for encoder/decoder (default: 32).
- encdec_n_residual_layers (int) – Number of residual layers (default: 1).
- encdec_ratios (List[int]) – Ratios for downsampling (default: [8, 5, 4, 2]).
- encdec_activation (str) – Activation function used (default: “Snake”).
- encdec_activation_params (Dict[str, Any]) – Parameters for the activation function (default: {}).
- encdec_norm (str) – Normalization method used (default: “weight_norm”).
- encdec_norm_params (Dict[str, Any]) – Parameters for normalization (default: {}).
- encdec_kernel_size (int) – Kernel size for convolution layers (default: 7).
- encdec_residual_kernel_size (int) – Kernel size for residual connections (default: 7).
- encdec_last_kernel_size (int) – Kernel size for the last layer (default: 7).
- encdec_dilation_base (int) – Dilation base for convolution layers (default: 2).
- encdec_causal (bool) – Whether to use causal convolutions (default: False).
- encdec_pad_mode (str) – Padding mode for convolutions (default: “reflect”).
- encdec_true_skip (bool) – Whether to use true skip connections (default: False).
- encdec_compress (int) – Compression factor for the encoder (default: 2).
- encdec_lstm (int) – Number of LSTM layers (default: 2).
- decoder_trim_right_ratio (float) – Trim ratio for the decoder output (default: 1.0).
- decoder_final_activation (Optional[str]) – Final activation function for the decoder (default: None).
- decoder_final_activation_params (Optional[dict]) – Parameters for the final activation function (default: None).
- quantizer_n_q (int) – Number of quantizers (default: 8).
- quantizer_bins (int) – Number of bins for quantization (default: 1024).
- quantizer_decay (float) – Decay factor for quantization (default: 0.99).
- quantizer_kmeans_init (bool) – Whether to initialize with K-means (default: True).
- quantizer_kmeans_iters (int) – Number of K-means iterations (default: 50).
- quantizer_threshold_ema_dead_code (int) – Threshold for dead code (default: 2).
- quantizer_target_bandwidth (List[float]) – Target bandwidths for quantization (default: [7.5, 15]).
- quantizer_dropout (bool) – Whether to use dropout in the quantizer (default: True).
Examples
>>> import torch
>>> generator = DACGenerator(sample_rate=22050, hidden_dim=256)
>>> input_tensor = torch.randn(1, 1, 48000)  # (B, C, T_wav)
>>> output, commit_loss, quantization_loss, audio_hat_real = generator(input_tensor)
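The frame_rate attribute is derived from the sample rate and the product of the encdec_ratios. A minimal sketch of that relationship with the default settings; the ceiling rounding used here is an assumption about the internal computation:
>>> import math
>>> sample_rate = 24000
>>> encdec_ratios = [8, 5, 4, 2]
>>> hop_size = math.prod(encdec_ratios)             # 8 * 5 * 4 * 2 = 320 samples per code frame
>>> frame_rate = math.ceil(sample_rate / hop_size)  # 24000 / 320 = 75 code frames per second
>>> frame_rate
75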
Initialize DAC Generator.
- Parameters: TODO (jiatong)
decode(codes: Tensor)
Run decoding to generate waveform from codes.
This method takes the codes produced by the encoding step and generates the corresponding waveform.
- Parameters: codes (Tensor) – Input codes (T_code, N_stream), where T_code is the length of the code sequence and N_stream is the number of quantized streams.
- Returns: Generated waveform (T_wav,), which represents the reconstructed audio signal.
- Return type: Tensor
Examples
>>> # Assume `codes` is a tensor of codes produced by `generator.encode`
>>> generated_waveform = generator.decode(codes)
>>> print(generated_waveform.shape)  # Output: (T_wav,)
NOTE
The input codes should be properly formatted as per the model’s specifications to ensure correct waveform generation.
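A round-trip sketch combining encode and decode; the batched (B, C, T_wav) input layout from the class-level example is assumed here, and the exact output length may differ slightly depending on padding:
>>> import torch
>>> generator = DACGenerator()
>>> audio = torch.randn(1, 1, 24000)    # (B, C, T_wav), one second at 24 kHz
>>> codes = generator.encode(audio)     # quantized code streams
>>> waveform = generator.decode(codes)  # reconstructed waveform
>>> print(waveform.shape)               # approximately (1, 1, 24000) after up-sampling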
encode(x: Tensor, target_bw: float | None = None)
Run encoding.
- Parameters:
- x (Tensor) – Input audio (T_wav,).
- target_bw (Optional[float]) – Target bandwidth for quantization. If None, a value from quantizer_target_bandwidth is used (default: None).
- Returns: Generated codes (T_code, N_stream).
- Return type: Tensor
Examples
>>> import torch
>>> model = DACGenerator()
>>> audio_input = torch.randn(1, 24000)  # Simulate 1 second of audio
>>> codes = model.encode(audio_input)
>>> print(codes.shape)  # Shape of the generated codes
NOTE
The input tensor x should be a 1D tensor representing audio waveform data with a shape of (T_wav,). The output will be a tensor containing the generated codes with shape (T_code, N_stream).
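The optional target_bw argument selects the bandwidth used by the quantizer. A hedged sketch, assuming the default quantizer_target_bandwidth of [7.5, 15] and that the values are interpreted as kbps:
>>> import torch
>>> generator = DACGenerator()
>>> audio = torch.randn(1, 1, 24000)                      # (B, C, T_wav)
>>> codes_low = generator.encode(audio, target_bw=7.5)    # lower bandwidth (may use fewer code streams)
>>> codes_full = generator.encode(audio, target_bw=15)    # highest configured bandwidth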
forward(x: Tensor, use_dual_decoder: bool = False)
Perform generator forward.
- Parameters:
- x (Tensor) – Input audio waveform tensor (B, 1, T_wav).
- use_dual_decoder (bool) – Whether to additionally decode the unquantized encoder output (default: False).
- Returns:
- audio_hat (Tensor): Resynthesized audio.
- commit_loss (Tensor): Commitment loss from the quantizer.
- quantization_loss (Tensor): Quantization loss.
- audio_hat_real (Tensor): Audio decoded from the unquantized encoder output (meaningful when use_dual_decoder is True).
- Return type: Tuple[Tensor, Tensor, Tensor, Tensor]
Examples
>>> import torch
>>> generator = DACGenerator()
>>> audio_input = torch.randn(1, 1, 24000)  # Example audio input (B, C, T_wav)
>>> audio_hat, commit_loss, quantization_loss, audio_hat_real = generator(audio_input)
>>> print(audio_hat.shape)
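The returned losses can be folded into a generator-side training objective. A minimal sketch under illustrative loss weights; the 0.25 and 1.0 factors below are assumptions, not the weights used by ESPnet recipes:
>>> import torch
>>> import torch.nn.functional as F
>>> generator = DACGenerator()
>>> audio = torch.randn(1, 1, 24000)  # (B, C, T_wav)
>>> audio_hat, commit_loss, quantization_loss, _ = generator(audio)
>>> min_len = min(audio.shape[-1], audio_hat.shape[-1])  # guard against padding differences
>>> recon_loss = F.l1_loss(audio_hat[..., :min_len], audio[..., :min_len])
>>> loss = recon_loss + 0.25 * commit_loss + 1.0 * quantization_loss  # illustrative weights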