espnet2.gan_codec.dac.dac.DACGenerator
class espnet2.gan_codec.dac.dac.DACGenerator(sample_rate: int = 24000, hidden_dim: int = 128, codebook_dim: int = 8, encdec_channels: int = 1, encdec_n_filters: int = 32, encdec_n_residual_layers: int = 1, encdec_ratios: List[int] = [8, 5, 4, 2], encdec_activation: str = 'Snake', encdec_activation_params: Dict[str, Any] = {}, encdec_norm: str = 'weight_norm', encdec_norm_params: Dict[str, Any] = {}, encdec_kernel_size: int = 7, encdec_residual_kernel_size: int = 7, encdec_last_kernel_size: int = 7, encdec_dilation_base: int = 2, encdec_causal: bool = False, encdec_pad_mode: str = 'reflect', encdec_true_skip: bool = False, encdec_compress: int = 2, encdec_lstm: int = 2, decoder_trim_right_ratio: float = 1.0, decoder_final_activation: str | None = None, decoder_final_activation_params: dict | None = None, quantizer_n_q: int = 8, quantizer_bins: int = 1024, quantizer_decay: float = 0.99, quantizer_kmeans_init: bool = True, quantizer_kmeans_iters: int = 50, quantizer_threshold_ema_dead_code: int = 2, quantizer_target_bandwidth: List[float] = [7.5, 15], quantizer_dropout: bool = True)
Bases: Module
DAC generator module.
This module implements the generator for the DAC (Descript Audio Codec) model. It uses an encoder-decoder architecture with quantization to generate audio waveforms from input tensors. The generator is designed to be flexible, allowing various configurations of the encoder, decoder, and quantizer.
encoder
The encoder component of the DAC generator.
- Type: SEANetEncoder
quantizer
The quantizer for encoding.
target_bandwidths
List of target bandwidths for quantization.
- Type: List[float]
sample_rate
The sample rate of the audio.
- Type: int
frame_rate
The frame rate calculated from the sample rate and encoder-decoder ratios.
- Type: int
decoder
The decoder component of the DAC generator.
- Type: SEANetDecoder
l1_quantization_loss
L1 loss for quantization.
- Type: torch.nn.L1Loss
l2_quantization_loss
L2 loss for quantization.
- Type: torch.nn.MSELoss
Parameters:
- sample_rate (int) – The sample rate of the audio (default: 24000).
- hidden_dim (int) – Dimension of hidden layers (default: 128).
- codebook_dim (int) – Dimension of the codebook for quantization (default: 8).
- encdec_channels (int) – Number of channels for encoder/decoder (default: 1).
- encdec_n_filters (int) – Number of filters for encoder/decoder (default: 32).
- encdec_n_residual_layers (int) – Number of residual layers (default: 1).
- encdec_ratios (List[int]) – Ratios for downsampling (default: [8, 5, 4, 2]).
- encdec_activation (str) – Activation function used (default: “Snake”).
- encdec_activation_params (Dict[str, Any]) – Parameters for the activation function (default: {}).
- encdec_norm (str) – Normalization method used (default: “weight_norm”).
- encdec_norm_params (Dict[str, Any]) – Parameters for normalization (default: {}).
- encdec_kernel_size (int) – Kernel size for convolution layers (default: 7).
- encdec_residual_kernel_size (int) – Kernel size for residual connections (default: 7).
- encdec_last_kernel_size (int) – Kernel size for the last layer (default: 7).
- encdec_dilation_base (int) – Dilation base for convolution layers (default: 2).
- encdec_causal (bool) – Whether to use causal convolutions (default: False).
- encdec_pad_mode (str) – Padding mode for convolutions (default: “reflect”).
- encdec_true_skip (bool) – Whether to use true skip connections (default: False).
- encdec_compress (int) – Compression factor for the encoder (default: 2).
- encdec_lstm (int) – Number of LSTM layers (default: 2).
- decoder_trim_right_ratio (float) – Trim ratio for the decoder output (default: 1.0).
- decoder_final_activation (Optional[str]) – Final activation function for the decoder (default: None).
- decoder_final_activation_params (Optional[dict]) – Parameters for the final activation function (default: None).
- quantizer_n_q (int) – Number of quantizers (default: 8).
- quantizer_bins (int) – Number of bins for quantization (default: 1024).
- quantizer_decay (float) – Decay factor for quantization (default: 0.99).
- quantizer_kmeans_init (bool) – Whether to initialize with K-means (default: True).
- quantizer_kmeans_iters (int) – Number of K-means iterations (default: 50).
- quantizer_threshold_ema_dead_code (int) – Threshold for dead code (default: 2).
- quantizer_target_bandwidth (List[float]) – Target bandwidths for quantization (default: [7.5, 15]).
- quantizer_dropout (bool) – Whether to use dropout in the quantizer (default: True).
Examples
>>> import torch
>>> generator = DACGenerator(sample_rate=22050, hidden_dim=256)
>>> input_tensor = torch.randn(1, 1, 48000)  # (B, C, T_wav)
>>> output, commit_loss, quantization_loss, audio_hat_real = generator(input_tensor)
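The frame_rate attribute is derived from the sample rate and the product of the encdec_ratios. A minimal sketch of that relationship with the default settings; the ceiling rounding used here is an assumption about the internal computation:
>>> import math
>>> sample_rate = 24000
>>> encdec_ratios = [8, 5, 4, 2]
>>> hop_size = math.prod(encdec_ratios)             # 8 * 5 * 4 * 2 = 320 samples per code frame
>>> frame_rate = math.ceil(sample_rate / hop_size)  # 24000 / 320 = 75 code frames per second
>>> frame_rate
75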
Initialize DAC Generator.
- Parameters: TODO (jiatong)
decode(codes: Tensor)
Run decoding to generate waveform from codes.
This method takes the codes produced by the encoding step and generates the corresponding waveform.
- Parameters: codes (Tensor) – Input codes (T_code, N_stream), where T_code is the length of the code sequence and N_stream is the number of quantized streams.
- Returns: Generated waveform (T_wav,), which represents the reconstructed audio signal.
- Return type: Tensor
Examples
>>> # Assume `codes` is a tensor of codes produced by `generator.encode`
>>> generated_waveform = generator.decode(codes)
>>> print(generated_waveform.shape)  # Output: (T_wav,)
NOTE
The input codes should be properly formatted as per the model’s specifications to ensure correct waveform generation.
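A round-trip sketch combining encode and decode; the batched (B, C, T_wav) input layout from the class-level example is assumed here, and the exact output length may differ slightly depending on padding:
>>> import torch
>>> generator = DACGenerator()
>>> audio = torch.randn(1, 1, 24000)    # (B, C, T_wav), one second at 24 kHz
>>> codes = generator.encode(audio)     # quantized code streams
>>> waveform = generator.decode(codes)  # reconstructed waveform
>>> print(waveform.shape)               # approximately (1, 1, 24000) after up-sampling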
encode(x: Tensor, target_bw: float | None = None)
Run encoding.
- Parameters:
- x (Tensor) – Input audio (T_wav,).
- target_bw (Optional[float]) – Target bandwidth for quantization. If None, a value from quantizer_target_bandwidth is used (default: None).
- Returns: Generated codes (T_code, N_stream).
- Return type: Tensor
Examples
>>> import torch
>>> model = DACGenerator()
>>> audio_input = torch.randn(1, 24000)  # Simulate 1 second of audio
>>> codes = model.encode(audio_input)
>>> print(codes.shape)  # Shape of the generated codes
NOTE
The input tensor x should be a 1D tensor representing audio waveform data with a shape of (T_wav,). The output will be a tensor containing the generated codes with shape (T_code, N_stream).
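The optional target_bw argument selects the bandwidth used by the quantizer. A hedged sketch, assuming the default quantizer_target_bandwidth of [7.5, 15] and that the values are interpreted as kbps:
>>> import torch
>>> generator = DACGenerator()
>>> audio = torch.randn(1, 1, 24000)                      # (B, C, T_wav)
>>> codes_low = generator.encode(audio, target_bw=7.5)    # lower bandwidth (may use fewer code streams)
>>> codes_full = generator.encode(audio, target_bw=15)    # highest configured bandwidth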
forward(x: Tensor, use_dual_decoder: bool = False)
Perform generator forward.
- Parameters:
- x (Tensor) – Input audio waveform tensor (B, 1, T_wav).
- use_dual_decoder (bool) – Whether to additionally decode the unquantized encoder output (default: False).
- Returns:
- audio_hat (Tensor): Resynthesized audio.
- commit_loss (Tensor): Commitment loss from the quantizer.
- quantization_loss (Tensor): Quantization loss.
- audio_hat_real (Tensor): Audio decoded from the unquantized encoder output (meaningful when use_dual_decoder is True).
- Return type: Tuple[Tensor, Tensor, Tensor, Tensor]
Examples
>>> import torch
>>> generator = DACGenerator()
>>> audio_input = torch.randn(1, 1, 24000)  # Example audio input (B, C, T_wav)
>>> audio_hat, commit_loss, quantization_loss, audio_hat_real = generator(audio_input)
>>> print(audio_hat.shape)
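The returned losses can be folded into a generator-side training objective. A minimal sketch under illustrative loss weights; the 0.25 and 1.0 factors below are assumptions, not the weights used by ESPnet recipes:
>>> import torch
>>> import torch.nn.functional as F
>>> generator = DACGenerator()
>>> audio = torch.randn(1, 1, 24000)  # (B, C, T_wav)
>>> audio_hat, commit_loss, quantization_loss, _ = generator(audio)
>>> min_len = min(audio.shape[-1], audio_hat.shape[-1])  # guard against padding differences
>>> recon_loss = F.l1_loss(audio_hat[..., :min_len], audio[..., :min_len])
>>> loss = recon_loss + 0.25 * commit_loss + 1.0 * quantization_loss  # illustrative weights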