espnet2.gan_codec.dac.dac.DAC
class espnet2.gan_codec.dac.dac.DAC(sampling_rate: int = 24000, generator_params: Dict[str, Any] = {'decoder_final_activation': None, 'decoder_final_activation_params': None, 'decoder_trim_right_ratio': 1.0, 'encdec_activation': 'Snake', 'encdec_activation_params': {}, 'encdec_causal': False, 'encdec_channels': 1, 'encdec_compress': 2, 'encdec_dilation_base': 2, 'encdec_kernel_size': 7, 'encdec_last_kernel_size': 7, 'encdec_lstm': 2, 'encdec_n_filters': 32, 'encdec_n_residual_layers': 1, 'encdec_norm': 'weight_norm', 'encdec_norm_params': {}, 'encdec_pad_mode': 'reflect', 'encdec_ratios': [8, 5, 4, 2], 'encdec_residual_kernel_size': 7, 'encdec_true_skip': False, 'hidden_dim': 128, 'quantizer_bins': 1024, 'quantizer_decay': 0.99, 'quantizer_dropout': True, 'quantizer_kmeans_init': True, 'quantizer_kmeans_iters': 50, 'quantizer_n_q': 8, 'quantizer_target_bandwidth': [7.5, 15], 'quantizer_threshold_ema_dead_code': 2}, discriminator_params: Dict[str, Any] = {'msmpmb_discriminator_params': {'band_discriminator_params': {'bands': [(0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)], 'channel': 32, 'hop_factor': 0.25, 'sample_rate': 24000}, 'fft_sizes': [2048, 1024, 512], 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'rates': [], 'sample_rate': 24000}, 'scale_follow_official_norm': False}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 24000, 'log_base': None, 'n_mels': 80, 'range_end': 11, 'range_start': 6, 'window': 'hann'}, use_dual_decoder: bool = True, lambda_quantization: float = 1.0, lambda_reconstruct: float = 1.0, lambda_commit: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)
Bases: AbsGANCodec
DAC model for audio processing using a GAN architecture.
The DAC (Descript Audio Codec) model uses a GAN-based architecture to encode and decode audio signals. It pairs a generator that encodes, quantizes, and reconstructs audio waveforms with a discriminator that assesses the quality of the generated outputs. The model supports several loss functions and can be configured through a large set of parameters for fine-tuning.
generator
The generator module responsible for audio synthesis.
- Type: DACGenerator
discriminator
The discriminator module that evaluates the generated audio.
- Type: DACDiscriminator
generator_adv_loss
Adversarial loss function for the generator.
- Type: GeneratorAdversarialLoss
generator_reconstruct_loss
Loss function for audio reconstruction.
- Type: torch.nn.L1Loss
discriminator_adv_loss
Adversarial loss function for the discriminator.
- Type: DiscriminatorAdversarialLoss
use_feat_match_loss
Flag to enable feature matching loss.
- Type: bool
feat_match_loss
Loss function for feature matching.
- Type: FeatureMatchLoss
use_mel_loss
Flag to enable mel spectrogram loss.
- Type: bool
mel_loss
Loss function for mel spectrograms.
- Type: MultiScaleMelSpectrogramLoss
use_dual_decoder
Flag to use dual decoding.
- Type: bool
cache_generator_outputs
Flag to cache generator outputs.
- Type: bool
fs
Sampling rate of the audio.
- Type: int
num_streams
Number of quantization streams.
- Type: int
frame_shift
Frame shift size.
- Type: int
code_size_per_stream
Code size per quantization stream.
- Type: List[int]
Parameters:
- sampling_rate (int) – The sampling rate of the audio (default: 24000).
- generator_params (Dict[str, Any]) – Parameters for the generator model.
- discriminator_params (Dict[str, Any]) – Parameters for the discriminator model.
- generator_adv_loss_params (Dict[str, Any]) – Parameters for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameters for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature matching loss (default: True).
- feat_match_loss_params (Dict[str, Any]) – Parameters for feature matching loss.
- use_mel_loss (bool) – Whether to use mel loss (default: True).
- mel_loss_params (Dict[str, Any]) – Parameters for mel loss.
- use_dual_decoder (bool) – Whether to use a dual decoder (default: True).
- lambda_quantization (float) – Weight for quantization loss (default: 1.0).
- lambda_reconstruct (float) – Weight for reconstruction loss (default: 1.0).
- lambda_commit (float) – Weight for commitment loss (default: 1.0).
- lambda_adv (float) – Weight for adversarial loss (default: 1.0).
- lambda_feat_match (float) – Weight for feature matching loss (default: 2.0).
- lambda_mel (float) – Weight for mel loss (default: 45.0).
- cache_generator_outputs (bool) – Whether to cache generator outputs (default: False).
Examples:
>>> dac_model = DAC(sampling_rate=22050)
>>> audio_input = torch.randn(1, 22050) # Simulated audio input
>>> output = dac_model(audio_input)
>>> print(output["loss"])
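Because the dual decoder relies on the mel reconstruction path, use_dual_decoder=True must be paired with use_mel_loss=True (see the Raises entry below); a minimal sketch of the valid and invalid combinations:
>>> model = DAC(use_dual_decoder=True, use_mel_loss=True)  # valid combination
>>> # DAC(use_dual_decoder=True, use_mel_loss=False) raises AssertionError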
NOTE: The DAC model requires proper configuration of the generator and discriminator parameters to function effectively. Consult the parameter descriptions above for details.
- Raises: AssertionError – If the dual decoder is used without enabling mel loss.
Initialize DAC model.
- Parameters: TODO (jiatong)
decode(x: Tensor, **kwargs) → Tensor
Run decoding.
This method takes the input codes and generates the corresponding waveform using the DAC generator.
- Parameters: x (Tensor) – Input codes (T_code, N_stream).
- Returns: Generated waveform (T_wav,).
- Return type: Tensor
Examples:
>>> dac_model = DAC()
>>> codes = torch.randint(0, 1024, (100, 8))  # Example code indices (T_code, N_stream)
>>> waveform = dac_model.decode(codes)
>>> print(waveform.shape)
torch.Size([T_wav])  # Output length depends on the model's frame shift
NOTE: Ensure that the input tensor x has the correct shape, matching the expected dimensions for the decoder.
encode(x: Tensor, **kwargs) → Tensor
Run encoding.
This method encodes the input audio tensor into a set of generated codes using the DAC generator. The encoding process involves passing the audio waveform through the generator’s encoder and quantizer.
- Parameters: x (Tensor) – Input audio (T_wav,). The shape of the tensor should be compatible with the expected input of the encoder.
- Returns: Generated codes (T_code, N_stream). The output tensor contains the encoded representation of the input audio, where T_code is the length of the generated codes and N_stream is the number of quantization streams.
- Return type: Tensor
Examples:
>>> model = DAC()
>>> audio_input = torch.randn(24000)  # Simulated audio input (T_wav,)
>>> encoded_codes = model.encode(audio_input)
>>> print(encoded_codes.shape)
torch.Size([T_code, N_stream]) # Shape depends on the input and model params
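A hedged round-trip sketch combining encode and decode (shapes follow the docstrings above; random-noise input makes the reconstruction meaningless but shape-correct):
>>> model = DAC()
>>> audio = torch.randn(24000)      # (T_wav,): one second at 24 kHz
>>> codes = model.encode(audio)     # (T_code, N_stream) discrete codes
>>> waveform = model.decode(codes)  # (T_wav,) decoded waveform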
forward(audio: Tensor, forward_generator: bool = True, **kwargs) → Dict[str, Any]
Perform generator forward.
- Parameters:
- audio (Tensor) – Audio waveform tensor (B, T_wav).
- forward_generator (bool) – Whether to forward generator.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
Examples:
>>> model = DAC()
>>> audio_input = torch.randn(1, 16000) # Example audio tensor
>>> output = model.forward(audio_input)
>>> print(output.keys())
dict_keys(['loss', 'stats', 'weight', 'optim_idx'])
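As a hedged sketch, the optim_idx field can drive a two-optimizer training step; the optimizer setup below is illustrative and not part of the DAC API:
>>> import torch
>>> model = DAC()
>>> opt_g = torch.optim.Adam(model.generator.parameters(), lr=1e-4)
>>> opt_d = torch.optim.Adam(model.discriminator.parameters(), lr=1e-4)
>>> optimizers = [opt_g, opt_d]
>>> audio = torch.randn(2, 24000)  # (B, T_wav) dummy batch
>>> for forward_generator in (True, False):
...     out = model(audio, forward_generator=forward_generator)
...     opt = optimizers[out["optim_idx"]]  # 0 -> generator, 1 -> discriminator
...     opt.zero_grad()
...     out["loss"].backward()
...     opt.step()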
inference(x: Tensor, **kwargs) → Dict[str, Tensor]
Run inference to generate audio from input.
This method takes an input audio tensor and generates the corresponding output waveform and neural codec. The input tensor should be of shape (T_wav,) where T_wav is the length of the audio waveform.
- Parameters: x (Tensor) – Input audio tensor of shape (T_wav,).
- Returns:
- wav (Tensor): Generated waveform tensor of shape (T_wav,).
- codec (Tensor): Generated neural codec tensor of shape (T_code, N_stream).
- Return type: Dict[str, Tensor]
Examples:
>>> model = DAC()
>>> input_audio = torch.randn(24000) # Example audio input
>>> output = model.inference(input_audio)
>>> generated_wav = output['wav']
>>> generated_codec = output['codec']
NOTE: Ensure that the input tensor is appropriately preprocessed and matches the expected input format of the model.
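For example, a hedged sketch of writing the result to disk (torchaudio is assumed to be available; the model's fs attribute supplies the sampling rate):
>>> import torchaudio
>>> output = model.inference(torch.randn(24000))
>>> wav = output["wav"].detach().cpu().unsqueeze(0)  # (1, T_wav): torchaudio expects (channels, T)
>>> torchaudio.save("reconstruction.wav", wav, sample_rate=model.fs)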
meta_info() → Dict[str, Any]
Retrieve metadata information of the DAC model.
This method returns a dictionary containing key metadata attributes of the DAC model, which includes the sampling frequency, number of streams, frame shift, and the code size per stream.
- Returns: A dictionary containing the following keys:
  - fs (int): The sampling frequency of the model.
  - num_streams (int): The number of quantizer streams.
  - frame_shift (int): The frame shift calculated from the encoder-decoder ratios.
  - code_size_per_stream (List[int]): A list indicating the code size for each stream.
- Return type: Dict[str, Any]
Examples:
>>> dac_model = DAC()
>>> info = dac_model.meta_info()
>>> print(info)
{'fs': 24000, 'num_streams': 8, 'frame_shift': 640,
'code_size_per_stream': [1024, 1024, 1024, 1024, 1024,
1024, 1024, 1024]}
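As an illustration, the metadata suffices to derive the codec's frame rate and an approximate bitrate; this accounting (log2 of the codebook size in bits per stream per code frame) is an assumption based on standard residual-VQ bookkeeping, not an ESPnet API:
>>> import math
>>> info = dac_model.meta_info()
>>> frames_per_sec = info["fs"] / info["frame_shift"]
>>> bits_per_sec = frames_per_sec * sum(
...     math.log2(size) for size in info["code_size_per_stream"])
>>> print(f"{frames_per_sec:.1f} frames/s, {bits_per_sec / 1000:.2f} kbps")
37.5 frames/s, 3.00 kbps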