espnet2.gan_codec.dac.dac.DAC
class espnet2.gan_codec.dac.dac.DAC(sampling_rate: int = 24000, generator_params: Dict[str, Any] = {'decoder_final_activation': None, 'decoder_final_activation_params': None, 'decoder_trim_right_ratio': 1.0, 'encdec_activation': 'Snake', 'encdec_activation_params': {}, 'encdec_causal': False, 'encdec_channels': 1, 'encdec_compress': 2, 'encdec_dilation_base': 2, 'encdec_kernel_size': 7, 'encdec_last_kernel_size': 7, 'encdec_lstm': 2, 'encdec_n_filters': 32, 'encdec_n_residual_layers': 1, 'encdec_norm': 'weight_norm', 'encdec_norm_params': {}, 'encdec_pad_mode': 'reflect', 'encdec_ratios': [8, 5, 4, 2], 'encdec_residual_kernel_size': 7, 'encdec_true_skip': False, 'hidden_dim': 128, 'quantizer_bins': 1024, 'quantizer_decay': 0.99, 'quantizer_dropout': True, 'quantizer_kmeans_init': True, 'quantizer_kmeans_iters': 50, 'quantizer_n_q': 8, 'quantizer_target_bandwidth': [7.5, 15], 'quantizer_threshold_ema_dead_code': 2}, discriminator_params: Dict[str, Any] = {'msmpmb_discriminator_params': {'band_discriminator_params': {'bands': [(0.0, 0.1), (0.1, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)], 'channel': 32, 'hop_factor': 0.25, 'sample_rate': 24000}, 'fft_sizes': [2048, 1024, 512], 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'rates': [], 'sample_rate': 24000}, 'scale_follow_official_norm': False}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 24000, 'log_base': None, 'n_mels': 80, 'range_end': 11, 'range_start': 6, 'window': 'hann'}, use_dual_decoder: bool = True, lambda_quantization: float = 1.0, lambda_reconstruct: float = 1.0, lambda_commit: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)
Bases: AbsGANCodec
DAC model for audio processing using a GAN architecture.
The DAC (Descript Audio Codec) model uses a GAN-based architecture to encode and decode audio signals. It pairs a generator that encodes, quantizes, and reconstructs audio waveforms with a discriminator that assesses the quality of the generated outputs. The model supports several loss functions and can be configured through a large set of parameters for fine-tuning.
generator
The generator module responsible for audio synthesis.
- Type: DACGenerator
discriminator
The discriminator module that evaluates the generated audio.
- Type: DACDiscriminator
generator_adv_loss
Adversarial loss function for the generator.
- Type: GeneratorAdversarialLoss
generator_reconstruct_loss
Loss function for audio reconstruction.
- Type: torch.nn.L1Loss
discriminator_adv_loss
Adversarial loss function for the discriminator.
- Type: DiscriminatorAdversarialLoss
use_feat_match_loss
Flag to enable feature matching loss.
- Type: bool
feat_match_loss
Loss function for feature matching.
- Type: FeatureMatchLoss
use_mel_loss
Flag to enable mel spectrogram loss.
- Type: bool
mel_loss
Loss function for mel spectrograms.
- Type: MultiScaleMelSpectrogramLoss
use_dual_decoder
Flag to use dual decoding.
- Type: bool
cache_generator_outputs
Flag to cache generator outputs.
- Type: bool
fs
Sampling rate of the audio.
- Type: int
num_streams
Number of quantization streams.
- Type: int
frame_shift
Frame shift size.
- Type: int
code_size_per_stream
Code size per quantization stream.
- Type: List[int]
Parameters:
- sampling_rate (int) – The sampling rate of the audio (default: 24000).
- generator_params (Dict[str, Any]) – Parameters for the generator model.
- discriminator_params (Dict[str, Any]) – Parameters for the discriminator model.
- generator_adv_loss_params (Dict[str, Any]) – Parameters for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameters for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature matching loss (default: True).
- feat_match_loss_params (Dict[str, Any]) – Parameters for feature matching loss.
- use_mel_loss (bool) – Whether to use mel loss (default: True).
- mel_loss_params (Dict[str, Any]) – Parameters for mel loss.
- use_dual_decoder (bool) – Whether to use a dual decoder (default: True).
- lambda_quantization (float) – Weight for quantization loss (default: 1.0).
- lambda_reconstruct (float) – Weight for reconstruction loss (default: 1.0).
- lambda_commit (float) – Weight for commitment loss (default: 1.0).
- lambda_adv (float) – Weight for adversarial loss (default: 1.0).
- lambda_feat_match (float) – Weight for feature matching loss (default: 2.0).
- lambda_mel (float) – Weight for mel loss (default: 45.0).
- cache_generator_outputs (bool) – Whether to cache generator outputs (default: False).
Examples:
>>> dac_model = DAC(sampling_rate=22050)
>>> audio_input = torch.randn(1, 22050) # Simulated audio input
>>> output = dac_model(audio_input)
>>> print(output["loss"])
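Because the dual decoder relies on the mel reconstruction path, use_dual_decoder=True must be paired with use_mel_loss=True (see the Raises entry below); a minimal sketch of the valid and invalid combinations:
>>> model = DAC(use_dual_decoder=True, use_mel_loss=True)  # valid combination
>>> # DAC(use_dual_decoder=True, use_mel_loss=False) raises AssertionError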
NOTE: The DAC model requires proper configuration of the generator and discriminator parameters to function effectively. Consult the parameter descriptions above for details.
- Raises: AssertionError – If the dual decoder is used without enabling mel loss.
Initialize DAC model.
- Parameters: TODO (jiatong)
decode(x: Tensor, **kwargs) → Tensor
Run decoding.
This method takes the input codes and generates the corresponding waveform using the DAC generator.
- Parameters: x (Tensor) – Input codes (T_code, N_stream).
- Returns: Generated waveform (T_wav,).
- Return type: Tensor
Examples:
>>> dac_model = DAC()
>>> codes = torch.randint(0, 1024, (100, 8))  # Example code indices (T_code, N_stream)
>>> waveform = dac_model.decode(codes)
>>> print(waveform.shape)
torch.Size([T_wav])  # Output length depends on the model's frame shift
NOTE: Ensure that the input tensor x has the correct shape, matching the expected dimensions for the decoder.
encode(x: Tensor, **kwargs) → Tensor
Run encoding.
This method encodes the input audio tensor into a set of generated codes using the DAC generator. The encoding process involves passing the audio waveform through the generator’s encoder and quantizer.
- Parameters: x (Tensor) – Input audio (T_wav,). The shape of the tensor should be compatible with the expected input of the encoder.
- Returns: Generated codes (T_code, N_stream). The output tensor contains the encoded representation of the input audio, where T_code is the length of the generated codes and N_stream is the number of quantization streams.
- Return type: Tensor
Examples:
>>> model = DAC()
>>> audio_input = torch.randn(24000)  # Simulated audio input (T_wav,)
>>> encoded_codes = model.encode(audio_input)
>>> print(encoded_codes.shape)
torch.Size([T_code, N_stream]) # Shape depends on the input and model params
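A hedged round-trip sketch combining encode and decode (shapes follow the docstrings above; random-noise input makes the reconstruction meaningless but shape-correct):
>>> model = DAC()
>>> audio = torch.randn(24000)      # (T_wav,): one second at 24 kHz
>>> codes = model.encode(audio)     # (T_code, N_stream) discrete codes
>>> waveform = model.decode(codes)  # (T_wav,) decoded waveform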
forward(audio: Tensor, forward_generator: bool = True, **kwargs) → Dict[str, Any]
Perform generator forward.
- Parameters:
- audio (Tensor) – Audio waveform tensor (B, T_wav).
- forward_generator (bool) – Whether to forward generator.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
Examples:
>>> model = DAC()
>>> audio_input = torch.randn(1, 16000) # Example audio tensor
>>> output = model.forward(audio_input)
>>> print(output.keys())
dict_keys(['loss', 'stats', 'weight', 'optim_idx'])
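As a hedged sketch, the optim_idx field can drive a two-optimizer training step; the optimizer setup below is illustrative and not part of the DAC API:
>>> import torch
>>> model = DAC()
>>> opt_g = torch.optim.Adam(model.generator.parameters(), lr=1e-4)
>>> opt_d = torch.optim.Adam(model.discriminator.parameters(), lr=1e-4)
>>> optimizers = [opt_g, opt_d]
>>> audio = torch.randn(2, 24000)  # (B, T_wav) dummy batch
>>> for forward_generator in (True, False):
...     out = model(audio, forward_generator=forward_generator)
...     opt = optimizers[out["optim_idx"]]  # 0 -> generator, 1 -> discriminator
...     opt.zero_grad()
...     out["loss"].backward()
...     opt.step()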
inference(x: Tensor, **kwargs) → Dict[str, Tensor]
Run inference to generate audio from input.
This method takes an input audio tensor and generates the corresponding output waveform and neural codec. The input tensor should be of shape (T_wav,) where T_wav is the length of the audio waveform.
- Parameters: x (Tensor) – Input audio tensor of shape (T_wav,).
- Returns:
- wav (Tensor): Generated waveform tensor of shape (T_wav,).
- codec (Tensor): Generated neural codec tensor of shape (T_code, N_stream).
- Return type: Dict[str, Tensor]
Examples:
>>> model = DAC()
>>> input_audio = torch.randn(24000) # Example audio input
>>> output = model.inference(input_audio)
>>> generated_wav = output['wav']
>>> generated_codec = output['codec']
NOTE: Ensure that the input tensor is appropriately preprocessed and matches the expected input format of the model.
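For example, a hedged sketch of writing the result to disk (torchaudio is assumed to be available; the model's fs attribute supplies the sampling rate):
>>> import torchaudio
>>> output = model.inference(torch.randn(24000))
>>> wav = output["wav"].detach().cpu().unsqueeze(0)  # (1, T_wav): torchaudio expects (channels, T)
>>> torchaudio.save("reconstruction.wav", wav, sample_rate=model.fs)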
meta_info() → Dict[str, Any]
Retrieve metadata information of the DAC model.
This method returns a dictionary containing key metadata attributes of the DAC model, which includes the sampling frequency, number of streams, frame shift, and the code size per stream.
- Returns: A dictionary containing the following keys:
  - fs (int): The sampling frequency of the model.
  - num_streams (int): The number of quantizer streams.
  - frame_shift (int): The frame shift calculated from the encoder-decoder ratios.
  - code_size_per_stream (List[int]): A list indicating the code size for each stream.
- Return type: Dict[str, Any]
Examples:
>>> dac_model = DAC()
>>> info = dac_model.meta_info()
>>> print(info)
{'fs': 24000, 'num_streams': 8, 'frame_shift': 640,
'code_size_per_stream': [1024, 1024, 1024, 1024, 1024,
1024, 1024, 1024]}
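As an illustration, the metadata suffices to derive the codec's frame rate and an approximate bitrate; this accounting (log2 of the codebook size in bits per stream per code frame) is an assumption based on standard residual-VQ bookkeeping, not an ESPnet API:
>>> import math
>>> info = dac_model.meta_info()
>>> frames_per_sec = info["fs"] / info["frame_shift"]
>>> bits_per_sec = frames_per_sec * sum(
...     math.log2(size) for size in info["code_size_per_stream"])
>>> print(f"{frames_per_sec:.1f} frames/s, {bits_per_sec / 1000:.2f} kbps")
37.5 frames/s, 3.00 kbps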