espnet2.gan_codec.encodec.encodec.Encodec
class espnet2.gan_codec.encodec.encodec.Encodec(sampling_rate: int = 24000, generator_params: Dict[str, Any] = {'decoder_final_activation': None, 'decoder_final_activation_params': None, 'decoder_trim_right_ratio': 1.0, 'encdec_activation': 'ELU', 'encdec_activation_params': {'alpha': 1.0}, 'encdec_causal': False, 'encdec_channels': 1, 'encdec_compress': 2, 'encdec_dilation_base': 2, 'encdec_kernel_size': 7, 'encdec_last_kernel_size': 7, 'encdec_lstm': 2, 'encdec_n_filters': 32, 'encdec_n_residual_layers': 1, 'encdec_norm': 'weight_norm', 'encdec_norm_params': {}, 'encdec_pad_mode': 'reflect', 'encdec_ratios': [8, 5, 4, 2], 'encdec_residual_kernel_size': 7, 'encdec_true_skip': False, 'hidden_dim': 128, 'quantizer_bins': 1024, 'quantizer_decay': 0.99, 'quantizer_kmeans_init': True, 'quantizer_kmeans_iters': 50, 'quantizer_n_q': 8, 'quantizer_target_bandwidth': [7.5, 15], 'quantizer_threshold_ema_dead_code': 2}, discriminator_params: Dict[str, Any] = {'activation': 'LeakyReLU', 'activation_params': {'negative_slope': 0.3}, 'filters': 32, 'hop_lengths': [256, 512, 128, 64, 32], 'in_channels': 1, 'n_ffts': [1024, 2048, 512, 256, 128], 'norm': 'weight_norm', 'out_channels': 1, 'sep_channels': False, 'win_lengths': [1024, 2048, 512, 256, 128]}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 24000, 'log_base': None, 'n_mels': 80, 'range_end': 11, 'range_start': 6, 'window': 'hann'}, use_dual_decoder: bool = True, lambda_quantization: float = 1.0, lambda_reconstruct: float = 1.0, lambda_commit: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False, use_loss_balancer: bool = False, balance_ema_decay: float = 0.99)
Bases: SoundStream
Encodec Model for audio encoding and decoding.
This model is based on the SoundStream architecture with modifications to the discriminator and loss balancer. It is designed for efficient audio encoding and reconstruction tasks.
For more details, refer to the paper: https://arxiv.org/abs/2210.13438
discriminator
The discriminator component of the model.
Type: EncodecDiscriminator
Parameters:
- sampling_rate (int) – The sampling rate of the audio. Default is 24000.
- generator_params (Dict[str, Any]) – Parameters for the generator model.
- discriminator_params (Dict[str, Any]) – Parameters for the discriminator model.
- generator_adv_loss_params (Dict[str, Any]) – Parameters for the generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameters for the discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature matching loss. Default is True.
- feat_match_loss_params (Dict[str, Any]) – Parameters for feature matching loss.
- use_mel_loss (bool) – Whether to use mel loss. Default is True.
- mel_loss_params (Dict[str, Any]) – Parameters for mel loss.
- use_dual_decoder (bool) – Whether to use dual decoding mechanism. Default is True.
- lambda_quantization (float) – Weight for quantization loss. Default is 1.0.
- lambda_reconstruct (float) – Weight for reconstruction loss. Default is 1.0.
- lambda_commit (float) – Weight for commitment loss. Default is 1.0.
- lambda_adv (float) – Weight for adversarial loss. Default is 1.0.
- lambda_feat_match (float) – Weight for feature matching loss. Default is 2.0.
- lambda_mel (float) – Weight for mel loss. Default is 45.0.
- cache_generator_outputs (bool) – Whether to cache generator outputs. Default is False.
- use_loss_balancer (bool) – Whether to use loss balancing. Default is False.
- balance_ema_decay (float) – Exponential moving average decay for loss balancing. Default is 0.99.
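The `lambda_*` parameters weight the individual generator loss terms. As a rough illustration only, the sketch below assumes the terms are combined as a plain weighted sum when `use_loss_balancer` is False; that combination rule is an assumption, not taken from the ESPnet source. The helper `weighted_generator_loss`, the short term names, and the example loss values are all hypothetical; only the default weights come from the signature above.

```python
# Default weights copied from the Encodec signature.
DEFAULT_WEIGHTS = {
    "quantization": 1.0,  # lambda_quantization
    "reconstruct": 1.0,   # lambda_reconstruct
    "commit": 1.0,        # lambda_commit
    "adv": 1.0,           # lambda_adv
    "feat_match": 2.0,    # lambda_feat_match
    "mel": 45.0,          # lambda_mel
}


def weighted_generator_loss(losses, weights=None):
    """Return the weighted sum of the loss terms present in `losses`.

    Hypothetical helper: assumes a plain weighted sum, which may differ
    from what the model does internally (for example with the loss balancer).
    """
    if weights is None:
        weights = DEFAULT_WEIGHTS
    return sum(weights[name] * value for name, value in losses.items())


# Made-up per-term values, chosen only to show the effect of the weights.
example_losses = {"adv": 0.5, "feat_match": 0.25, "mel": 0.125, "commit": 0.5}
total = weighted_generator_loss(example_losses)
print(total)
```

With the defaults, the mel term dominates the total unless its raw value is much smaller than the other terms, which is worth keeping in mind when changing `lambda_mel`.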
Examples
from espnet2.gan_codec.encodec.encodec import Encodec
# Creating an instance of the Encodec model
model = Encodec(sampling_rate=24000, use_feat_match_loss=True)
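When overriding `generator_params`, it can help to check what a setting implies for the code frame rate and bitrate. The sketch below is plain arithmetic on the defaults from the signature. It assumes that `encdec_ratios` are the encoder's per-stage downsampling strides, that each codebook costs log2(`quantizer_bins`) bits per frame, and that all `quantizer_n_q` codebooks are used. `codec_rates` is a hypothetical helper, not part of ESPnet.

```python
import math


def codec_rates(sampling_rate, encdec_ratios, quantizer_bins, quantizer_n_q):
    """Estimate hop length, frame rate, and bitrate for a codec configuration.

    Hypothetical helper; the interpretation of each parameter is an assumption
    stated in the text above, not read from the ESPnet implementation.
    """
    # Total downsampling factor: product of the per-stage ratios.
    hop_length = math.prod(encdec_ratios)
    # Number of latent frames per second of audio.
    frame_rate = sampling_rate / hop_length
    # Bits needed to index one entry of a single codebook.
    bits_per_codebook = math.log2(quantizer_bins)
    # Bits per second if every codebook is transmitted for every frame.
    bitrate_bps = frame_rate * bits_per_codebook * quantizer_n_q
    return {
        "hop_length": hop_length,
        "frame_rate": frame_rate,
        "bits_per_codebook": bits_per_codebook,
        "bitrate_bps": bitrate_bps,
    }


# Defaults from the Encodec signature.
rates = codec_rates(
    sampling_rate=24000,
    encdec_ratios=[8, 5, 4, 2],
    quantizer_bins=1024,
    quantizer_n_q=8,
)
print(rates)
```

Under these assumptions the defaults give a hop of 320 samples, 75 frames per second, and 6000 bits per second with all eight codebooks.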
Initialize SoundStream model.
- Parameters: TODO (jiatong)