espnet2.gan_svs.avocodo.avocodo.AvocodoDiscriminatorPlus
class espnet2.gan_svs.avocodo.avocodo.AvocodoDiscriminatorPlus(combd: Dict[str, Any] = {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, sbd: Dict[str, Any] = {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'segment_size': 8192, 'use_sbd': True}, pqmf_config: Dict[str, Any] = {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, projection_filters: List[int] = [0, 1, 1, 1], sample_rate: int = 22050, multi_freq_disc_params: Dict[str, Any] = {'divisors': [32, 16, 8, 4, 2, 1, 1], 'domain': 'double', 'hidden_channels': [256, 512, 512], 'hop_length_factors': [4, 8, 16], 'mel_scale': True, 'strides': [1, 2, 1, 2, 1, 2, 1]})
Bases: Module
Avocodo discriminator with additional multi-frequency discriminator.
This class extends the Avocodo Discriminator by incorporating a Multi-Frequency Discriminator (MFD) for enhanced feature extraction from audio signals. It combines outputs from the Collaborative Multi-band Discriminator (CoMBD), Sub-band Discriminator (SBD), and the MFD to produce a more comprehensive analysis of real and generated audio data.
pqmf_lv2
PQMF object for level 2 processing.
- Type: PQMF
pqmf_lv1
PQMF object for level 1 processing.
- Type: PQMF
combd
Instance of the Collaborative Multi-band Discriminator.
- Type: CoMBD
sbd
Instance of the Sub-band Discriminator.
- Type: SBD
mfd
Instance of the Multi-Frequency Discriminator.
projection_filters
Filters for the projection layers.
- Type: List[int]
Parameters:
- combd (Dict[str, Any]) – Configuration parameters for CoMBD.
- sbd (Dict[str, Any]) – Configuration parameters for SBD.
- pqmf_config (Dict[str, Any]) – Configuration for PQMF.
- projection_filters (List[int]) – Projection filters for the output layers.
- sample_rate (int) – Sample rate of the audio signals.
- multi_freq_disc_params (Dict[str, Any]) – Parameters for the MFD.
Returns: A list containing outputs and feature maps from the discriminators.
Return type: List[List[torch.Tensor]]
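Note that, as with any Python default argument, a dictionary you pass for combd, sbd, pqmf_config, or multi_freq_disc_params replaces the default wholesale rather than being merged with it, so an overridden dictionary must contain every key shown in the signature. The scalar and flat-list arguments can be overridden independently; a minimal construction sketch (the 24 kHz rate is illustrative, not a recipe default):
>>> from espnet2.gan_svs.avocodo.avocodo import AvocodoDiscriminatorPlus
>>> # Override only a scalar argument; all dict arguments keep their defaults.
>>> discriminator = AvocodoDiscriminatorPlus(sample_rate=24000)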
Examples
>>> import torch
>>> from espnet2.gan_svs.avocodo.avocodo import AvocodoDiscriminatorPlus
>>> discriminator = AvocodoDiscriminatorPlus()
>>> real_audio = torch.randn(1, 1, 8192)  # Example real audio tensor
>>> fake_audio = torch.randn(1, 1, 8192)  # Example generated audio tensor
>>> outputs_real, outputs_fake, fmaps_real, fmaps_fake = discriminator(real_audio, fake_audio)
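The returned score lists plug directly into a standard adversarial objective. Below is a minimal sketch of a least-squares (LSGAN-style) discriminator loss over the per-discriminator outputs; the formulation is illustrative and not necessarily the loss module used in ESPnet recipes:
>>> import torch.nn.functional as F
>>> # Each element of outputs_real / outputs_fake is one sub-discriminator's
>>> # score tensor; push real scores toward 1 and fake scores toward 0.
>>> disc_loss = sum(
...     F.mse_loss(r, torch.ones_like(r)) + F.mse_loss(f, torch.zeros_like(f))
...     for r, f in zip(outputs_real, outputs_fake)
... )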
NOTE
The class builds on torch (PyTorch) modules and requires input tensors of shape (B, C, T), where B is the batch size, C is the number of channels, and T is the number of time steps.
forward(y: Tensor, y_hats: Tensor) → List[List[Tensor]]
Perform forward propagation through the AvocodoDiscriminatorPlus.
This method passes the ground-truth and generated signals through the Collaborative Multi-band Discriminator (CoMBD), the Sub-band Discriminator (SBD), and the Multi-Frequency Discriminator (MFD), producing the corresponding output tensors and feature maps for both real and fake inputs.
- Parameters:
- y (torch.Tensor) – Ground truth signal tensor of shape (B, C, T).
- y_hats (torch.Tensor) – Predicted signal tensor of shape (B, C, T).
- Returns: A list containing:
  - outs_real (List[Tensor]): Output tensors for real signals.
  - outs_fake (List[Tensor]): Output tensors for fake signals.
  - fmaps_real (List[List[Tensor]]): Feature maps for real signals at each layer.
  - fmaps_fake (List[List[Tensor]]): Feature maps for fake signals at each layer.
- Return type: List[List[torch.Tensor]]
Examples
>>> discriminator = AvocodoDiscriminatorPlus()
>>> real_signal = torch.randn(1, 1, 8192)  # Example real signal
>>> fake_signal = torch.randn(1, 1, 8192)  # Example fake signal
>>> outs_real, outs_fake, fmaps_real, fmaps_fake = discriminator(real_signal, fake_signal)
NOTE
The output tensors are produced by analyzing the input signals through multiple complementary discriminators, which allows for a more comprehensive assessment of the audio signals.
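For generator training, the feature maps returned alongside the scores support a feature-matching term. Below is a minimal sketch of an L1 feature-matching loss over the nested lists, assuming fmaps_real and fmaps_fake are aligned per sub-discriminator and per layer as documented above (again illustrative, not ESPnet's own loss module):
>>> import torch.nn.functional as F
>>> # Outer zip pairs sub-discriminators; inner zip pairs their layers.
>>> fm_loss = sum(
...     F.l1_loss(f, r.detach())
...     for fr, ff in zip(fmaps_real, fmaps_fake)
...     for r, f in zip(fr, ff)
... )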