espnet2.gan_svs.visinger2.visinger2_vocoder.BaseFrequenceDiscriminator
class espnet2.gan_svs.visinger2.visinger2_vocoder.BaseFrequenceDiscriminator(in_channels, hidden_channels=512, divisors=[32, 16, 8, 4, 2, 1, 1], strides=[1, 2, 1, 2, 1, 2, 1])
Bases: Module
Base Frequency Discriminator.
This class implements a base frequency discriminator used for evaluating the quality of generated audio signals by comparing them against real audio samples. The discriminator is composed of multiple layers that progressively reduce the dimensionality of the input through convolutional operations.
- Parameters:
- in_channels (int) – Number of input channels.
- hidden_channels (int, optional) – Number of channels in hidden layers. Defaults to 512.
- divisors (List[int], optional) – List of divisors for the number of channels in each layer. The length of the list determines the number of layers. Defaults to [32, 16, 8, 4, 2, 1, 1].
- strides (List[int], optional) – List of stride values for each layer. The length of the list determines the number of layers. Defaults to [1, 2, 1, 2, 1, 2, 1].
####### Examples
>>> discriminator = BaseFrequenceDiscriminator(in_channels=1)
>>> input_tensor = torch.randn(8, 1, 128, 128) # (B, C, H, W)
>>> outputs = discriminator(input_tensor)
>>> len(outputs)  # one output per layer
7
>>> [o.shape[1] for o in outputs]  # hidden_channels // divisor for each layer
[16, 32, 64, 128, 256, 512, 512]
- Returns: List of output tensors from each layer of the discriminator, where the first tensor corresponds to the output of the first layer, and so on.
- Return type: List[torch.Tensor]
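The divisor-to-channel mapping described above can be sketched as follows. This is a minimal sketch assuming each layer's output width is hidden_channels divided by its divisor, as the parameter description states; the helper name is hypothetical, not part of the ESPnet API:

```python
def layer_channels(hidden_channels=512, divisors=(32, 16, 8, 4, 2, 1, 1)):
    """Per-layer output channel counts: hidden_channels // divisor (assumed rule)."""
    return [hidden_channels // d for d in divisors]

# With the defaults, channel width grows as the divisors shrink.
print(layer_channels())  # [16, 32, 64, 128, 256, 512, 512]
```

Because the divisor list shrinks toward 1, the feature maps widen as the spatial resolution is reduced by the strided layers.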
forward(x)
Calculate forward propagation.
This method computes the forward pass of the frequency discriminator. The input is passed through each convolutional layer in turn, and the output of every layer is collected, yielding feature maps at progressively coarser resolutions.
- Parameters: x (Tensor) – Input tensor (B, in_channels, H, W), e.g. a spectrogram-like representation of the audio.
- Returns: List of output tensors, one per layer of the discriminator, where the first tensor corresponds to the output of the first layer, and so on.
- Return type: List[torch.Tensor]
####### Examples
>>> discriminator = BaseFrequenceDiscriminator(in_channels=1)
>>> x = torch.randn(8, 1, 128, 128)  # (B, C, H, W)
>>> outputs = discriminator(x)
>>> len(outputs)  # one output per layer
7
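The multi-scale output list returned here follows a common per-layer collection pattern. A minimal, framework-free sketch of that pattern (forward_collect and the toy halving "layers" are hypothetical names for illustration, not part of the ESPnet API):

```python
def forward_collect(x, layers):
    # Apply each layer in turn and record every intermediate output,
    # mirroring how the discriminator returns one tensor per layer.
    outs = []
    for layer in layers:
        x = layer(x)
        outs.append(x)
    return outs

# Toy "layers": each halves a spatial size, mimicking a stride-2 conv stage.
halve = lambda size: size // 2
print(forward_collect(128, [halve, halve, halve]))  # [64, 32, 16]
```

Returning every intermediate feature map (rather than only the final one) is what enables feature-matching losses between real and generated audio at multiple resolutions.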