espnet2.tts.prodiff.denoiser.SpectogramDenoiser

class espnet2.tts.prodiff.denoiser.SpectogramDenoiser(idim: int, adim: int = 256, layers: int = 20, channels: int = 256, cycle_length: int = 1, timesteps: int = 200, timescale: int = 1, max_beta: float = 40.0, scheduler: str = 'vpsde', dropout_rate: float = 0.05)

Bases: Module

Spectrogram Denoiser.

This class implements a diffusion-based denoiser for spectrograms. It combines a stack of residual blocks with a configurable noise schedule to recover a clean spectrogram from a noisy one.

Reference: https://arxiv.org/pdf/2207.06389.pdf

idim

Dimension of the inputs.

Type: int

timesteps

Number of timesteps for the diffusion process.

Type: int

scale

Timescale for the diffusion process.

Type: int

num_layers

Number of layers in the denoising model.

Type: int

channels

Number of channels for each layer.

Type: int

in_proj

Convolutional layer for input projection.

Type: nn.Conv1d

denoiser_pos

Positional encoding for denoiser.

Type: PositionalEncoding

denoiser_mlp

Multi-layer perceptron for denoising.

Type: nn.Sequential

denoiser_res

List of residual blocks for denoising.

Type: nn.ModuleList

skip_proj

Convolutional layer for skip connection projection.

Type: nn.Conv1d

feats_out

Convolutional layer for output feature projection.

Type: nn.Conv1d

betas

Beta values for noise scheduling.

Type: torch.Tensor

alphas_cumulative

Cumulative product of alpha values.

Type: torch.Tensor

min_alphas_cumulative

Minimum cumulative alpha values.

Type: torch.Tensor
Parameters:
- idim (int) – Dimension of the inputs.
- adim (int, optional) – Dimension of the hidden states. Defaults to 256.
- layers (int, optional) – Number of layers. Defaults to 20.
- channels (int, optional) – Number of channels of each layer. Defaults to 256.
- cycle_length (int, optional) – Cycle length of the diffusion. Defaults to 1.
- timesteps (int, optional) – Number of timesteps of the diffusion. Defaults to 200.
- timescale (int, optional) – Timescale of the diffusion process. Defaults to 1.
- max_beta (float, optional) – Maximum beta value for the scheduler. Defaults to 40.0.
- scheduler (str, optional) – Type of noise scheduler. Defaults to “vpsde”.
- dropout_rate (float, optional) – Dropout rate. Defaults to 0.05.
Returns: Denoised output tensor.
Return type: torch.Tensor

Examples

>>> model = SpectogramDenoiser(idim=80)
>>> input_tensor = torch.randn(16, 80, 100)  # (batch, idim, time)
>>> output_tensor = model(input_tensor)

Raises: NotImplementedError – If an unsupported noise scheduler type is provided.
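
To illustrate how a scheduler name could map to beta values, and why an unknown name ends in NotImplementedError, here is a minimal sketch in plain Python. The helper names make_betas and cumulative_alphas, the min_beta value of 0.1, and the range used for the "linear" branch are assumptions made for illustration and are not taken from ESPnet. The "vpsde" branch uses the standard discretization of a variance-preserving SDE whose beta grows linearly from min_beta to max_beta.

```python
import math


def make_betas(scheduler, timesteps, min_beta=0.1, max_beta=40.0):
    """Return one beta per diffusion step (hypothetical helper, not ESPnet's API)."""
    if scheduler == "linear":
        # Evenly spaced betas; the 1e-4 to 0.02 range is an assumed default.
        lo, hi = 1e-4, 0.02
        return [lo + (hi - lo) * i / max(timesteps - 1, 1) for i in range(timesteps)]
    if scheduler == "vpsde":
        # beta_i = 1 - abar(t_i) / abar(t_{i-1}) with t_i = i / timesteps, where
        # abar(t) = exp(-min_beta * t - 0.5 * (max_beta - min_beta) * t**2).
        return [
            1.0 - math.exp(
                -min_beta / timesteps
                - 0.5 * (max_beta - min_beta) * (2 * i - 1) / timesteps**2
            )
            for i in range(1, timesteps + 1)
        ]
    raise NotImplementedError(f"unsupported scheduler: {scheduler}")


def cumulative_alphas(betas):
    """Running product of alpha = 1 - beta over the steps."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


betas = make_betas("vpsde", timesteps=200)
alphas_cumulative = cumulative_alphas(betas)
```

Under these assumed settings the cumulative product starts close to 1 and decays toward 0 as the step index grows, so the last step is close to pure Gaussian noise.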

NOTE

The denoiser can be used in both training and inference modes. During inference, it generates denoised output from noisy input using a reverse diffusion process.

diffusion(xs_ref: Tensor, steps: Tensor, noise: Tensor | None = None) → Tensor

Spectrogram Denoiser.

This class implements a denoising model for spectrograms using a diffusion process. The model takes noisy spectrogram inputs and generates cleaner outputs, removing noise while preserving relevant features.

Reference: https://arxiv.org/pdf/2207.06389.pdf

Returns: Output tensor of the denoising process.
Return type: torch.Tensor

Examples

>>> model = SpectogramDenoiser(idim=128)
>>> noisy_input = torch.randn(32, 128, 100)  # (batch_size, dims, time)
>>> denoised_output = model(noisy_input)

NOTE

This model utilizes a diffusion process for denoising and can be adapted to various noise schedules and configurations based on the task requirements.
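
The diffusion(xs_ref, steps, noise) signature suggests the usual closed-form forward process, in which a clean reference is mixed with Gaussian noise according to the cumulative alpha product at the chosen step. The sketch below applies that formula to a flat list of values. The function name diffuse and its arguments are hypothetical, and the actual method may scale or reshape its tensors differently.

```python
import math
import random


def diffuse(x0, abar_t, noise=None, rng=random):
    """Noise clean values x0 to the step whose cumulative alpha product is abar_t."""
    if noise is None:
        # Sample standard Gaussian noise when the caller does not supply any.
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(abar_t)
    noise_scale = math.sqrt(1.0 - abar_t)
    return [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]


clean = [0.5, -1.0, 2.0]
eps = [1.0, 0.0, -1.0]
unchanged = diffuse(clean, abar_t=1.0, noise=eps)   # no noise mixed in
pure_noise = diffuse(clean, abar_t=0.0, noise=eps)  # signal fully replaced by noise
```

Passing the noise explicitly, as the optional noise argument allows, makes the result reproducible, which is convenient in tests.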

forward(xs: Tensor, ys: Tensor | None = None, masks: Tensor | None = None, is_inference: bool = False) → Tensor

Calculate forward propagation.

This method performs the forward pass of the Spectrogram Denoiser. If is_inference is True, it runs the inference procedure; otherwise it runs the denoising process on the provided input and conditioning tensors.

Parameters:
- xs (torch.Tensor) – Phoneme-encoded tensor (#batch, time, dims).
- ys (Optional[torch.Tensor], optional) – Mel-based reference tensor (#batch, time, mels). Defaults to None.
- masks (Optional[torch.Tensor], optional) – Mask tensor (#batch, time). Defaults to None.
- is_inference (bool, optional) – Flag to indicate inference mode. Defaults to False.
Returns: Output tensor (#batch, time, dims).
Return type: torch.Tensor

Examples

>>> model = SpectogramDenoiser(idim=80)
>>> phoneme_encoded = torch.randn(32, 100, 80)  # Example input
>>> output = model.forward(phoneme_encoded)  # Denoising
>>> output_inference = model.forward(phoneme_encoded, is_inference=True)
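
As an illustration of the masks argument, the following sketch builds a (#batch, time) mask from sequence lengths and uses it to zero out padded frames. It assumes that 1.0 marks a real frame and 0.0 marks padding. length_mask and apply_mask are hypothetical helpers, written with nested lists so that the example has no dependencies.

```python
def length_mask(lengths, max_len=None):
    """Build a (#batch, time) mask: 1.0 for real frames, 0.0 for padding."""
    max_len = max(lengths) if max_len is None else max_len
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def apply_mask(batch, mask):
    """Zero the padded frames of a (#batch, time, dims) nested list."""
    return [
        [[value * keep for value in frame] for frame, keep in zip(seq, row)]
        for seq, row in zip(batch, mask)
    ]


masks = length_mask([3, 1])                 # second sequence has 2 padded frames
ys = [[[1.0, 2.0]] * 3, [[3.0, 4.0]] * 3]   # (#batch=2, time=3, dims=2)
masked = apply_mask(ys, masks)
```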

forward_denoise(xs_noisy: Tensor, step: Tensor, condition: Tensor) → Tensor

Calculate forward for denoising diffusion.

This method processes the noisy input tensor using the denoising diffusion technique, leveraging conditioning information to enhance the output quality.

Parameters:
- xs_noisy (torch.Tensor) – Input tensor containing noisy data.
- step (torch.Tensor) – Diffusion step of each input, which indicates its noise level.
- condition (torch.Tensor) – Conditioning tensor that provides additional context to the denoising process.
Returns: Denoised tensor, which has been processed to reduce noise and improve signal quality.
Return type: torch.Tensor

Examples

>>> xs_noisy = torch.randn(10, 256)  # Example noisy input
>>> step = torch.tensor([5] * 10)    # Example diffusion step
>>> condition = torch.randn(10, 256)  # Example conditioning input
>>> denoised_output = model.forward_denoise(xs_noisy, step, condition)
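
Because denoiser_pos is a positional encoding, the diffusion step is presumably turned into a vector before it conditions the residual blocks. The following is a minimal sketch of a sinusoidal step embedding, assuming the common sine/cosine formulation with a base of 10000. step_embedding is a hypothetical helper, and the actual PositionalEncoding layer may order or scale its channels differently.

```python
import math


def step_embedding(step, channels):
    """Sinusoidal embedding of one diffusion step index (channels must be even)."""
    half = channels // 2
    # Geometrically spaced frequencies from 1 down toward 1/10000.
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


emb0 = step_embedding(0, channels=8)  # sine half is all 0.0, cosine half all 1.0
emb5 = step_embedding(5, channels=8)
```

Nearby steps get similar vectors and distant steps get clearly different ones, which gives the network a smooth signal for the current noise level.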

inference(condition: Tensor) → Tensor

Calculate forward during inference.

This method implements the inference process of the Spectrogram Denoiser. It samples a noise tensor and iteratively denoises it using the provided conditioning tensor, following the reverse diffusion process.

Parameters:
- condition (torch.Tensor) – Conditioning tensor (batch, time, dims).
Returns: Output tensor, which is the denoised result after processing through the model.
Return type: torch.Tensor

Examples

>>> denoiser = SpectogramDenoiser(idim=80)
>>> condition = torch.randn(32, 100, 80)  # Example conditioning input
>>> output = denoiser.inference(condition)
>>> print(output.shape)
torch.Size([32, 100, 80])
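
The reverse diffusion loop described for inference can be sketched as follows, assuming the network predicts the clean signal at every step and that the standard Gaussian posterior is used to move from step t to step t-1. The names reverse_diffusion and predict_x0 are placeholders for illustration, not ESPnet functions, and the toy predictor simply echoes the conditioning values.

```python
import math
import random


def reverse_diffusion(predict_x0, condition, betas, size, seed=0):
    """Iteratively denoise Gaussian noise with a placeholder clean-signal predictor."""
    rng = random.Random(seed)
    abar = [1.0]  # abar[0] = 1.0 stands for the clean, noise-free signal
    for beta in betas:
        abar.append(abar[-1] * (1.0 - beta))

    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from pure noise
    for t in range(len(betas), 0, -1):
        beta_t = betas[t - 1]
        x0_hat = predict_x0(x, t, condition)
        # Mean and standard deviation of the posterior q(x_{t-1} | x_t, x0_hat).
        c0 = math.sqrt(abar[t - 1]) * beta_t / (1.0 - abar[t])
        ct = math.sqrt(1.0 - beta_t) * (1.0 - abar[t - 1]) / (1.0 - abar[t])
        std = math.sqrt(beta_t * (1.0 - abar[t - 1]) / (1.0 - abar[t]))
        # At t == 1 the standard deviation is zero, so the last step is deterministic.
        x = [c0 * h + ct * v + std * rng.gauss(0.0, 1.0) for h, v in zip(x0_hat, x)]
    return x


# Toy predictor: treat the conditioning values as the clean estimate.
target = [0.3, -0.7, 1.2]
sample = reverse_diffusion(
    lambda x, t, cond: cond, target, betas=[0.05] * 50, size=len(target)
)
```

With this toy predictor the final sample matches the conditioning values up to floating-point error, because the last step puts all of its weight on the predicted clean signal.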