espnet2.tts.prodiff.denoiser.SpectogramDenoiser
class espnet2.tts.prodiff.denoiser.SpectogramDenoiser(idim: int, adim: int = 256, layers: int = 20, channels: int = 256, cycle_length: int = 1, timesteps: int = 200, timescale: int = 1, max_beta: float = 40.0, scheduler: str = 'vpsde', dropout_rate: float = 0.05)
Bases: Module
Spectrogram denoiser.
This class implements the diffusion-based spectrogram denoiser of ProDiff. It combines a stack of residual blocks with a noise schedule to remove, step by step, the noise introduced by the diffusion process.
Reference: https://arxiv.org/pdf/2207.06389.pdf
idim
Dimension of the inputs.
- Type: int
timesteps
Number of timesteps for the diffusion process.
- Type: int
scale
Timescale for the diffusion process.
- Type: int
num_layers
Number of layers in the denoising model.
- Type: int
channels
Number of channels for each layer.
- Type: int
in_proj
Convolutional layer for input projection.
- Type: nn.Conv1d
denoiser_pos
Positional encoding for denoiser.
- Type: PositionalEncoding
denoiser_mlp
Multi-layer perceptron for denoising.
- Type: nn.Sequential
denoiser_res
List of residual blocks for denoising.
- Type: nn.ModuleList
skip_proj
Convolutional layer for skip connection projection.
- Type: nn.Conv1d
feats_out
Convolutional layer for output feature projection.
- Type: nn.Conv1d
betas
Beta values for noise scheduling.
- Type: torch.Tensor
alphas_cumulative
Cumulative product of alpha values.
- Type: torch.Tensor
min_alphas_cumulative
Minimum cumulative alpha values.
- Type: torch.Tensor
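The relationship between `betas` and the cumulative alpha terms can be sketched in plain Python. This is a minimal illustration of the standard diffusion bookkeeping, not the class's own code: the helper name `cumulative_alphas` and the toy beta values are hypothetical, and whether the stored buffers hold the products themselves or their square roots is an assumption to check against the source.

```python
import math


def cumulative_alphas(betas):
    """Return (alpha_bar, sqrt_alpha_bar, sqrt_one_minus_alpha_bar) lists.

    alpha_t = 1 - beta_t and alpha_bar_t = alpha_1 * ... * alpha_t.
    """
    alpha_bar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    sqrt_alpha_bar = [math.sqrt(a) for a in alpha_bar]
    sqrt_one_minus = [math.sqrt(1.0 - a) for a in alpha_bar]
    return alpha_bar, sqrt_alpha_bar, sqrt_one_minus


# Toy schedule (hypothetical values, chosen only for illustration).
betas = [0.1, 0.2, 0.3]
alpha_bar, signal_scale, noise_scale = cumulative_alphas(betas)
print(alpha_bar)
```

The two square-root tables are the per-step weights of the clean signal and of the noise, and their squares always sum to one.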
Parameters:
- idim (int) – Dimension of the inputs.
- adim (int, optional) – Dimension of the hidden states. Defaults to 256.
- layers (int, optional) – Number of layers. Defaults to 20.
- channels (int, optional) – Number of channels of each layer. Defaults to 256.
- cycle_length (int, optional) – Cycle length of the diffusion. Defaults to 1.
- timesteps (int, optional) – Number of timesteps of the diffusion. Defaults to 200.
- timescale (int, optional) – Timescale of the diffusion process. Defaults to 1.
- max_beta (float, optional) – Maximum beta value for scheduler. Defaults to 40.0.
- scheduler (str, optional) – Type of noise scheduler. Defaults to “vpsde”.
- dropout_rate (float, optional) – Dropout rate. Defaults to 0.05.
Returns: Denoised output tensor.
Return type: torch.Tensor
Examples
>>> model = SpectogramDenoiser(idim=80)
>>> condition = torch.randn(1, 100, 256)  # (batch, time, adim)
>>> output_tensor = model(condition, is_inference=True)
- Raises: NotImplementedError – If an unsupported noise scheduler type is provided.
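The error behaviour described above can be illustrated with a small dispatcher. This is a sketch under stated assumptions, not the library's scheduler: the function name `toy_noise_schedule`, the two schedule names it accepts, and the constants are hypothetical textbook choices, and the default "vpsde" schedule is deliberately left out because its exact form is not given here.

```python
import math


def toy_noise_schedule(name, timesteps):
    """Return a list of `timesteps` beta values for a named schedule.

    Hypothetical helper: only two textbook schedules are sketched here.
    """
    if name == "linear":
        start, end = 1e-4, 0.02  # common DDPM defaults, assumed here
        if timesteps == 1:
            return [start]
        return [start + (end - start) * i / (timesteps - 1) for i in range(timesteps)]
    if name == "cosine":
        s = 0.008  # small offset used in the cosine-schedule literature

        def alpha_bar(t):
            return math.cos((t / timesteps + s) / (1 + s) * math.pi / 2) ** 2

        # beta_t = 1 - alpha_bar(t + 1) / alpha_bar(t), clipped below 1.
        return [
            min(1.0 - alpha_bar(t + 1) / alpha_bar(t), 0.999)
            for t in range(timesteps)
        ]
    # Any other name is rejected, mirroring the documented behaviour.
    raise NotImplementedError(f"unsupported noise scheduler: {name!r}")
```

Failing fast on an unknown name keeps a typo in a configuration file from silently selecting a different schedule.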
NOTE
The denoiser can be used in both training and inference modes. During inference, it generates denoised output from noisy input using a reverse diffusion process.
diffusion(xs_ref: Tensor, steps: Tensor, noise: Tensor | None = None) → Tensor
Calculate the forward diffusion process.
This method adds noise to the reference tensor at the noise level selected by the given diffusion steps.
- Parameters:
- xs_ref (torch.Tensor) – Reference tensor.
- steps (torch.Tensor) – Diffusion step index for each batch item.
- noise (Optional[torch.Tensor], optional) – Noise tensor. Defaults to None.
- Returns: Noised tensor.
- Return type: torch.Tensor
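A forward-diffusion step with the `diffusion(xs_ref, steps, noise)` signature is conventionally the closed-form "q-sample" of diffusion models. The following plain-Python sketch shows that standard formula on a single frame. It is an assumption that the method follows it exactly, and the function name `q_sample` and the toy tables are hypothetical.

```python
import math
import random


def q_sample(x_ref, step, sqrt_alpha_bar, sqrt_one_minus_alpha_bar, noise=None, rng=random):
    """Noise a reference frame: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        # Draw standard Gaussian noise when the caller supplies none.
        noise = [rng.gauss(0.0, 1.0) for _ in x_ref]
    signal = sqrt_alpha_bar[step]
    sigma = sqrt_one_minus_alpha_bar[step]
    return [signal * x + sigma * n for x, n in zip(x_ref, noise)]


# Toy tables for three steps (hypothetical values).
alpha_bar = [0.9, 0.5, 0.1]
sqrt_ab = [math.sqrt(a) for a in alpha_bar]
sqrt_1m = [math.sqrt(1.0 - a) for a in alpha_bar]

x_ref = [1.0, -2.0, 0.5]
eps = [0.0, 0.0, 0.0]
# With zero noise the output is just the reference scaled by sqrt(alpha_bar).
x_scaled = q_sample(x_ref, 1, sqrt_ab, sqrt_1m, noise=eps)
# Without an explicit noise argument, Gaussian noise is sampled.
x_noisy = q_sample(x_ref, 2, sqrt_ab, sqrt_1m, rng=random.Random(0))
```

Passing the noise explicitly, as the optional `noise` argument allows, makes the step deterministic, which is convenient in tests.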
forward(xs: Tensor, ys: Tensor | None = None, masks: Tensor | None = None, is_inference: bool = False) → Tensor
Calculate forward propagation.
This method performs the forward pass of the spectrogram denoiser. If is_inference is True, it runs the reverse diffusion process conditioned on xs. Otherwise it runs the training path, which diffuses the reference ys and denoises it conditioned on xs, so ys must be provided.
- Parameters:
- xs (torch.Tensor) – Phoneme-encoded tensor (#batch, time, dims).
- ys (Optional[torch.Tensor], optional) – Mel-based reference tensor (#batch, time, mels). Defaults to None.
- masks (Optional[torch.Tensor], optional) – Mask tensor (#batch, time). Defaults to None.
- is_inference (bool , optional) – Flag to indicate inference mode. Defaults to False.
- Returns: Output tensor (#batch, time, dims).
- Return type: torch.Tensor
Examples
>>> model = SpectogramDenoiser(idim=80)
>>> phoneme_encoded = torch.randn(1, 100, 256)  # (batch, time, adim)
>>> output_inference = model.forward(phoneme_encoded, is_inference=True)
>>> # Training mode also needs the reference and mask: model(xs, ys, masks)
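The role of the `masks` argument can be illustrated with nested lists. This is a hedged sketch of ordinary padding masks, not the library's code: the helper names `non_pad_mask` and `apply_mask` are hypothetical, and the convention that 1.0 marks a real frame and 0.0 marks padding is an assumption.

```python
def non_pad_mask(lengths, max_len=None):
    """Return a (batch, time) nested list with 1.0 for real frames, 0.0 for padding."""
    if max_len is None:
        max_len = max(lengths)
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def apply_mask(frames, mask):
    """Zero out padded frames of a (batch, time, dims) nested list."""
    return [
        [[value * m for value in frame] for frame, m in zip(seq, seq_mask)]
        for seq, seq_mask in zip(frames, mask)
    ]


lengths = [3, 1]  # the second sequence has two padded frames
mask = non_pad_mask(lengths)
frames = [[[1.0, 2.0]] * 3, [[3.0, 4.0]] * 3]  # (batch=2, time=3, dims=2)
masked = apply_mask(frames, mask)
```

Masking the noised and denoised tensors this way keeps padded frames from contributing to the training loss.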
forward_denoise(xs_noisy: Tensor, step: Tensor, condition: Tensor) → Tensor
Calculate forward for denoising diffusion.
This method runs the denoising network on a noisy input at the given diffusion step, using the conditioning tensor to guide the prediction.
- Parameters:
- xs_noisy (torch.Tensor) – Input tensor containing noisy data.
- step (torch.Tensor) – Diffusion step index for each batch item, which indicates the level of noise in the input.
- condition (torch.Tensor) – Conditioning tensor that provides additional context to the denoising process.
- Returns: Denoised tensor.
- Return type: torch.Tensor
Examples
>>> model = SpectogramDenoiser(idim=80)
>>> xs_noisy = torch.randn(10, 1, 80, 100)  # (batch, 1, idim, time)
>>> step = torch.full((10,), 5, dtype=torch.long)  # one step index per item
>>> condition = torch.randn(10, 100, 256)  # (batch, time, adim)
>>> denoised_output = model.forward_denoise(xs_noisy, step, condition)
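Diffusion denoisers usually turn the integer `step` into a vector before mixing it into the network, and a sinusoidal positional encoding is the common choice. The sketch below shows that encoding in plain Python. The function name `sinusoidal_embedding` is hypothetical, and it is an assumption that `denoiser_pos` and `denoiser_mlp` process the step in this way.

```python
import math


def sinusoidal_embedding(step, dim):
    """Return a length-`dim` sinusoidal embedding of an integer step (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(0, dim, 2):
        # Frequencies decay geometrically from 1 down to about 1/10000.
        freq = math.exp(-math.log(10000.0) * i / dim)
        emb.append(math.sin(step * freq))
        emb.append(math.cos(step * freq))
    return emb


emb0 = sinusoidal_embedding(0, 8)
emb5 = sinusoidal_embedding(5, 8)
```

Because nearby steps map to nearby vectors, a small MLP on top of the embedding can learn a smooth dependence on the noise level.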
inference(condition: Tensor) → Tensor
Calculate forward during inference.
This method implements the inference process of the spectrogram denoiser. It samples a random-noise tensor and iteratively denoises it using the provided conditioning tensor, following the reverse diffusion process.
- Parameters: condition (torch.Tensor) – Conditioning tensor (batch, time, dims).
- Returns: Output tensor, which is the denoised result after processing through the model.
- Return type: torch.Tensor
Examples
>>> denoiser = SpectogramDenoiser(idim=80)
>>> condition = torch.randn(1, 100, 256)  # (batch, time, adim)
>>> output = denoiser.inference(condition)
>>> print(output.shape)
torch.Size([1, 100, 80])
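The reverse diffusion loop can be sketched on a single scalar. This is a minimal illustration, not the method's implementation: it assumes the network predicts the clean sample directly (the parameterization the ProDiff paper advocates), uses a toy linear schedule instead of "vpsde", and replaces the network with an oracle. The names `linear_betas`, `reverse_diffusion`, and `predict_x0` are hypothetical.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Toy linear beta schedule (assumed; the class defaults to 'vpsde')."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def reverse_diffusion(predict_x0, betas, rng):
    """Start from Gaussian noise and apply the DDPM posterior step by step."""
    alphas = [1.0 - b for b in betas]
    alpha_bar, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bar.append(prod)
    alpha_bar_prev = [1.0] + alpha_bar[:-1]

    x = rng.gauss(0.0, 1.0)  # pure noise at the last step
    for t in reversed(range(len(betas))):
        x0_hat = predict_x0(x, t)  # network output in a real model
        denom = 1.0 - alpha_bar[t]
        # Posterior mean mixes the predicted clean sample and the current sample.
        coef_x0 = betas[t] * math.sqrt(alpha_bar_prev[t]) / denom
        coef_xt = (1.0 - alpha_bar_prev[t]) * math.sqrt(alphas[t]) / denom
        mean = coef_x0 * x0_hat + coef_xt * x
        var = betas[t] * (1.0 - alpha_bar_prev[t]) / denom
        # No noise is added at the final step (t == 0).
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(var) * noise
    return x


target = 0.5
sample = reverse_diffusion(lambda x, t: target, linear_betas(50), random.Random(0))
```

With an oracle that always returns the target, the final step has zero posterior variance and the loop lands on the target. A trained network plays the role of `predict_x0` and receives the conditioning tensor as an extra input.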