espnet2.enh.layers.dcunet.DiffusionStepEmbedding

class espnet2.enh.layers.dcunet.DiffusionStepEmbedding(embed_dim, complex_valued=False)

Bases: Module

Diffusion-Step embedding as in DiffWave / Vaswani et al. 2017.

This module maps a scalar diffusion step (time) t to a vector of sinusoidal features that a network can be conditioned on. It supports both real-valued and complex-valued embeddings.

complex_valued

Indicates if the embedding is complex-valued.

Type: bool

embed_dim

The dimension of the embedding.

Type: int
Parameters:
- embed_dim (int) – The dimension of the embedding.
- complex_valued (bool, optional) – If True, the embedding will be complex-valued. Defaults to False.
Returns: The computed diffusion-step embedding.
Return type: Tensor

####### Examples

>>> import torch
>>> from espnet2.enh.layers.dcunet import DiffusionStepEmbedding
>>> # Real-valued embedding
>>> embedding = DiffusionStepEmbedding(embed_dim=128)
>>> t = torch.tensor([0.1, 0.2, 0.3])
>>> output = embedding(t)
>>> output.shape
torch.Size([3, 128])

>>> # Complex-valued embedding
>>> embedding_complex = DiffusionStepEmbedding(embed_dim=128, complex_valued=True)
>>> output_complex = embedding_complex(t)
>>> output_complex.shape
torch.Size([3, 128])

NOTE

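When the output is real-valued, the sine and cosine features are concatenated to avoid ambiguities, so the effective embedding dimension (the number of sinusoidal frequencies) is halved to embed_dim // 2. The output width is still embed_dim in both the real and the complex case.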
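The halving described in the note can be illustrated with a small, dependency-free sketch. It is not the ESPnet implementation: the function name `step_embedding`, the parameter `max_log10_freq`, and the log-spaced frequency schedule are assumptions made for illustration. The point it shows is that a sine alone cannot distinguish a phase p from π − p, so a real-valued embedding stores a sine and a cosine per frequency, while a complex-valued embedding carries both in a single number.

```python
import cmath
import math


def step_embedding(t, embed_dim, complex_valued=False, max_log10_freq=4.0):
    """Sinusoidal embedding of one scalar diffusion step t (illustrative only).

    The log-spaced frequency schedule and max_log10_freq are assumptions of
    this sketch, not values taken from the ESPnet source.
    """
    # Real output stores sin and cos separately, so it gets half as many frequencies.
    n_freqs = embed_dim if complex_valued else embed_dim // 2
    # Frequencies spaced logarithmically between 10**0 and 10**max_log10_freq.
    freqs = [10 ** (max_log10_freq * k / max(n_freqs - 1, 1)) for k in range(n_freqs)]
    phases = [t * f for f in freqs]
    if complex_valued:
        # exp(i * phase) = cos(phase) + i * sin(phase): one complex number per frequency.
        return [cmath.exp(1j * p) for p in phases]
    # Concatenate all sines, then all cosines: 2 * n_freqs real numbers.
    return [math.sin(p) for p in phases] + [math.cos(p) for p in phases]


real = step_embedding(0.3, embed_dim=8)
cplx = step_embedding(0.3, embed_dim=8, complex_valued=True)
print(len(real), len(cplx))  # prints: 8 8
```

Both calls return 8 values, but the real-valued one uses only 4 frequencies (4 sines followed by 4 cosines), whereas the complex-valued one uses all 8.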

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(t)

Compute the diffusion-step embedding for a batch of diffusion steps.

Parameters:
- t (Tensor) – A 1D tensor of diffusion steps (time values) with shape (batch,).
Returns: The diffusion-step embedding, with shape (batch, embed_dim). The tensor is complex-valued if complex_valued is True and real-valued otherwise.
Return type: Tensor

####### Examples

>>> embedding = DiffusionStepEmbedding(embed_dim=128)
>>> t = torch.rand(4)
>>> output = embedding(t)
>>> print(output.shape)  # Output shape: (4, 128)