espnet2.enh.layers.dcunet.DiffusionStepEmbedding
class espnet2.enh.layers.dcunet.DiffusionStepEmbedding(embed_dim, complex_valued=False)
Bases: Module
Diffusion-step embedding as in DiffWave / Vaswani et al. 2017.
This module maps a diffusion step t to a fixed-size vector of sinusoidal features, in the style of the positional encoding of Vaswani et al. (2017), so that a network can be conditioned on the step. It supports both real-valued and complex-valued embeddings.
complex_valued
Indicates if the embedding is complex-valued.
- Type: bool
embed_dim
The dimension of the embedding.
- Type: int
Parameters:
- embed_dim (int) – The dimension of the embedding.
- complex_valued (bool, optional) – If True, the embedding will be complex-valued. Defaults to False.
Returns: The computed diffusion-step embedding.
Return type: Tensor
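The sinusoidal construction described above can be sketched in plain Python for a single scalar step. This is an illustration under stated assumptions, not the ESPnet implementation: the function name `step_embedding`, the geometric frequency spacing between 1 and 10^4, and the "sines first, then cosines" ordering are all hypothetical choices made for the example.

```python
import cmath
import math


def step_embedding(t, embed_dim, complex_valued=False):
    """Sinusoidal embedding of one scalar diffusion step t (illustrative only)."""
    # Assumption: the real-valued variant uses half as many frequencies,
    # because each frequency contributes a sine and a cosine feature.
    n_freqs = embed_dim if complex_valued else embed_dim // 2
    # Assumption: frequencies are spaced geometrically from 1 to 10**4.
    denom = max(n_freqs - 1, 1)  # avoid division by zero for a single frequency
    freqs = [10.0 ** (4.0 * k / denom) for k in range(n_freqs)]
    phases = [t * f for f in freqs]
    if complex_valued:
        # One unit-modulus complex number per frequency.
        return [cmath.exp(1j * p) for p in phases]
    # All sines first, then all cosines.
    return [math.sin(p) for p in phases] + [math.cos(p) for p in phases]


real = step_embedding(0.1, embed_dim=8)
cplx = step_embedding(0.1, embed_dim=8, complex_valued=True)
print(len(real), len(cplx))  # 8 8
```

Under these assumptions both variants return `embed_dim` values per step, which matches the shapes reported in the class examples on this page.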
####### Examples
>>> import torch
>>> from espnet2.enh.layers.dcunet import DiffusionStepEmbedding
>>> embedding = DiffusionStepEmbedding(embed_dim=128)
>>> t = torch.tensor([0.1, 0.2, 0.3])
>>> output = embedding(t)
>>> output.shape  # real-valued embedding
torch.Size([3, 128])
>>> embedding_complex = DiffusionStepEmbedding(embed_dim=128, complex_valued=True)
>>> output_complex = embedding_complex(t)
>>> output_complex.shape  # complex-valued embedding
torch.Size([3, 128])
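NOTE
When the output is real-valued, the number of sinusoidal frequencies used internally is halved to avoid ambiguities. The output still has embed_dim features, as the shapes in the example above show.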
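One plausible reading of the ambiguity mentioned in the note (an interpretation, not something this page confirms) is that a sine alone cannot tell the phases x and π − x apart, whereas a (sine, cosine) pair identifies the phase up to a multiple of 2π. The short check below illustrates that fact with the standard library only. The variable names are arbitrary.

```python
import math

x = 0.4
y = math.pi - x  # a different phase that has the same sine as x

same_sine = math.isclose(math.sin(x), math.sin(y))
same_cosine = math.isclose(math.cos(x), math.cos(y))
print(same_sine, same_cosine)  # True False

# With both features available, atan2 recovers each phase in (-pi, pi].
recovered_x = math.atan2(math.sin(x), math.cos(x))
recovered_y = math.atan2(math.sin(y), math.cos(y))
```

Pairing every frequency with both features doubles the feature count per frequency, which is consistent with using half as many frequencies to keep the output size fixed.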
forward(t)
Compute the embedding for a batch of diffusion steps.
- Parameters: t (Tensor) – The diffusion steps to embed, typically a 1D tensor with one value per batch element.
- Returns: The diffusion-step embedding. For a 1D input of shape (batch,), the output has shape (batch, embed_dim) and is complex-valued if complex_valued is True.
- Return type: Tensor
####### Examples
>>> import torch
>>> embedding = DiffusionStepEmbedding(embed_dim=128)
>>> t = torch.tensor([0.1, 0.2, 0.3])
>>> embedding(t).shape
torch.Size([3, 128])