espnet2.enh.layers.dcunet.DCUNetComplexEncoderBlock
class espnet2.enh.layers.dcunet.DCUNetComplexEncoderBlock(in_chan, out_chan, kernel_size, stride, padding, dilation, norm_type='bN', activation='leaky_relu', embed_dim=None, complex_time_embedding=False, temb_layers=1, temb_activation='silu')
Bases: Module
DCUNet Complex Encoder Block.
This block is a key component of the DCUNet architecture, which is used for speech enhancement. It applies a complex-valued convolution, normalization, and an activation function to the input features, and optionally incorporates a time embedding.
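As an illustration of what a complex-valued convolution involves, the sketch below builds one from four real-valued convolutions using the identity (a + ib)(c + id) = (ac - bd) + i(ad + bc). It is a minimal 1D, pure-Python sketch; the helper names `real_conv1d` and `complex_conv1d` are hypothetical, and whether `ComplexConv2d` is implemented exactly this way is an assumption, not something stated on this page.

```python
def real_conv1d(signal, kernel):
    """Valid-mode sliding dot product of two real-valued sequences."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


def complex_conv1d(signal, kernel):
    """Complex sliding dot product assembled from four real-valued ones."""
    x_re = [z.real for z in signal]
    x_im = [z.imag for z in signal]
    w_re = [z.real for z in kernel]
    w_im = [z.imag for z in kernel]
    # (x_re + i*x_im) * (w_re + i*w_im)
    #   = (x_re*w_re - x_im*w_im) + i*(x_re*w_im + x_im*w_re)
    rr = real_conv1d(x_re, w_re)
    ii = real_conv1d(x_im, w_im)
    ri = real_conv1d(x_re, w_im)
    ir = real_conv1d(x_im, w_re)
    return [complex(a - b, c + d) for a, b, c, d in zip(rr, ii, ri, ir)]


# Illustrative values only (not taken from the library).
signal = [1 + 2j, 3 - 1j, 0 + 1j, 2 + 2j]
kernel = [1 - 1j, 2 + 0.5j]
out = complex_conv1d(signal, kernel)

# Reference: the same operation using Python's built-in complex arithmetic.
direct = [
    sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
    for i in range(len(signal) - len(kernel) + 1)
]
```

Both `out` and `direct` hold the same three complex values, which shows that the real and imaginary parts can be handled by ordinary real-valued convolution layers.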
in_chan
Number of input channels.
- Type: int
out_chan
Number of output channels.
- Type: int
kernel_size
Size of the convolutional kernel.
- Type: tuple
stride
Stride of the convolution.
- Type: tuple
padding
Padding applied to the input.
- Type: tuple
dilation
Dilation applied to the convolution.
- Type: tuple
temb_layers
Number of time embedding layers.
- Type: int
temb_activation
Activation function for time embedding.
- Type: str
complex_time_embedding
Whether to use complex time embedding.
- Type: bool
conv
Complex convolutional layer.
- Type: ComplexConv2d
norm
Normalization layer.
- Type: nn.Module
activation
Activation function.
- Type: nn.Module
embed_dim
Dimension of the embedding space.
- Type: int
embed_layer
Sequential layer for embedding.
- Type: nn.Sequential
Parameters:
- in_chan (int) – Number of input channels.
- out_chan (int) – Number of output channels.
- kernel_size (tuple) – Size of the convolutional kernel.
- stride (tuple) – Stride of the convolution.
- padding (tuple) – Padding applied to the input.
- dilation (tuple) – Dilation applied to the convolution.
- norm_type (str) – Type of normalization to use (default: “bN”).
- activation (str) – Activation function to use (default: “leaky_relu”).
- embed_dim (int, optional) – Dimension of the embedding space.
- complex_time_embedding (bool, optional) – Whether to use complex time embedding (default: False).
- temb_layers (int, optional) – Number of time embedding layers (default: 1).
- temb_activation (str, optional) – Activation function for time embedding (default: “silu”).
Returns: None
####### Examples
>>> import torch
>>> from espnet2.enh.layers.dcunet import DCUNetComplexEncoderBlock
>>> encoder_block = DCUNetComplexEncoderBlock(1, 32, (3, 3), (1, 1),
... (1, 1), (1, 1))
>>> input_tensor = torch.randn(4, 1, 64, 64) + 1j * torch.randn(4, 1, 64, 64)
>>> output = encoder_block(input_tensor, None)
>>> print(output.shape)
torch.Size([4, 32, 64, 64])
NOTE
The input tensor should be complex-valued and have the shape (batch_size, channels, height, width).
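The output shape in the example above can be checked by hand. The sketch below uses the standard output-size formula for a strided, dilated convolution (the one documented for `torch.nn.Conv2d`); that this block's complex convolution follows the same formula is an assumption, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding, dilation):
    """Output length along one axis of a strided, dilated convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_output_shape(hw, kernel_size, stride, padding, dilation):
    """Apply the per-axis formula to a (height, width) pair."""
    return tuple(
        conv_output_size(n, k, s, p, d)
        for n, k, s, p, d in zip(hw, kernel_size, stride, padding, dilation)
    )


# Settings from the example: 3x3 kernel, stride 1, padding 1, dilation 1.
same = conv2d_output_shape((64, 64), (3, 3), (1, 1), (1, 1), (1, 1))

# Hypothetical variant with stride 2, which halves each spatial dimension.
halved = conv2d_output_shape((64, 64), (3, 3), (2, 2), (1, 1), (1, 1))
```

With the example's settings the spatial size stays at (64, 64), consistent with the printed `torch.Size([4, 32, 64, 64])`; the stride-2 variant yields (32, 32).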
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x, t_embed)
Processes the input complex spectrogram and time embedding through the DCUNet architecture.
The input shape is expected to be $(batch, nfreqs, time)$, where $nfreqs - 1$ is divisible by $f_0 \cdot f_1 \cdots f_N$, with $f_k$ the frequency strides of the encoders, and $time - 1$ is divisible by $t_0 \cdot t_1 \cdots t_N$, with $t_k$ the time strides of the encoders.
- Parameters:
- x (Tensor) – Complex spectrogram tensor. It can be a 1D, 2D, or 3D tensor, with time being the last dimension.
- t_embed (Tensor) – Time embedding tensor that provides additional context for the processing.
- Returns: Output tensor of shape (batch, time) or (time) after processing through the network.
- Return type: Tensor
####### Examples
>>> import torch
>>> from espnet2.enh.layers.dcunet import DCUNet
>>> net = DCUNet()
>>> dnn_input = torch.randn(4, 2, 257, 256) + 1j * torch.randn(4, 2, 257, 256)
>>> time_embedding = torch.randn(4)
>>> output = net(dnn_input, time_embedding)
>>> print(output.shape)
torch.Size([4, 1, n_fft, frames]) # Shape depends on input dimensions.
NOTE
The input spectrogram tensor should have its last dimension represent the time frames, with the preceding dimensions corresponding to the batch size and frequency channels.
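The divisibility condition on the input shape stated above can be checked before calling the network. The following is a minimal sketch; the function names and the stride values are hypothetical and chosen only for illustration, since the actual strides depend on the configured architecture.

```python
from math import prod


def dims_compatible(nfreqs, time, freq_strides, time_strides):
    """True if (nfreqs - 1) and (time - 1) are divisible by the stride products."""
    return (
        (nfreqs - 1) % prod(freq_strides) == 0
        and (time - 1) % prod(time_strides) == 0
    )


def next_valid_size(size, strides):
    """Smallest n >= size such that (n - 1) is divisible by prod(strides)."""
    total = prod(strides)
    return -(-(size - 1) // total) * total + 1  # ceiling division, then add 1


# Hypothetical strides: four frequency strides of 2, three time strides of 2.
freq_strides = (2, 2, 2, 2)
time_strides = (2, 2, 2)

ok = dims_compatible(257, 105, freq_strides, time_strides)
not_ok = dims_compatible(257, 100, freq_strides, time_strides)
padded_time = next_valid_size(100, time_strides)
```

Here 100 time frames do not satisfy the condition for the assumed strides, and `next_valid_size` reports 105 as the nearest length that does, which is one way to decide how much padding an input would need.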
