espnet2.enh.layers.dcunet.DCUNetComplexEncoderBlock
class espnet2.enh.layers.dcunet.DCUNetComplexEncoderBlock(in_chan, out_chan, kernel_size, stride, padding, dilation, norm_type='bN', activation='leaky_relu', embed_dim=None, complex_time_embedding=False, temb_layers=1, temb_activation='silu')
Bases: Module
DCUNet Complex Encoder Block.
This block is one encoder stage of the DCUNet architecture, which is used for speech enhancement. It applies a complex-valued convolution to the input features, followed by normalization and an activation function. When a time embedding is configured, the block also incorporates it into the output.
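As background for the complex-valued convolution mentioned above, a complex convolution can be written as four real-valued convolutions over the real and imaginary parts. The following is a toy 1-D sketch in plain Python. The helper names `conv1d_valid` and `complex_conv1d` are hypothetical and exist only for illustration; this is not a claim about how `ComplexConv2d` is implemented in ESPnet.

```python
# Toy 1-D illustration with hypothetical helper names (not the ESPnet code).

def conv1d_valid(signal, kernel):
    """Valid-mode cross-correlation, the operation that deep-learning
    'convolution' layers compute."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


def complex_conv1d(signal, kernel):
    """Complex convolution built from four real-valued convolutions."""
    x_re = [v.real for v in signal]
    x_im = [v.imag for v in signal]
    w_re = [v.real for v in kernel]
    w_im = [v.imag for v in kernel]
    # (x_re + i*x_im) * (w_re + i*w_im)
    #   = (x_re*w_re - x_im*w_im) + i*(x_re*w_im + x_im*w_re)
    rr = conv1d_valid(x_re, w_re)
    ii = conv1d_valid(x_im, w_im)
    ri = conv1d_valid(x_re, w_im)
    ir = conv1d_valid(x_im, w_re)
    return [complex(a - b, c + d) for a, b, c, d in zip(rr, ii, ri, ir)]


signal = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.5 + 1j]
kernel = [0.5 - 1j, 2 + 0.25j]

decomposed = complex_conv1d(signal, kernel)
# Reference result using Python's built-in complex arithmetic.
direct = conv1d_valid(signal, kernel)
```

Both `decomposed` and `direct` hold the same values up to floating-point rounding, which is the identity a complex convolution layer relies on.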
in_chan
Number of input channels.
- Type: int
out_chan
Number of output channels.
- Type: int
kernel_size
Size of the convolutional kernel.
- Type: tuple
stride
Stride of the convolution.
- Type: tuple
padding
Padding applied to the input.
- Type: tuple
dilation
Dilation applied to the convolution.
- Type: tuple
temb_layers
Number of time embedding layers.
- Type: int
temb_activation
Activation function for time embedding.
- Type: str
complex_time_embedding
Whether to use complex time embedding.
- Type: bool
conv
Complex convolutional layer.
- Type: ComplexConv2d
norm
Normalization layer.
- Type: nn.Module
activation
Activation function.
- Type: nn.Module
embed_dim
Dimension of the embedding space.
- Type: int
embed_layer
Sequential layer for embedding.
- Type: nn.Sequential
Parameters:
- in_chan (int) – Number of input channels.
- out_chan (int) – Number of output channels.
- kernel_size (tuple) – Size of the convolutional kernel.
- stride (tuple) – Stride of the convolution.
- padding (tuple) – Padding applied to the input.
- dilation (tuple) – Dilation applied to the convolution.
- norm_type (str) – Type of normalization to use (default: 'bN').
- activation (str) – Activation function to use (default: 'leaky_relu').
- embed_dim (int, optional) – Dimension of the embedding space (default: None).
- complex_time_embedding (bool, optional) – Whether to use complex time embedding (default: False).
- temb_layers (int, optional) – Number of time embedding layers (default: 1).
- temb_activation (str, optional) – Activation function for time embedding (default: 'silu').
Returns: None
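To see how `kernel_size`, `stride`, `padding`, and `dilation` interact, the sketch below computes the spatial output size of a convolution. It assumes the complex convolution follows the same output-size arithmetic as `torch.nn.Conv2d`, which is an assumption and not something this page states. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding, dilation):
    """Output length along one axis, using the torch.nn.Conv2d formula:
    floor((size + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
    """
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_output_shape(in_hw, kernel_size, stride, padding, dilation):
    """Apply the one-axis formula to the (height, width) pair."""
    return tuple(
        conv_output_size(*args)
        for args in zip(in_hw, kernel_size, stride, padding, dilation)
    )


# Kernel 3 with padding 1 and stride 1 preserves the spatial size.
same = conv2d_output_shape((64, 64), (3, 3), (1, 1), (1, 1), (1, 1))
# The same kernel with stride 2 halves each spatial dimension.
halved = conv2d_output_shape((64, 64), (3, 3), (2, 2), (1, 1), (1, 1))
```

Under that assumption, `same` is `(64, 64)` and `halved` is `(32, 32)`.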
####### Examples
>>> import torch
>>> from espnet2.enh.layers.dcunet import DCUNetComplexEncoderBlock
>>> encoder_block = DCUNetComplexEncoderBlock(1, 32, (3, 3), (1, 1),
... (1, 1), (1, 1))
>>> input_tensor = torch.randn(4, 1, 64, 64) + 1j * torch.randn(4, 1, 64, 64)
>>> output = encoder_block(input_tensor, None)
>>> print(output.shape)
torch.Size([4, 32, 64, 64])
NOTE
The input tensor should be complex-valued and have the shape (batch_size, channels, height, width).
forward(x, t_embed)
Processes the input complex spectrogram and time embedding through the DCUNet architecture.
The input shape is expected to be $(batch, nfreqs, time)$, where $nfreqs - 1$ is divisible by $f_0 \cdot f_1 \cdots f_N$, with $f_k$ the frequency strides of the encoders, and $time - 1$ is divisible by $t_0 \cdot t_1 \cdots t_N$, with $t_k$ the time strides of the encoders.
- Parameters:
- x (Tensor) – Complex spectrogram tensor. It can be a 1D, 2D, or 3D tensor, with time being the last dimension.
- t_embed (Tensor) – Time embedding tensor to provide additional context for the processing.
- Returns: Output tensor of shape (batch, time) or (time) after processing through the network.
- Return type: Tensor
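The divisibility condition on the input shape can be checked before calling the network. The sketch below is a minimal helper under stated assumptions: the function names and the stride values are hypothetical and are not read from any DCUNet configuration.

```python
from math import prod


def is_valid_input_shape(nfreqs, time, freq_strides, time_strides):
    """True if (nfreqs - 1) and (time - 1) are divisible by the product
    of the frequency strides and time strides, respectively."""
    return (
        (nfreqs - 1) % prod(freq_strides) == 0
        and (time - 1) % prod(time_strides) == 0
    )


def next_valid_size(size, strides):
    """Smallest n >= size such that (n - 1) is divisible by prod(strides)."""
    p = prod(strides)
    return -(-(size - 1) // p) * p + 1  # ceiling division, then add 1


# Hypothetical strides, chosen only for illustration.
freq_strides = (2, 2, 2, 2)
time_strides = (2, 2, 1, 1)

ok = is_valid_input_shape(257, 257, freq_strides, time_strides)
not_ok = is_valid_input_shape(257, 256, freq_strides, time_strides)
padded_time = next_valid_size(256, time_strides)
```

With these example strides, 257 frequency bins and 257 frames satisfy the condition, while 256 frames would need to be padded to `padded_time` frames first.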
####### Examples
>>> import torch
>>> from espnet2.enh.layers.dcunet import DCUNet
>>> net = DCUNet()
>>> dnn_input = torch.randn(4, 2, 257, 256) + 1j * torch.randn(4, 2, 257, 256)
>>> time_embedding = torch.randn(4)
>>> output = net(dnn_input, time_embedding)
>>> print(output.shape)
torch.Size([4, 1, n_fft, frames]) # Shape depends on input dimensions.
NOTE
The last dimension of the input tensor should represent the time frames, and the preceding dimensions should correspond to the batch size and frequency channels.