espnet2.enh.layers.dcunet.DCUNetComplexEncoderBlock

class espnet2.enh.layers.dcunet.DCUNetComplexEncoderBlock(in_chan, out_chan, kernel_size, stride, padding, dilation, norm_type='bN', activation='leaky_relu', embed_dim=None, complex_time_embedding=False, temb_layers=1, temb_activation='silu')

Bases: Module

DCUNet Complex Encoder Block.

This block is a building block of the DCUNet architecture, which is used for speech enhancement. It applies a complex-valued convolution, normalization, and an activation function to the input features, and incorporates a time embedding when one is configured.
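To make "complex-valued convolution" concrete, the sketch below builds a complex 1-D convolution out of four real convolutions and checks it against direct complex arithmetic. It uses plain Python lists rather than ESPnet's `ComplexConv2d`, whose internals are not shown on this page; the helper names `conv1d_valid` and `complex_conv1d` are hypothetical and exist only for illustration.

```python
def conv1d_valid(signal, kernel):
    """1-D cross-correlation without padding (works for real or complex values)."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


def complex_conv1d(signal, kernel):
    """Complex convolution composed of four real convolutions.

    Uses (a + ib)(c + id) = (ac - bd) + i(ad + bc), applied tap by tap.
    """
    x_re = [z.real for z in signal]
    x_im = [z.imag for z in signal]
    w_re = [z.real for z in kernel]
    w_im = [z.imag for z in kernel]

    real = [a - b for a, b in zip(conv1d_valid(x_re, w_re), conv1d_valid(x_im, w_im))]
    imag = [a + b for a, b in zip(conv1d_valid(x_re, w_im), conv1d_valid(x_im, w_re))]
    return [complex(r, i) for r, i in zip(real, imag)]


# Illustrative values only.
signal = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.5 + 0.5j]
kernel = [0.5 - 1j, 2 + 1j]

direct = conv1d_valid(signal, kernel)      # native complex arithmetic
composed = complex_conv1d(signal, kernel)  # four real convolutions
```

Both results agree up to floating-point error, which is why a complex layer can be implemented with ordinary real-valued convolution kernels.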

in_chan

Number of input channels.

Type: int

out_chan

Number of output channels.

Type: int

kernel_size

Size of the convolutional kernel.

Type: tuple

stride

Stride of the convolution.

Type: tuple

padding

Padding applied to the input.

Type: tuple

dilation

Dilation applied to the convolution.

Type: tuple

temb_layers

Number of time embedding layers.

Type: int

temb_activation

Activation function for time embedding.

Type: str

complex_time_embedding

Whether to use complex time embedding.

Type: bool

conv

Complex convolutional layer.

Type: ComplexConv2d

norm

Normalization layer.

Type: nn.Module

activation

Activation function.

Type: nn.Module

embed_dim

Dimension of the embedding space.

Type: int

embed_layer

Sequential layer for embedding.

Type: nn.Sequential

Parameters:
- in_chan (int) – Number of input channels.
- out_chan (int) – Number of output channels.
- kernel_size (tuple) – Size of the convolutional kernel.
- stride (tuple) – Stride of the convolution.
- padding (tuple) – Padding applied to the input.
- dilation (tuple) – Dilation applied to the convolution.
- norm_type (str) – Type of normalization to use (default: “bN”).
- activation (str) – Activation function to use (default: “leaky_relu”).
- embed_dim (int, optional) – Dimension of the embedding space.
- complex_time_embedding (bool, optional) – Whether to use complex time embedding (default: False).
- temb_layers (int, optional) – Number of time embedding layers (default: 1).
- temb_activation (str, optional) – Activation function for time embedding (default: “silu”).
Returns: None

####### Examples

>>> import torch
>>> from espnet2.enh.layers.dcunet import DCUNetComplexEncoderBlock
>>> encoder_block = DCUNetComplexEncoderBlock(1, 32, (3, 3), (1, 1),
...                                             (1, 1), (1, 1))
>>> input_tensor = torch.randn(4, 1, 64, 64) + 1j * torch.randn(4, 1, 64, 64)
>>> output = encoder_block(input_tensor, None)
>>> print(output.shape)
torch.Size([4, 32, 64, 64])
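
The spatial size in the example above follows from the standard convolution output-size formula used by PyTorch's `Conv2d`. The sketch below applies that formula per dimension; `conv_output_size` and `conv_output_shape` are hypothetical helpers, not part of ESPnet, and they assume the block's convolution follows the usual `Conv2d` arithmetic.

```python
def conv_output_size(size, kernel, stride, padding, dilation):
    """Output length of one convolution dimension (standard Conv2d arithmetic)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def conv_output_shape(in_hw, kernel_size, stride, padding, dilation):
    """Apply conv_output_size to the height and width dimensions."""
    return tuple(
        conv_output_size(s, k, st, p, d)
        for s, k, st, p, d in zip(in_hw, kernel_size, stride, padding, dilation)
    )


# Same hyperparameters as the example: 3x3 kernel, stride 1, padding 1, dilation 1.
same_size = conv_output_shape((64, 64), (3, 3), (1, 1), (1, 1), (1, 1))

# A stride of 2 in both dimensions halves the spatial size.
halved = conv_output_shape((64, 64), (3, 3), (2, 2), (1, 1), (1, 1))
```

With a 3x3 kernel, padding 1, and stride 1 the spatial size is preserved, so only the channel count changes from `in_chan` to `out_chan`.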

NOTE

The input tensor should be complex-valued and have the shape (batch_size, channels, height, width).

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x, t_embed)

Processes the input complex spectrogram and time embedding through the DCUNet architecture.

The input shape is expected to be $(batch, nfreqs, time)$, where $nfreqs - 1$ is divisible by $f_0 \cdot f_1 \cdots f_N$, with $f_k$ the frequency strides of the encoders, and $time - 1$ is divisible by $t_0 \cdot t_1 \cdots t_N$, with $t_k$ the time strides of the encoders.
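
The divisibility condition can be checked before running the network. The sketch below is a minimal check under stated assumptions: `is_valid_input_shape` is a hypothetical helper, and the stride tuples are made-up values for illustration, not the strides of any particular DCUNet configuration.

```python
from math import prod


def is_valid_input_shape(nfreqs, time, freq_strides, time_strides):
    """Check that nfreqs - 1 and time - 1 are divisible by the stride products."""
    freq_ok = (nfreqs - 1) % prod(freq_strides) == 0
    time_ok = (time - 1) % prod(time_strides) == 0
    return freq_ok and time_ok


# Hypothetical per-encoder strides (assumed values, for illustration only).
freq_strides = (2, 2, 2, 2)  # product: 16
time_strides = (1, 1, 2, 2)  # product: 4

ok = is_valid_input_shape(257, 257, freq_strides, time_strides)   # 256 % 16 == 0, 256 % 4 == 0
bad = is_valid_input_shape(257, 256, freq_strides, time_strides)  # 255 % 4 != 0
```

Inputs that fail the check would typically be padded or trimmed along the offending dimension before being passed to the model.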

Parameters:
- spec (Tensor) – Complex spectrogram tensor, passed as the x argument of forward. It can be a 1D, 2D, or 3D tensor, with time being the last dimension.
- t (Tensor) – Time embedding tensor, passed as the t_embed argument of forward, that provides additional context for the processing.
Returns: Output tensor of shape (batch, time) or (time) after processing through the network.
Return type: Tensor

####### Examples

>>> net = DCUNet()
>>> dnn_input = torch.randn(4, 2, 257, 256) + 1j * torch.randn(4, 2, 257, 256)
>>> time_embedding = torch.randn(4)
>>> output = net(dnn_input, time_embedding)
>>> print(output.shape)
torch.Size([4, 1, n_fft, frames])  # Shape depends on input dimensions.

NOTE

The input tensor spec should have the last dimension representing the time frames and the preceding dimensions corresponding to the batch size and frequency channels.