espnet2.enh.layers.uses.ChannelTAC
class espnet2.enh.layers.uses.ChannelTAC(input_dim, eps=1e-05)
Bases: Module
Channel Transform-Average-Concatenate (TAC) module.
This module enhances the channel representation by transforming the input features, averaging them across channels, and concatenating the averaged result back onto the per-channel features. It is particularly useful when inter-channel interactions matter for the overall performance of the model.
transform
A sequential module that transforms the input features to a higher-dimensional space.
- Type: nn.Sequential
average
A sequential module that averages the transformed features.
- Type: nn.Sequential
concat
A sequential module that concatenates the transformed and averaged features and applies layer normalization.
- Type: nn.Sequential
Parameters:
- input_dim (int) – Dimension of the input feature.
- eps (float) – Epsilon for layer normalization, to avoid division by zero.
####### Examples
>>> tac = ChannelTAC(input_dim=128)
>>> input_tensor = torch.randn(32, 4, 128, 64, 64) # (batch, C, N, freq, time)
>>> output_tensor = tac(input_tensor)
>>> output_tensor.shape
torch.Size([32, 4, 128, 64, 64])
- Returns: Output feature (batch, C, N, freq, time) after transformation, averaging, concatenation, and residual addition.
- Return type: output (torch.Tensor)
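The transform–average–concatenate flow described above can be sketched as follows. This is a minimal illustration, not the actual ESPnet implementation: the hidden sizes, activations, and module names (`TinyTAC`) are assumptions chosen to mirror the `transform`/`average`/`concat` attributes documented here.

```python
import torch
import torch.nn as nn

class TinyTAC(nn.Module):
    """Minimal TAC sketch for input of shape (batch, C, N, freq, time)."""

    def __init__(self, input_dim, eps=1e-5):
        super().__init__()
        # per-channel transform to a higher-dimensional space
        self.transform = nn.Sequential(nn.Linear(input_dim, 3 * input_dim), nn.PReLU())
        # transform applied to the cross-channel average
        self.average = nn.Sequential(nn.Linear(3 * input_dim, 3 * input_dim), nn.PReLU())
        # fuse per-channel and averaged features, then layer-normalize
        self.concat = nn.Sequential(
            nn.Linear(6 * input_dim, input_dim),
            nn.PReLU(),
            nn.LayerNorm(input_dim, eps=eps),
        )

    def forward(self, x):
        # move the feature dim N last: (batch, C, freq, time, N)
        feat = x.permute(0, 1, 3, 4, 2)
        out = self.transform(feat)                          # transform each channel
        avg = self.average(out.mean(dim=1, keepdim=True))   # average across channels
        out = self.concat(torch.cat([out, avg.expand_as(out)], dim=-1))
        out = out + feat                                    # residual addition
        return out.permute(0, 1, 4, 2, 3)                   # back to (batch, C, N, freq, time)

x = torch.randn(2, 4, 16, 8, 10)
y = TinyTAC(16)(x)
print(y.shape)  # torch.Size([2, 4, 16, 8, 10])
```

As in the docstring, the output shape matches the input shape, so the block can be stacked freely inside a larger network.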
forward(x, ref_channel=None)
Channel TAC forward.
- Parameters:
- x (torch.Tensor) – Input feature (batch, C, N, freq, time).
- ref_channel (None or int) – Index of the reference channel.
- Returns: Output feature (batch, C, N, freq, time).
- Return type: torch.Tensor
espnet2.enh.layers.uses.USES
Unconstrained Speech Enhancement and Separation (USES) Network.
Reference: [1] W. Zhang, K. Saijo, Z.-Q. Wang, S. Watanabe, and Y. Qian, “Toward Universal Speech Enhancement for Diverse Input Conditions,” in Proc. ASRU, 2023.
- Parameters:
input_size (int) – Dimension of the input feature.
output_size (int) – Dimension of the output.
bottleneck_size (int) – Dimension of the bottleneck feature. Must be a multiple of att_heads.
num_blocks (int) – Number of processing blocks.
num_spatial_blocks (int) – Number of processing blocks with channel modeling.
segment_size (int) – Number of frames in each non-overlapping segment. This is used to segment long utterances into smaller segments for efficient processing.
memory_size (int) – Group size of global memory tokens. The basic use of memory tokens is to store the history information from previous segments. The memory tokens are updated by the output of the last block after processing each segment.
memory_types (int) – Number of memory token groups. Each group corresponds to a different type of processing, e.g., the first group is used for denoising without dereverberation, and the second group for denoising with dereverberation.
rnn_type (str) – Type of the RNN cell in the improved Transformer layer.
hidden_size (int) – Hidden dimension of the RNN cell.
att_heads (int) – Number of attention heads in Transformer.
dropout (float) – Dropout ratio. Default is 0.
activation (str) – Non-linear activation function applied in each block.
bidirectional (bool) – Whether the RNN layers are bidirectional.
norm_type (str) – Normalization type in the improved Transformer layer.
ch_mode (str) – Mode of channel modeling. Select from “att” and “tac”.
ch_att_dim (int) – Dimension of the channel attention.
eps (float) – Epsilon for layer normalization.
####### Examples
>>> model = USES(input_size=128, output_size=64)
>>> input_tensor = torch.randn(10, 2, 128, 64, 50) # (batch, mics, input_size, freq, time)
>>> output = model(input_tensor)
>>> output.shape
torch.Size([10, 64, 64, 50]) # (batch, output_size, freq, time)
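The roles of `segment_size` and `memory_size` described above can be sketched with a simplified segment-processing loop. This is an illustration under stated assumptions, not the actual ESPnet implementation: `process` is a placeholder for the stack of USES blocks, and the shapes and names are hypothetical.

```python
import torch

def process(seg):
    # placeholder for the USES processing blocks (identity here)
    return seg

def segment_with_memory(feature, segment_size, memory):
    """feature: (batch, N, freq, time); memory: (batch, N, freq, memory_size)."""
    outputs = []
    mem_len = memory.shape[-1]
    for start in range(0, feature.shape[-1], segment_size):
        seg = feature[..., start:start + segment_size]
        # prepend the memory tokens to the segment along the time axis
        seg = process(torch.cat([memory, seg], dim=-1))
        # split off the updated memory tokens; they carry history to the next segment
        memory, out = seg[..., :mem_len], seg[..., mem_len:]
        outputs.append(out)
    return torch.cat(outputs, dim=-1), memory

feat = torch.randn(2, 8, 4, 50)  # (batch, N, freq, time)
mem = torch.zeros(2, 8, 4, 5)    # (batch, N, freq, memory_size)
out, mem = segment_with_memory(feat, segment_size=20, memory=mem)
print(out.shape, mem.shape)  # torch.Size([2, 8, 4, 50]) torch.Size([2, 8, 4, 5])
```

Segmenting the 50-frame utterance into non-overlapping 20-frame chunks keeps each attention window small, while the memory tokens pass history from one segment to the next, matching the description of `memory_size` above.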