espnet2.diar.layers.multi_mask.MultiMask
class espnet2.diar.layers.multi_mask.MultiMask(input_dim: int, bottleneck_dim: int = 128, max_num_spk: int = 3, mask_nonlinear='relu')
Bases: AbsMask
Multiple 1x1 convolution layer Module.
This module corresponds to the final 1x1 convolution block and non-linear function in TCNSeparator. It has multiple 1x1 convolution blocks, one of which is selected according to the specified number of speakers to handle a flexible number of speakers.
- Parameters:
- input_dim (int) – Number of filters in the autoencoder.
- bottleneck_dim (int , optional) – Number of channels in the bottleneck 1x1 convolution block. Defaults to 128.
- max_num_spk (int , optional) – Maximum number of mask_conv1x1 modules (should be >= maximum number of speakers in the dataset). Defaults to 3.
- mask_nonlinear (str , optional) – Non-linear function to use for generating masks. Defaults to “relu”.
max_num_spk
The maximum number of speakers supported by the model.
- Type: int
mask_nonlinear
The non-linear activation function used to generate the masks.
- Type: str
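For orientation, the selection mechanism described above can be pictured as a list of 1x1 convolutions indexed by the requested speaker count: the block for n speakers maps the bottleneck channels to n sets of mask channels. The sketch below only illustrates that idea and is not the actual ESPnet implementation; the class name MultiMaskSketch and the exact channel layout are assumptions (only the mask_conv1x1 name comes from the parameters above).

```python
import torch
import torch.nn as nn


class MultiMaskSketch(nn.Module):
    """Illustrative sketch: one 1x1 conv block per supported speaker count."""

    def __init__(self, input_dim: int, bottleneck_dim: int = 128,
                 max_num_spk: int = 3, mask_nonlinear: str = "relu"):
        super().__init__()
        # The block for n speakers maps B bottleneck channels to n * N channels,
        # i.e. one set of N mask channels per speaker (assumed layout).
        self.mask_conv1x1 = nn.ModuleList(
            [nn.Conv1d(bottleneck_dim, (n + 1) * input_dim, 1, bias=False)
             for n in range(max_num_spk)]
        )
        self.mask_nonlinear = mask_nonlinear

    def forward(self, bottleneck_feat: torch.Tensor, num_spk: int) -> torch.Tensor:
        # bottleneck_feat: (M, K, B) -> (M, B, K) so Conv1d runs over the K frames
        score = self.mask_conv1x1[num_spk - 1](bottleneck_feat.transpose(1, 2))
        M, _, K = score.shape
        score = score.view(M, num_spk, -1, K)   # (M, num_spk, N, K)
        if self.mask_nonlinear == "relu":
            return torch.relu(score)
        if self.mask_nonlinear == "softmax":
            return torch.softmax(score, dim=1)  # normalize across speakers
        raise ValueError("Unsupported mask non-linear function")
```

The real MultiMask additionally applies the estimated masks to the encoder output and returns the masked signals together with ilens and an OrderedDict of per-speaker masks, as documented in forward() below.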
######### Examples
>>> model = MultiMask(input_dim=256, bottleneck_dim=128,
... max_num_spk=3, mask_nonlinear='relu')
>>> input_tensor = torch.randn(10, 64, 256) # (M, K, N)
>>> ilens = torch.tensor([64] * 10) # Lengths for each input
>>> bottleneck_feat = torch.randn(10, 64, 128) # (M, K, B)
>>> masked, ilens_out, others = model(input_tensor, ilens,
... bottleneck_feat, num_spk=2)
- Raises: ValueError – If an unsupported mask non-linear function is specified.
- Returns: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]:
- masked (List[Union[torch.Tensor, ComplexTensor]]): List of masked outputs for each speaker.
- ilens (torch.Tensor): Lengths of the input sequences.
- others (OrderedDict): Additional predicted data, including masks for each speaker.
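Because all 1x1 blocks up to max_num_spk exist after construction, a single instance can be called with a different num_spk from one batch to the next. A usage sketch along those lines, assuming the constructor and forward() signatures documented here (the mask key names follow the Returns description above):

>>> import torch
>>> from espnet2.diar.layers.multi_mask import MultiMask
>>> model = MultiMask(input_dim=256, bottleneck_dim=128, max_num_spk=3)
>>> mixture = torch.randn(4, 100, 256)          # (M, K, N) encoder output
>>> ilens = torch.tensor([100, 100, 100, 100])  # frame lengths per utterance
>>> bottleneck_feat = torch.randn(4, 100, 128)  # (M, K, B)
>>> for num_spk in (1, 2, 3):                   # any value up to max_num_spk
...     masked, ilens_out, others = model(mixture, ilens, bottleneck_feat, num_spk)
...     assert len(masked) == num_spk           # one masked output per speaker
...     assert list(others) == [f"mask_spk{i + 1}" for i in range(num_spk)]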
forward(input: Tensor | ComplexTensor, ilens: Tensor, bottleneck_feat: Tensor, num_spk: int) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Processes input through the multiple 1x1 convolution layers.
This method applies the forward pass for the MultiMask module, which consists of multiple 1x1 convolution layers that generate masks for separating audio signals from different speakers. The number of masks generated corresponds to the specified number of speakers.
Parameters:
- input – A tensor of shape [M, K, N], where M is the batch size, K is the number of time frames, and N is the feature dimension (the number of filters in the autoencoder).
- ilens (torch.Tensor) – A tensor of shape (M,) containing the lengths of the input sequences.
- bottleneck_feat – A tensor of shape [M, K, B], representing the bottleneck features, where B is the bottleneck dimension.
- num_spk – An integer indicating the number of speakers (training: oracle, inference: estimated by another module).
Returns:
- masked (List[Union[torch.Tensor, ComplexTensor]]): A list of tensors of shape [(M, K, N), …], where each tensor is the input masked by the estimated mask for one speaker.
- ilens (torch.Tensor): A tensor of shape (M,) containing the lengths of the input sequences.
- others (OrderedDict): An ordered dictionary containing additional predicted data, such as the masks for each speaker:
- 'mask_spk1': torch.Tensor(Batch, Frames, Freq),
- 'mask_spk2': torch.Tensor(Batch, Frames, Freq),
…
- 'mask_spkn': torch.Tensor(Batch, Frames, Freq).
Return type: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]
Raises: ValueError – If the specified non-linear function for mask generation is unsupported.
######### Examples
>>> multi_mask = MultiMask(input_dim=128, bottleneck_dim=128)
>>> input_tensor = torch.randn(2, 64, 128)     # (M, K, N)
>>> ilens = torch.tensor([64, 64])             # lengths of the input sequences
>>> bottleneck_feat = torch.randn(2, 64, 128)  # (M, K, B)
>>> num_spk = 2                                # number of speakers
>>> masked_output, lengths, masks = multi_mask(
...     input_tensor, ilens, bottleneck_feat, num_spk)
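The per-speaker masks come back in the masks OrderedDict under the keys listed above ('mask_spk1', 'mask_spk2', …); a minimal continuation of this example (the variable names are illustrative):

>>> mask_spk1 = masks["mask_spk1"]       # mask for speaker 1, (Batch, Frames, Freq)
>>> separated_spk1 = masked_output[0]    # input masked for speaker 1, (M, K, N)
>>> len(masked_output) == num_spk
True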
NOTE
This API is designed to be compatible with the TasNet framework.
property max_num_spk : int
max_num_spk
Maximum number of speakers that can be processed by this module.
- Type: int