espnet2.enh.separator.dptnet_separator.DPTNetSeparator


class espnet2.enh.separator.dptnet_separator.DPTNetSeparator(input_dim: int, post_enc_relu: bool = True, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, unit: int = 256, att_heads: int = 4, dropout: float = 0.0, activation: str = 'relu', norm_type: str = 'gLN', layer: int = 6, segment_size: int = 20, nonlinear: str = 'relu')

Bases: AbsSeparator

Dual-Path Transformer Network (DPTNet) Separator for audio source separation.

This class implements the DPTNet architecture for separating audio sources based on input features. It utilizes a dual-path strategy to efficiently process audio signals and estimate the masks for multiple speakers.
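
To make the dual-path idea concrete, the sketch below tracks tensor shapes only. It assumes the usual dual-path layout, in which a chunked feature of shape (B, N, L, K) is first processed along the chunk length L (intra-chunk) and then across the K chunks (inter-chunk). The helper names are hypothetical and are not part of ESPnet.

```python
def intra_chunk_view(B, N, L, K):
    """Shape seen by an intra-chunk block: every chunk is a sequence of length L."""
    return (B * K, L, N)


def inter_chunk_view(B, N, L, K):
    """Shape seen by an inter-chunk block: every in-chunk position is a sequence of length K."""
    return (B * L, K, N)


shape = (2, 64, 20, 7)  # (batch, features, chunk length, number of chunks)
print(intra_chunk_view(*shape))  # (14, 20, 64)
print(inter_chunk_view(*shape))  # (40, 7, 64)
```

Alternating the two views lets each block attend over a short sequence (L or K items) instead of the full number of frames.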

num_spk

The number of speakers for separation.

Type: int

predict_noise

Indicates if the estimated noise signal should be output.

Type: bool

segment_size

The size of the segments used in dual-path processing.

Type: int

post_enc_relu

If True, applies ReLU activation after encoding.

Type: bool

enc_LN

Normalization layer applied after encoding.

num_outputs

The number of outputs, including the estimated noise if applicable.

Type: int

dptnet

The DPTNet model instance used for processing.

output

The gated output layer for generating filters.

output_gate

The gate layer for controlling output activation.

nonlinear

The nonlinear function used for mask estimation.

Parameters:
- input_dim (int) – Input feature dimension.
- post_enc_relu (bool) – If True, applies ReLU after encoding. Default is True.
- rnn_type (str) – Select from ‘RNN’, ‘LSTM’, or ‘GRU’. Default is ‘lstm’.
- bidirectional (bool) – Whether inter-chunk RNN layers are bidirectional. Default is True.
- num_spk (int) – Number of speakers. Default is 2.
- predict_noise (bool) – Whether to output the estimated noise signal. Default is False.
- unit (int) – Dimension of the hidden state. Default is 256.
- att_heads (int) – Number of attention heads. Default is 4.
- dropout (float) – Dropout ratio. Default is 0.0.
- activation (str) – Activation function applied at the output of RNN. Default is ‘relu’.
- norm_type (str) – Type of normalization to use after Transformer blocks. Default is ‘gLN’.
- layer (int) – Number of stacked RNN layers. Default is 6.
- segment_size (int) – Dual-path segment size. Default is 20.
- nonlinear (str) – Nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’. Default is ‘relu’.
Raises:ValueError – If nonlinear is not one of ‘sigmoid’, ‘relu’, or ‘tanh’.
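
The `nonlinear` check described above can be pictured with a small lookup table. This is a minimal sketch using scalar stand-ins for the three supported functions; `NONLINEARITIES` and `pick_nonlinear` are hypothetical names, and the real class works on tensors rather than Python floats.

```python
import math

# Hypothetical scalar stand-ins for the three supported mask nonlinearities.
NONLINEARITIES = {
    "relu": lambda v: max(0.0, v),
    "tanh": math.tanh,
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
}


def pick_nonlinear(name):
    """Return the function registered under `name`; reject anything else."""
    if name not in NONLINEARITIES:
        raise ValueError(f"Not supporting nonlinear={name}")
    return NONLINEARITIES[name]


print(pick_nonlinear("relu")(-0.5))    # 0.0
print(pick_nonlinear("sigmoid")(0.0))  # 0.5
```

The choice bounds the mask values: `relu` gives non-negative masks, `sigmoid` keeps them in (0, 1), and `tanh` allows values in (-1, 1).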

########### Examples

>>> # Initialize the DPTNetSeparator
>>> separator = DPTNetSeparator(input_dim=256, num_spk=2, predict_noise=True)
>>> # Forward pass through the separator
>>> masked, ilens, others = separator.forward(input_tensor, input_lengths)
>>> # Access the estimated masks
>>> mask_spk1 = others['mask_spk1']
>>> mask_spk2 = others['mask_spk2']
>>> noise_estimate = others.get('noise1', None)

Dual-Path Transformer Network (DPTNet) Separator

Parameters:
- input_dim – input feature dimension
- rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
- bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
- num_spk – number of speakers
- predict_noise – whether to output the estimated noise signal
- unit – int, dimension of the hidden state.
- att_heads – number of attention heads.
- dropout – float, dropout ratio. Default is 0.
- activation – activation function applied at the output of RNN.
- norm_type – type of normalization to use after each inter- or intra-chunk Transformer block.
- nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
- layer – int, number of stacked RNN layers. Default is 6.
- segment_size – dual-path segment size

forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]

Forward pass for the DPTNetSeparator.

This method processes the input features through the DPTNet architecture, applying necessary transformations and returning the masked outputs along with the predicted masks for each speaker.

Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded feature of shape [B, T, N], where B is the batch size, T is the number of time frames, and N is the feature dimension.
- ilens (torch.Tensor) – Input lengths of shape [Batch].
- additional (Optional[Dict]) – Other data included in the model. NOTE: This parameter is not used in this model.
Returns:
- masked (List[Union[torch.Tensor, ComplexTensor]]) – List of masked outputs, each of shape (B, T, N).
- ilens (torch.Tensor) – Input lengths of shape (B,).
- others (OrderedDict) – Predicted data, e.g. masks:
  - 'mask_spk1': torch.Tensor of shape (Batch, Frames, Freq),
  - 'mask_spk2': torch.Tensor of shape (Batch, Frames, Freq),
  - …
  - 'mask_spkn': torch.Tensor of shape (Batch, Frames, Freq).
Return type: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]
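
The layout of the returned values can be sketched as follows, with plain floats standing in for tensors. `build_outputs` is a hypothetical helper. The sketch assumes that each masked output is the mixture multiplied by its mask, and that the noise estimate is stored under `'noise1'`, the key queried in the class-level example.

```python
from collections import OrderedDict


def build_outputs(mixture, masks, num_spk, predict_noise):
    """Arrange masks as in the return description above.

    `masks` holds num_spk speaker masks, followed by one noise mask
    when predict_noise is True.
    """
    spk_masks = masks[:num_spk]
    masked = [mixture * m for m in spk_masks]
    others = OrderedDict(
        (f"mask_spk{i + 1}", m) for i, m in enumerate(spk_masks)
    )
    if predict_noise:
        # Assumption: the noise estimate lives under the key "noise1".
        others["noise1"] = mixture * masks[num_spk]
    return masked, others


masked, others = build_outputs(2.0, [0.5, 0.25, 0.25], num_spk=2, predict_noise=True)
print(masked)        # [1.0, 0.5]
print(list(others))  # ['mask_spk1', 'mask_spk2', 'noise1']
```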

########### Examples

>>> separator = DPTNetSeparator(input_dim=128)
>>> input_tensor = torch.randn(10, 100, 128)  # Batch of 10
>>> ilens = torch.tensor([100] * 10)  # All sequences of length 100
>>> masked, lengths, others = separator(input_tensor, ilens)

NOTE

This method is designed to handle both real and complex input tensors.
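
A common way for mask-based separators to support complex input is to estimate real-valued masks from the magnitude and apply them to the original complex feature. The sketch below shows that pattern with plain Python complex numbers. Whether DPTNetSeparator follows exactly this path is an assumption here, and `apply_mask` is a hypothetical helper.

```python
def apply_mask(feature, mask):
    """Scale each (possibly complex) bin by a real-valued mask value."""
    return [f * m for f, m in zip(feature, mask)]


complex_bins = [3 + 4j, 1 - 1j]
magnitudes = [abs(z) for z in complex_bins]  # real-valued view for mask estimation
masked = apply_mask(complex_bins, [0.5, 0.25])

print(magnitudes[0])  # 5.0
print(masked[0])      # (1.5+2j)
```

Because the mask is real and non-negative in this example, it rescales each bin and leaves its phase untouched.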

merge_feature(x, length=None)

Merge feature chunks back into a single feature sequence.

This method takes the output of the dual-path processing and merges the feature chunks into a single sequence using a folding operation. It handles both cases where the output length is specified or needs to be inferred from the number of chunks.

Parameters:
- x (torch.Tensor) – Input tensor of shape (B, N, L, n_chunks) where: B - batch size, N - number of feature channels, L - length of each feature chunk, n_chunks - number of chunks to merge.
- length (Optional[int]) – Desired length of the output sequence. If None, the length is calculated based on the number of chunks and segment size.
Returns: Merged feature tensor of shape (B, N, length).
Return type: torch.Tensor

NOTE

The output is normalized by the number of overlapping segments used during the merge process.
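
The normalization can be illustrated with a simplified one-dimensional overlap-add in plain Python. This sketch assumes 50% overlap between chunks and pads only the tail. The real method operates on tensors and may pad the edges differently. `split_chunks` and `merge_chunks` are hypothetical helpers.

```python
def split_chunks(seq, chunk_len):
    """Cut `seq` into chunks of `chunk_len` with 50% overlap, zero-padding the tail."""
    hop = chunk_len // 2
    n_chunks = max(1, -(-(len(seq) - chunk_len) // hop) + 1)  # ceiling division
    padded = list(seq) + [0.0] * ((n_chunks - 1) * hop + chunk_len - len(seq))
    return [padded[k * hop : k * hop + chunk_len] for k in range(n_chunks)]


def merge_chunks(chunks, length=None):
    """Overlap-add the chunks, then divide each sample by its overlap count."""
    chunk_len = len(chunks[0])
    hop = chunk_len // 2
    total = (len(chunks) - 1) * hop + chunk_len  # length inferred from the chunks
    acc, count = [0.0] * total, [0] * total
    for k, chunk in enumerate(chunks):
        for i, value in enumerate(chunk):
            acc[k * hop + i] += value
            count[k * hop + i] += 1
    merged = [a / c for a, c in zip(acc, count)]
    return merged if length is None else merged[:length]


signal = [float(i) for i in range(9)]
chunks = split_chunks(signal, chunk_len=4)
print(len(chunks))                               # 4
print(merge_chunks(chunks, length=9) == signal)  # True
```

Without the division by the overlap count, samples covered by two chunks would come out doubled.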

########### Examples

>>> separator = DPTNetSeparator(input_dim=128)
>>> feats = torch.randn(2, 64, 40)  # (B, N, T)
>>> chunks = separator.split_feature(feats)  # (B, N, L, n_chunks)
>>> merged_features = separator.merge_feature(chunks, length=40)
>>> print(merged_features.shape)  # Output: torch.Size([2, 64, 40])

Raises: ValueError – If the input tensor x does not have the expected dimensions.

property num_spk

split_feature(x)

Dual-Path Transformer Network (DPTNet) Separator.

This class implements a DPTNet separator for audio source separation tasks. It leverages a dual-path architecture that processes audio features for multiple speakers, optionally estimating noise signals.

_num_spk

Number of speakers.

Type: int

predict_noise

Whether to output the estimated noise signal.

Type: bool

segment_size

Dual-path segment size.

Type: int

post_enc_relu

Apply ReLU after encoding.

Type: bool

enc_LN

Normalization layer.

num_outputs

Number of outputs (speakers + noise).

Type: int

dptnet

Instance of the DPTNet class for processing.

output

Gated output layer for filter generation.

output_gate

Gated output layer for controlling output.

nonlinear

Nonlinear activation function for mask estimation.

Parameters:
- input_dim (int) – Input feature dimension.
- post_enc_relu (bool) – If True, apply ReLU after encoding.
- rnn_type (str) – Type of RNN (‘RNN’, ‘LSTM’, ‘GRU’).
- bidirectional (bool) – If True, use bidirectional RNN layers.
- num_spk (int) – Number of speakers to separate.
- predict_noise (bool) – If True, output the estimated noise signal.
- unit (int) – Dimension of the hidden state.
- att_heads (int) – Number of attention heads.
- dropout (float) – Dropout ratio. Default is 0.
- activation (str) – Activation function applied at RNN output.
- norm_type (str) – Type of normalization to use.
- layer (int) – Number of stacked RNN layers. Default is 6.
- segment_size (int) – Size of each segment in dual-path processing.
- nonlinear (str) – Nonlinear function for mask estimation (‘relu’, ‘tanh’, ‘sigmoid’).
Raises:ValueError – If an unsupported nonlinear function is provided.

########### Examples

>>> separator = DPTNetSeparator(input_dim=256, num_spk=2)
>>> input_tensor = torch.randn(10, 20, 256)  # Batch, Time, Feature
>>> ilens = torch.tensor([20] * 10)  # Input lengths
>>> masked, ilens, others = separator(input_tensor, ilens)