espnet2.enh.separator.dptnet_separator.DPTNetSeparator
class espnet2.enh.separator.dptnet_separator.DPTNetSeparator(input_dim: int, post_enc_relu: bool = True, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, unit: int = 256, att_heads: int = 4, dropout: float = 0.0, activation: str = 'relu', norm_type: str = 'gLN', layer: int = 6, segment_size: int = 20, nonlinear: str = 'relu')
Bases: AbsSeparator
Dual-Path Transformer Network (DPTNet) Separator for audio source separation.
This class implements the DPTNet architecture for separating audio sources based on input features. It utilizes a dual-path strategy to efficiently process audio signals and estimate the masks for multiple speakers.
num_spk
The number of speakers for separation.
- Type: int
predict_noise
Indicates if the estimated noise signal should be output.
- Type: bool
segment_size
The size of the segments used in dual-path processing.
- Type: int
post_enc_relu
If True, applies ReLU activation after encoding.
- Type: bool
enc_LN
Normalization layer applied after encoding.
num_outputs
The number of outputs, including the estimated noise if applicable.
- Type: int
dptnet
The DPTNet model instance used for processing.
output
The gated output layer for generating filters.
output_gate
The gate layer for controlling output activation.
nonlinear
The nonlinear function used for mask estimation.
- Parameters:
- input_dim (int) – Input feature dimension.
- post_enc_relu (bool) – If True, applies ReLU after encoding. Default is True.
- rnn_type (str) – Select from ‘RNN’, ‘LSTM’, or ‘GRU’. Default is ‘lstm’.
- bidirectional (bool) – Whether inter-chunk RNN layers are bidirectional. Default is True.
- num_spk (int) – Number of speakers. Default is 2.
- predict_noise (bool) – Whether to output the estimated noise signal. Default is False.
- unit (int) – Dimension of the hidden state. Default is 256.
- att_heads (int) – Number of attention heads. Default is 4.
- dropout (float) – Dropout ratio. Default is 0.0.
- activation (str) – Activation function applied at the output of RNN. Default is ‘relu’.
- norm_type (str) – Type of normalization to use after Transformer blocks. Default is ‘gLN’.
- layer (int) – Number of stacked RNN layers. Default is 6.
- segment_size (int) – Dual-path segment size. Default is 20.
- nonlinear (str) – Nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’. Default is ‘relu’.
- Raises: ValueError – If nonlinear is not one of ‘sigmoid’, ‘relu’, or ‘tanh’.
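The check on `nonlinear` described above can be pictured with a small plain-Python sketch. The helper name `select_nonlinear` and the scalar activations are illustrative assumptions, not part of ESPnet; the class itself presumably stores a torch activation module instead.

```python
import math


def select_nonlinear(name: str):
    """Hypothetical helper: map a name to a scalar activation function."""
    table = {
        "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
        "relu": lambda v: max(0.0, v),
        "tanh": math.tanh,
    }
    if name not in table:
        # Mirrors the documented behaviour: unsupported names raise ValueError.
        raise ValueError("Not supporting nonlinear={}".format(name))
    return table[name]


relu = select_nonlinear("relu")
print(relu(-1.5), relu(2.0))  # 0.0 2.0
```

Passing any other name, for example `select_nonlinear("softmax")`, raises `ValueError` in this sketch.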
Examples
>>> # Initialize the DPTNetSeparator
>>> separator = DPTNetSeparator(input_dim=256, num_spk=2, predict_noise=True)
>>> # Forward pass through the separator
>>> masked, ilens, others = separator.forward(input_tensor, input_lengths)
>>> # Access the estimated masks
>>> mask_spk1 = others['mask_spk1']
>>> mask_spk2 = others['mask_spk2']
>>> noise_estimate = others.get('noise1', None)
Dual-Path Transformer Network (DPTNet) Separator
- Parameters:
- input_dim – input feature dimension
- rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
- bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
- num_spk – number of speakers
- predict_noise – whether to output the estimated noise signal
- unit – int, dimension of the hidden state.
- att_heads – number of attention heads.
- dropout – float, dropout ratio. Default is 0.
- activation – activation function applied at the output of RNN.
- norm_type – type of normalization to use after each inter- or intra-chunk Transformer block.
- nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
- layer – int, number of stacked RNN layers. Default is 6.
- segment_size – dual-path segment size
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Forward pass for the DPTNetSeparator.
This method processes the input features through the DPTNet architecture, applying necessary transformations and returning the masked outputs along with the predicted masks for each speaker.
- Parameters:
- input (Union[torch.Tensor, ComplexTensor]) – Encoded feature of shape [B, T, N], where B is the batch size, T is the number of time frames, and N is the feature dimension.
- ilens (torch.Tensor) – Input lengths of shape [Batch].
- additional (Optional[Dict]) – Other data included in the model. NOTE: This parameter is not used in this model.
- Returns:
- masked (List[Union[torch.Tensor, ComplexTensor]]): List of masked outputs, each of shape [(B, T, N), …].
- ilens (torch.Tensor): Tensor of input lengths with shape (B,).
- others (OrderedDict): Dictionary containing predicted data, e.g. masks:
- ‘mask_spk1’: torch.Tensor of shape (Batch, Frames, Freq),
- ‘mask_spk2’: torch.Tensor of shape (Batch, Frames, Freq),
- …
- ‘mask_spkn’: torch.Tensor of shape (Batch, Frames, Freq).
- Return type: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]
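To make the return layout concrete, here is a minimal plain-Python sketch of how the masked outputs and the `others` dictionary could be assembled. The helper `package_outputs` is hypothetical, short lists of floats stand in for tensors, and treating the last output as the noise estimate stored under `noise1` is an assumption based on the keys used in the class-level example.

```python
from collections import OrderedDict


def package_outputs(feature, masks, predict_noise=False):
    """Hypothetical helper mirroring the documented return layout.

    feature: list of floats standing in for one encoded frame.
    masks:   one list of floats per output (speakers first, noise last).
    """
    # Apply each mask element-wise to the encoded feature.
    masked = [[f * m for f, m in zip(feature, mask)] for mask in masks]
    num_spk = len(masks) - 1 if predict_noise else len(masks)
    others = OrderedDict(
        ("mask_spk{}".format(i + 1), masks[i]) for i in range(num_spk)
    )
    if predict_noise:
        # Assumption: the last output is the noise estimate.
        others["noise1"] = masked.pop()
    return masked, others


feature = [1.0, 2.0, 4.0]
masks = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0], [0.5, 0.5, 0.5]]
masked, others = package_outputs(feature, masks, predict_noise=True)
print(list(others))  # ['mask_spk1', 'mask_spk2', 'noise1']
print(masked)        # [[1.0, 1.0, 0.0], [0.0, 1.0, 4.0]]
```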
Examples
>>> separator = DPTNetSeparator(input_dim=128)
>>> input_tensor = torch.randn(10, 100, 128) # Batch of 10
>>> ilens = torch.tensor([100] * 10) # All sequences of length 100
>>> masked, lengths, others = separator(input_tensor, ilens)
NOTE
This method is designed to handle both real and complex input tensors.
merge_feature(x, length=None)
Merge feature chunks back into a single feature sequence.
This method takes the output of the dual-path processing and merges the feature chunks into a single sequence using a folding operation. It handles both cases where the output length is specified or needs to be inferred from the number of chunks.
- Parameters:
- x (torch.Tensor) – Input tensor of shape (B, N, L, n_chunks) where: B - batch size, N - number of feature channels, L - length of each feature chunk, n_chunks - number of chunks to merge.
- length (Optional[int]) – Desired length of the output sequence. If None, the length is calculated based on the number of chunks and segment size.
- Returns: Merged feature tensor of shape (B, N, length).
- Return type: torch.Tensor
NOTE
The output is normalized by the number of overlapping segments used during the merge process.
Examples
>>> separator = DPTNetSeparator(input_dim=64, segment_size=20)
>>> feats = torch.randn(2, 64, 40)  # (B, N, T)
>>> chunks = separator.split_feature(feats)  # (B, N, segment_size, n_chunks)
>>> merged_features = separator.merge_feature(chunks, length=40)
>>> print(merged_features.shape)  # Output: torch.Size([2, 64, 40])
- Raises: ValueError – If the input tensor x does not have the expected dimensions.
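The length inference and the overlap normalization described for `merge_feature` can be illustrated on a single 1-D sequence. This is a simplified sketch under stated assumptions: `merge_chunks` is a hypothetical helper, the hop is passed in explicitly, and any padding applied by the real method is ignored.

```python
def merge_chunks(chunks, hop, length=None):
    """Hypothetical 1-D overlap-add; chunks is a list of equal-length lists."""
    seg = len(chunks[0])
    if length is None:
        # Infer the output length from the chunk count and chunk size.
        length = (len(chunks) - 1) * hop + seg
    total = [0.0] * length
    count = [0] * length
    for k, chunk in enumerate(chunks):
        for j, value in enumerate(chunk):
            t = k * hop + j
            if t < length:
                total[t] += value
                count[t] += 1
    # Divide by the number of chunks that covered each position.
    return [s / c if c else 0.0 for s, c in zip(total, count)]


chunks = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
print(merge_chunks(chunks, hop=2))  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

Positions covered by both chunks are averaged, which is why the middle values come out as 2.0 rather than 4.0.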
property num_spk
split_feature(x)
Dual-Path Transformer Network (DPTNet) Separator.
This class implements a DPTNet separator for audio source separation tasks. It leverages a dual-path architecture that processes audio features for multiple speakers, optionally estimating noise signals.
_num_spk
Number of speakers.
- Type: int
predict_noise
Whether to output the estimated noise signal.
- Type: bool
segment_size
Dual-path segment size.
- Type: int
post_enc_relu
Apply ReLU after encoding.
- Type: bool
enc_LN
Normalization layer.
num_outputs
Number of outputs (speakers + noise).
- Type: int
dptnet
Instance of the DPTNet class for processing.
output
Gated output layer for filter generation.
output_gate
Gate layer for controlling the output.
nonlinear
Nonlinear activation function for mask estimation.
- Parameters:
- input_dim (int) – Input feature dimension.
- post_enc_relu (bool) – If True, apply ReLU after encoding.
- rnn_type (str) – Type of RNN (‘RNN’, ‘LSTM’, ‘GRU’).
- bidirectional (bool) – If True, use bidirectional RNN layers.
- num_spk (int) – Number of speakers to separate.
- predict_noise (bool) – If True, output the estimated noise signal.
- unit (int) – Dimension of the hidden state.
- att_heads (int) – Number of attention heads.
- dropout (float) – Dropout ratio. Default is 0.
- activation (str) – Activation function applied at RNN output.
- norm_type (str) – Type of normalization to use.
- layer (int) – Number of stacked RNN layers. Default is 6.
- segment_size (int) – Size of each segment in dual-path processing.
- nonlinear (str) – Nonlinear function for mask estimation (‘relu’, ‘tanh’, ‘sigmoid’).
- Raises: ValueError – If an unsupported nonlinear function is provided.
Examples
>>> separator = DPTNetSeparator(input_dim=256, num_spk=2)
>>> input_tensor = torch.randn(10, 20, 256) # Batch, Time, Feature
>>> ilens = torch.tensor([20] * 10) # Input lengths
>>> masked, ilens, others = separator(input_tensor, ilens)
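The text above does not say how `split_feature` itself cuts the sequence. As a rough picture of dual-path segmentation, the sketch below assumes chunks of `segment_size` frames with a hop of `segment_size // 2` and zero-padding at the tail. Both the helper `split_chunks` and those choices are assumptions for illustration, and the real method may pad differently.

```python
import math


def split_chunks(seq, segment_size):
    """Hypothetical 1-D chunking with a hop of segment_size // 2."""
    hop = segment_size // 2
    # Number of chunks needed to cover the whole sequence.
    n_chunks = max(1, math.ceil((len(seq) - segment_size) / hop) + 1)
    padded_len = (n_chunks - 1) * hop + segment_size
    # Zero-pad the tail so the last chunk is complete.
    padded = list(seq) + [0.0] * (padded_len - len(seq))
    return [padded[k * hop : k * hop + segment_size] for k in range(n_chunks)]


chunks = split_chunks([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], segment_size=4)
print(chunks)
# [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 0.0]]
```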