espnet2.enh.separator.neural_beamformer.NeuralBeamformer
class espnet2.enh.separator.neural_beamformer.NeuralBeamformer(input_dim: int, num_spk: int = 1, loss_type: str = 'mask_mse', use_wpe: bool = False, wnet_type: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, wnonlinear: str = 'crelu', multi_source_wpe: bool = True, wnormalization: bool = False, use_beamformer: bool = True, bnet_type: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, badim: int = 320, ref_channel: int = -1, use_noise_mask: bool = True, bnonlinear: str = 'sigmoid', beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, bdropout_rate: float = 0.0, shared_power: bool = True, use_torchaudio_api: bool = False, diagonal_loading: bool = True, diag_eps_wpe: float = 1e-07, diag_eps_bf: float = 1e-07, mask_flooring: bool = False, flooring_thres_wpe: float = 1e-06, flooring_thres_bf: float = 1e-06, use_torch_solver: bool = True)
Bases: AbsSeparator
NeuralBeamformer is a neural network-based separator that performs speech
enhancement by combining dereverberation and beamforming. It extends the AbsSeparator class and uses deep neural network (DNN) mask estimators to drive both the dereverberation (WPE) and beamforming stages.
num_spk
The number of speakers to separate.
- Type: int
loss_type
The type of loss function used for training. Supported types include “mask_mse”, “spectrum”, “spectrum_log”, and “magnitude”.
- Type: str
use_beamformer
Flag indicating whether to use beamforming.
- Type: bool
use_wpe
Flag indicating whether to use dereverberation via WPE.
- Type: bool
shared_power
Indicates if speech powers should be shared between WPE and beamforming.
- Type: bool
Parameters:
- input_dim (int) – The dimension of the input feature.
- num_spk (int , optional) – Number of speakers to separate. Defaults to 1.
- loss_type (str , optional) – Loss function type. Defaults to “mask_mse”.
- use_wpe (bool , optional) – Use WPE for dereverberation. Defaults to False.
- wnet_type (str , optional) – Type of WPE network. Defaults to “blstmp”.
- wlayers (int , optional) – Number of WPE network layers. Defaults to 3.
- wunits (int , optional) – Number of units in WPE network layers. Defaults to 300.
- wprojs (int , optional) – Number of projections in WPE. Defaults to 320.
- wdropout_rate (float , optional) – Dropout rate for WPE. Defaults to 0.0.
- taps (int , optional) – Number of taps for WPE. Defaults to 5.
- delay (int , optional) – Delay for WPE. Defaults to 3.
- use_dnn_mask_for_wpe (bool , optional) – Use DNN for WPE mask estimation. Defaults to True.
- wnonlinear (str , optional) – Nonlinearity type for WPE. Defaults to “crelu”.
- multi_source_wpe (bool , optional) – Use multi-source WPE. Defaults to True.
- wnormalization (bool , optional) – Normalize WPE outputs. Defaults to False.
- use_beamformer (bool , optional) – Use beamformer. Defaults to True.
- bnet_type (str , optional) – Type of beamformer network. Defaults to “blstmp”.
- blayers (int , optional) – Number of beamformer network layers. Defaults to 3.
- bunits (int , optional) – Number of units in beamformer network layers. Defaults to 300.
- bprojs (int , optional) – Number of projections in beamformer. Defaults to 320.
- badim (int , optional) – Attention dimension used for reference channel estimation in the beamformer. Defaults to 320.
- ref_channel (int , optional) – Reference channel for beamforming; -1 means the reference channel is estimated by an attention mechanism. Defaults to -1.
- use_noise_mask (bool , optional) – Use noise mask in beamforming. Defaults to True.
- bnonlinear (str , optional) – Nonlinearity type for beamformer. Defaults to “sigmoid”.
- beamformer_type (str , optional) – Type of beamformer. Defaults to “mvdr_souden”.
- rtf_iterations (int , optional) – Number of iterations for estimating the relative transfer function (RTF). Defaults to 2.
- bdropout_rate (float , optional) – Dropout rate for beamformer. Defaults to 0.0.
- shared_power (bool , optional) – Share speech powers between WPE and beamforming. Defaults to True.
- use_torchaudio_api (bool , optional) – Use Torchaudio API. Defaults to False.
- diagonal_loading (bool , optional) – Use diagonal loading for stability. Defaults to True.
- diag_eps_wpe (float , optional) – Epsilon for WPE diagonal loading. Defaults to 1e-7.
- diag_eps_bf (float , optional) – Epsilon for beamformer diagonal loading. Defaults to 1e-7.
- mask_flooring (bool , optional) – Apply mask flooring. Defaults to False.
- flooring_thres_wpe (float , optional) – Threshold for WPE flooring. Defaults to 1e-6.
- flooring_thres_bf (float , optional) – Threshold for beamformer flooring. Defaults to 1e-6.
- use_torch_solver (bool , optional) – Use Torch solver for computations. Defaults to True.
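For illustration, a configuration sketch enabling both WPE dereverberation and MVDR beamforming might look as follows (the values are illustrative, not tuned settings):
>>> model = NeuralBeamformer(
...     input_dim=257,
...     num_spk=2,
...     use_wpe=True,
...     use_beamformer=True,
...     beamformer_type="mvdr_souden",
...     ref_channel=-1,
... )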
Returns:
- enhanced speech (single-channel): List of enhanced tensors.
- output lengths: Lengths of the output tensors.
- other predicted data: An OrderedDict containing various masks and dereverberated outputs.
Return type: Tuple[List[Union[torch.Tensor, ComplexTensor]], torch.Tensor, OrderedDict]
Raises: ValueError – If an unsupported loss type is provided during initialization.
####### Examples
>>> model = NeuralBeamformer(input_dim=257, num_spk=2)
>>> mixed_speech = torch.randn(4, 100, 2, 257, dtype=torch.complex64)
>>> ilens = torch.tensor([100, 100, 100, 100])
>>> enhanced, output_lengths, masks = model(mixed_speech, ilens)
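Since the model above is freshly constructed (and therefore in training mode), the returned OrderedDict is populated with masks; it can be inspected like any mapping, with key names following the layout documented under Returns:
>>> for key, value in masks.items():
...     print(key, tuple(value.shape))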
NOTE
The additional argument in the forward method is not utilized in this implementation.
forward(input: Tensor | ComplexTensor, ilens: Tensor, additional: Dict | None = None) → Tuple[List[Tensor | ComplexTensor], Tensor, OrderedDict]
Perform the forward pass of the NeuralBeamformer model.
This method processes the input mixed speech and returns the enhanced speech along with other predicted data. It handles both training and inference modes; masks are estimated only during training, which saves memory at inference time.
- Parameters:
- input (torch.complex64/ComplexTensor) – Mixed speech tensor of shape [Batch, Frames, Channel, Freq] or [Batch, Frames, Freq].
- ilens (torch.Tensor) – Tensor of input lengths with shape [Batch].
- additional (Dict or None) – Additional data included in the model (not used in this model).
- Returns:
- enhanced speech (single-channel): List[torch.complex64/ComplexTensor]
- output lengths: torch.Tensor
- other predicted data: OrderedDict[
  'dereverb1': ComplexTensor(Batch, Frames, Channel, Freq),
  'mask_dereverb1': torch.Tensor(Batch, Frames, Channel, Freq),
  'mask_noise1': torch.Tensor(Batch, Frames, Channel, Freq),
  'mask_spk1': torch.Tensor(Batch, Frames, Channel, Freq),
  'mask_spk2': torch.Tensor(Batch, Frames, Channel, Freq),
  ...
  'mask_spkn': torch.Tensor(Batch, Frames, Channel, Freq),
  ]
- Return type: Tuple
####### Examples
>>> model = NeuralBeamformer(input_dim=256, num_spk=2)
>>> mixed_speech = torch.randn(8, 100, 2, 256, dtype=torch.complex64)
>>> lengths = torch.tensor([100]*8)
>>> enhanced, output_lengths, other_data = model.forward(mixed_speech, lengths)
NOTE
The method estimates masks only during training for memory efficiency. In inference mode, it performs enhancement without mask estimation.
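A minimal inference sketch, assuming the standard PyTorch evaluation workflow (switching to eval mode disables the training-only mask estimation):
>>> model = model.eval()
>>> with torch.no_grad():
...     enhanced, out_lens, others = model(mixed_speech, lengths)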
- Raises: AssertionError – If the input dimension is not 3 or 4.
property num_spk