espnet2.enh.layers.dnn_beamformer.DNN_Beamformer
class espnet2.enh.layers.dnn_beamformer.DNN_Beamformer(bidim, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, num_spk: int = 1, use_noise_mask: bool = True, nonlinear: str = 'sigmoid', dropout_rate: float = 0.0, badim: int = 320, ref_channel: int = -1, beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, mwf_mu: float = 1.0, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True, use_torchaudio_api: bool = False, btaps: int = 5, bdelay: int = 3)
Bases: Module
DNN mask based Beamformer.
This class implements a deep neural network (DNN) based beamformer for enhancing speech signals in multi-channel audio. The beamformer utilizes various algorithms, including MVDR, MPDR, and WPD, and is capable of estimating masks for speech and noise.
Citation: Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; http://proceedings.mlr.press/v70/ochiai17a/ochiai17a.pdf
mask
An instance of the MaskEstimator used for estimating masks for beamforming.
- Type: MaskEstimator
ref
An optional attention-based reference used for beamforming.
- Type: AttentionReference or None
ref_channel
Index of the reference channel for beamforming.
- Type: int
use_noise_mask
Flag indicating whether to use noise mask.
- Type: bool
num_spk
Number of speakers to separate.
- Type: int
nmask
Number of masks to be estimated.
- Type: int
beamformer_type
Type of beamformer to use (e.g., “mvdr_souden”).
- Type: str
rtf_iterations
Number of iterations for estimating the RTF.
- Type: int
mwf_mu
Weight for noise suppression in SDW-MWF.
- Type: float
btaps
Number of taps for WPD beamformer.
- Type: int
bdelay
Delay for WPD beamformer.
- Type: int
eps
Small value to avoid division by zero.
- Type: float
diagonal_loading
Flag for applying diagonal loading.
- Type: bool
diag_eps
Small value for diagonal loading.
- Type: float
mask_flooring
Flag for applying mask flooring.
- Type: bool
flooring_thres
Threshold for mask flooring.
- Type: float
use_torch_solver
Flag indicating whether to use Torch solver.
- Type: bool
Parameters:
- bidim (int) – Input feature dimension.
- btype (str) – Type of DNN architecture (default: “blstmp”).
- blayers (int) – Number of layers in the DNN (default: 3).
- bunits (int) – Number of units in each DNN layer (default: 300).
- bprojs (int) – Number of projections (default: 320).
- num_spk (int) – Number of speakers (default: 1).
- use_noise_mask (bool) – Whether to use noise mask (default: True).
- nonlinear (str) – Nonlinear activation function (default: “sigmoid”).
- dropout_rate (float) – Dropout rate (default: 0.0).
- badim (int) – Dimension for attention reference (default: 320).
- ref_channel (int) – Index of reference channel (default: -1).
- beamformer_type (str) – Type of beamformer (default: “mvdr_souden”).
- rtf_iterations (int) – Number of iterations for RTF estimation (default: 2).
- mwf_mu (float) – Noise suppression weight for SDW-MWF (default: 1.0).
- eps (float) – Small constant to prevent division by zero (default: 1e-6).
- diagonal_loading (bool) – Flag for diagonal loading (default: True).
- diag_eps (float) – Small value for diagonal loading (default: 1e-7).
- mask_flooring (bool) – Flag for applying mask flooring (default: False).
- flooring_thres (float) – Threshold for mask flooring (default: 1e-6).
- use_torch_solver (bool) – Use Torch solver (default: True).
- use_torchaudio_api (bool) – Use torchaudio API (default: False).
- btaps (int) – Number of taps for WPD (default: 5).
- bdelay (int) – Delay for WPD (default: 3).
########### Examples
>>> # Initialize the DNN_Beamformer
>>> beamformer = DNN_Beamformer(bidim=128, num_spk=2)
>>> # Forward pass through the beamformer
>>> enhanced, ilens, masks = beamformer(data, ilens)
- Raises: ValueError – If an unsupported beamformer type is provided or if the number of speakers is less than 1.
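The two error conditions and the mask count described above can be sketched in plain Python. This is only an illustration, not the ESPnet source: the helper name `check_beamformer_config`, the short tuple of type strings, and the rule that one extra mask is estimated when `use_noise_mask` is true are all assumptions made for the example.

```python
# Illustrative subset only (assumed); the real class accepts more types.
ASSUMED_TYPES = ("mvdr_souden", "mpdr_souden", "wpd_souden")


def check_beamformer_config(beamformer_type="mvdr_souden", num_spk=1,
                            use_noise_mask=True):
    """Hypothetical helper mirroring the documented ValueError conditions."""
    if beamformer_type not in ASSUMED_TYPES:
        raise ValueError(f"Unsupported beamformer type: {beamformer_type!r}")
    if num_spk < 1:
        raise ValueError(f"num_spk must be at least 1, got {num_spk}")
    # Assumption: one mask per speaker, plus one for noise when requested.
    nmask = num_spk + 1 if use_noise_mask else num_spk
    return nmask


print(check_beamformer_config(num_spk=2))                        # 3
print(check_beamformer_config(num_spk=2, use_noise_mask=False))  # 2
```

Under these assumptions, a two-speaker setup estimates three masks when a noise mask is requested and two otherwise.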
apply_beamforming(data, ilens, psd_n, psd_speech, psd_distortion=None, rtf_mat=None, spk=0)
Beamforming with the provided statistics.
This method applies beamforming techniques using the provided noise and speech covariance matrices, as well as other optional parameters. The implementation varies based on the type of beamformer being used, such as MVDR, MPDR, WPD, and others.
- Parameters:
- data (torch.complex64/ComplexTensor) – Input tensor of shape (B, F, C, T), where B is the batch size, F is the number of frequency bins, C is the number of channels, and T is the time dimension.
- ilens (torch.Tensor) – A tensor of shape (B,) representing the lengths of the input sequences.
- psd_n (torch.complex64/ComplexTensor) – Noise covariance matrix for MVDR (shape: (B, F, C, C)), observation covariance matrix for MPDR/wMPDR, or stacked observation covariance for WPD (shape: (B, F, (btaps+1)*C, (btaps+1)*C)).
- psd_speech (torch.complex64/ComplexTensor) – Speech covariance matrix (shape: (B, F, C, C)).
- psd_distortion (torch.complex64/ComplexTensor, optional) – Distortion covariance matrix (shape: (B, F, C, C)).
- rtf_mat (torch.complex64/ComplexTensor, optional) – RTF matrix (shape: (B, F, C, num_spk)).
- spk (int, optional) – Speaker index. Default is 0.
- Returns: A tuple containing:
- enhanced (torch.complex64/ComplexTensor): Enhanced output tensor of shape (B, F, T).
- ws (torch.complex64/ComplexTensor): Weight vectors of shape (B, F) or (B, F, (btaps+1)*C).
- Raises: ValueError – If the beamformer type is not supported.
########### Examples
>>> beamformer = DNN_Beamformer(...)
>>> enhanced, weights = beamformer.apply_beamforming(data, ilens, psd_n,
... psd_speech)
####### NOTE The computation performed depends on the configured beamformer type, which also determines how psd_n is interpreted: noise covariance for MVDR, observation covariance for MPDR/wMPDR, and stacked observation covariance for WPD.
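To make the role of `psd_n` and `psd_speech` concrete, here is a minimal NumPy sketch of MVDR weights in Souden's trace-normalised form for a single frequency bin. It is not the ESPnet implementation: the function name `mvdr_souden_weights`, the exact form of diagonal loading, and the use of NumPy rather than PyTorch are assumptions chosen to keep the example small.

```python
import numpy as np


def mvdr_souden_weights(psd_s, psd_n, ref_channel=0, diag_eps=1e-7, eps=1e-6):
    """MVDR weights (Souden form) for one (batch, frequency) bin.

    psd_s, psd_n: (C, C) Hermitian speech / noise covariance matrices.
    Returns a weight vector of shape (C,).
    """
    C = psd_n.shape[-1]
    # Diagonal loading (assumed form): add a small multiple of the trace.
    psd_n = psd_n + diag_eps * np.trace(psd_n).real * np.eye(C)
    numerator = np.linalg.solve(psd_n, psd_s)      # psd_n^{-1} psd_s
    ws = numerator / (np.trace(numerator) + eps)   # normalise by the trace
    return ws[:, ref_channel]                      # pick the reference column


rng = np.random.default_rng(0)
C, T = 4, 200
steer = rng.standard_normal(C) + 1j * rng.standard_normal(C)
noise = rng.standard_normal((C, T)) + 1j * rng.standard_normal((C, T))
psd_s = np.outer(steer, steer.conj())   # rank-1 speech covariance
psd_n = noise @ noise.conj().T / T      # sample noise covariance

w = mvdr_souden_weights(psd_s, psd_n, ref_channel=0)
# Applying w^H to the steering vector should return the reference channel.
response = w.conj() @ steer
print(np.allclose(response, steer[0], atol=1e-3))
```

With a rank-1 speech covariance, the filter passes the speech component as it appears at the reference channel, which is the distortionless property MVDR is designed to keep.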
forward(data: Tensor | ComplexTensor, ilens: LongTensor, powers: List[Tensor] | None = None, oracle_masks: List[Tensor] | None = None) → Tuple[Tensor | ComplexTensor, LongTensor, Tensor]
DNN_Beamformer forward function.
This method performs the forward pass for the DNN-based beamformer, applying the beamforming process to the input data. It takes in the data, input lengths, optional power spectra, and oracle masks, and produces enhanced signals, updated input lengths, and estimated masks.
Notation: B: batch size; C: number of channels; T: time or sequence length; F: number of frequency bins.
- Parameters:
- data (torch.complex64/ComplexTensor) – Input tensor of shape (B, T, C, F).
- ilens (torch.Tensor) – Input lengths of shape (B,).
- powers (List[torch.Tensor] or None) – Optional power spectra used for wMPDR or WPD with shape (B, F, T).
- oracle_masks (List[torch.Tensor] or None) – Optional oracle masks of shape (B, F, C, T). If provided, these masks will be used instead of the computed masks.
- Returns: A tuple containing:
- enhanced (torch.complex64/ComplexTensor): Enhanced output of shape (B, T, F).
- ilens (torch.Tensor): Updated input lengths of shape (B,).
- masks (torch.Tensor): Estimated masks of shape (B, T, C, F).
########### Examples
>>> model = DNN_Beamformer(...)
>>> data = torch.randn(4, 160, 2, 64, dtype=torch.complex64)
>>> ilens = torch.tensor([160, 160, 160, 160])
>>> enhanced, ilens, masks = model.forward(data, ilens)
####### NOTE The forward method assumes that the input data is complex and that the beamforming statistics are properly initialized in the DNN_Beamformer class. If oracle masks are provided, they should have the correct shape to match the number of channels in the input data.
- Raises: ValueError – If the specified beamformer type is not supported.
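One common way to turn time-frequency masks into the covariance matrices consumed by `apply_beamforming` is a mask-weighted average of outer products over time. The NumPy sketch below illustrates that idea together with the change of layout from (B, T, C, F) to (B, F, C, T). The function name `masked_psd`, the channel averaging of the mask, and the normalisation are assumptions for illustration and may differ from what the class does internally.

```python
import numpy as np


def masked_psd(spec_btcf, mask_btcf, eps=1e-6):
    """Mask-weighted spatial covariance.

    spec_btcf: complex array of shape (B, T, C, F).
    mask_btcf: real array of shape (B, T, C, F) with values in [0, 1].
    Returns an array of shape (B, F, C, C).
    """
    # Move to the (B, F, C, T) layout used for the beamforming statistics.
    spec = spec_btcf.transpose(0, 3, 2, 1)
    mask = mask_btcf.transpose(0, 3, 2, 1)
    # Assumption: average the mask over channels, then normalise over time.
    mask = mask.mean(axis=2)                                   # (B, F, T)
    mask = mask / (mask.sum(axis=-1, keepdims=True) + eps)
    # Sum over time of mask(t) * x(t) x(t)^H for every (batch, frequency).
    return np.einsum("bft,bfct,bfdt->bfcd", mask, spec, spec.conj())


rng = np.random.default_rng(0)
B, T, C, F = 2, 50, 3, 5
spec = rng.standard_normal((B, T, C, F)) + 1j * rng.standard_normal((B, T, C, F))
mask = rng.uniform(size=(B, T, C, F))
psd = masked_psd(spec, mask)
print(psd.shape)  # (2, 5, 3, 3)
```

Each resulting (C, C) matrix is Hermitian, which is what the covariance arguments `psd_n` and `psd_speech` are expected to be.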
predict_mask(data: Tensor | ComplexTensor, ilens: LongTensor) → Tuple[Tuple[Tensor, ...], LongTensor]
Predict masks for beamforming.
This method takes input data and estimates the masks required for beamforming based on the learned model parameters. The input data should be a complex tensor representing the signals to be processed.
- Parameters:
- data (torch.complex64/ComplexTensor) – Input data of shape (B, T, C, F) where:
- B: Batch size
- T: Time or sequence length
- C: Number of channels
- F: Number of frequency bins
- ilens (torch.Tensor) – Tensor of shape (B,) representing the actual lengths of each input sequence in the batch.
- Returns: A tuple containing:
- masks (torch.Tensor): Estimated masks of shape (B, T, C, F) used for beamforming.
- ilens (torch.LongTensor): The input lengths tensor of shape (B,).
- Return type: Tuple[Tuple[torch.Tensor, …], torch.LongTensor]
########### Examples
>>> model = DNN_Beamformer(...)
>>> data = torch.randn(8, 100, 2, 256, dtype=torch.complex64)
>>> ilens = torch.tensor([100] * 8)
>>> masks, lengths = model.predict_mask(data, ilens)
>>> print(masks[0].shape)  # first mask; it keeps the batch dimension (documented layout: (B, T, C, F))
####### NOTE The input data should be preprocessed and in double precision for optimal performance. The output masks can then be used for enhancing the input signals using various beamforming techniques.
- Raises: ValueError – If the input data dimensions do not match the expected shape.
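The `mask_flooring` and `flooring_thres` options can be pictured as a lower clamp applied to each predicted mask. The sketch below assumes that interpretation; the helper name `floor_masks` is hypothetical, and NumPy stands in for PyTorch to keep the example dependency-light.

```python
import numpy as np


def floor_masks(masks, flooring_thres=1e-6):
    """Clamp every mask from below; values above the threshold are unchanged."""
    return tuple(np.maximum(m, flooring_thres) for m in masks)


# Two toy masks standing in for the tuple returned by predict_mask.
masks = (np.array([0.0, 0.2, 1.0]), np.array([1e-9, 0.5, 0.9]))
floored = floor_masks(masks)
print(floored[0])
print(floored[1])
```

Flooring of this kind keeps mask values strictly positive, which helps avoid zero denominators when masks are later normalised over time.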