espnet2.enh.layers.dnn_beamformer.DNN_Beamformer

class espnet2.enh.layers.dnn_beamformer.DNN_Beamformer(bidim, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, num_spk: int = 1, use_noise_mask: bool = True, nonlinear: str = 'sigmoid', dropout_rate: float = 0.0, badim: int = 320, ref_channel: int = -1, beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, mwf_mu: float = 1.0, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True, use_torchaudio_api: bool = False, btaps: int = 5, bdelay: int = 3)

Bases: Module

DNN mask based Beamformer.

This class implements a deep neural network (DNN) based beamformer for enhancing speech signals in multi-channel audio. The beamformer utilizes various algorithms, including MVDR, MPDR, and WPD, and is capable of estimating masks for speech and noise.

Citation: Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; http://proceedings.mlr.press/v70/ochiai17a/ochiai17a.pdf

mask

An instance of the MaskEstimator used for estimating masks for beamforming.

Type: MaskEstimator

ref

An optional attention-based reference used for beamforming.

Type: AttentionReference or None

ref_channel

Index of the reference channel for beamforming.

Type: int
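A fixed reference channel is commonly represented as a one-hot selection vector over the C channels. The sketch below illustrates that idea in plain Python. The helper name `one_hot_reference` is hypothetical, and treating a negative index as "no fixed channel" (so that the attention-based `ref` would be used instead) is an assumption suggested by the `-1` default, not confirmed behavior.

```python
from typing import List, Optional


def one_hot_reference(ref_channel: int, num_channels: int) -> Optional[List[float]]:
    """Hypothetical helper: build a one-hot channel-selection vector.

    Returns None for a negative index, which this sketch treats as
    "no fixed channel; a learned reference would be used instead".
    """
    if ref_channel < 0:
        return None
    if ref_channel >= num_channels:
        raise ValueError(
            f"ref_channel={ref_channel} is out of range for {num_channels} channels"
        )
    return [1.0 if c == ref_channel else 0.0 for c in range(num_channels)]


u_fixed = one_hot_reference(0, 4)     # selects channel 0 out of 4
u_learned = one_hot_reference(-1, 4)  # no fixed channel in this sketch
```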

use_noise_mask

Flag indicating whether to use noise mask.

Type: bool

num_spk

Number of speakers to separate.

Type: int

nmask

Number of masks to be estimated.

Type: int
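As a rough illustration of how `nmask` could relate to `num_spk` and `use_noise_mask`, the sketch below assumes one mask per speaker plus one optional noise mask. The function `count_masks` is hypothetical and the rule is an assumption, not taken from the class itself.

```python
def count_masks(num_spk: int, use_noise_mask: bool) -> int:
    """Assumed rule: one mask per speaker, plus one extra mask for noise."""
    if num_spk < 1:
        raise ValueError(f"num_spk must be at least 1, got {num_spk}")
    return num_spk + 1 if use_noise_mask else num_spk


masks_default = count_masks(num_spk=1, use_noise_mask=True)
masks_two_speakers = count_masks(num_spk=2, use_noise_mask=False)
```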

beamformer_type

Type of beamformer to use (e.g., “mvdr_souden”).

Type: str

rtf_iterations

Number of iterations for estimating the RTF.

Type: int

mwf_mu

Weight for noise suppression in SDW-MWF.

Type: float
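To make the role of `mwf_mu` concrete, here is a sketch of the textbook speech-distortion-weighted multichannel Wiener filter, w = (Φ_s + μ Φ_n)⁻¹ Φ_s u, written with NumPy. It is an illustration of the general formula, not necessarily what this class computes, and the function names are hypothetical. A larger μ trades more speech distortion for less residual noise.

```python
import numpy as np


def sdw_mwf_weights(psd_speech, psd_noise, ref_channel=0, mwf_mu=1.0):
    """Textbook SDW-MWF for one frequency bin: (Phi_s + mu * Phi_n)^{-1} Phi_s u."""
    num_channels = psd_speech.shape[-1]
    u = np.zeros(num_channels, dtype=complex)
    u[ref_channel] = 1.0  # one-hot reference-channel selector
    return np.linalg.solve(psd_speech + mwf_mu * psd_noise, psd_speech @ u)


def residual_noise_power(weights, psd_noise):
    """Noise power at the beamformer output: w^H Phi_n w."""
    return float(np.real(weights.conj() @ psd_noise @ weights))


# Toy single-bin statistics: rank-1 speech covariance and white noise.
steering = np.array([1.0, 0.5 + 0.5j, -0.3j])
psd_speech = np.outer(steering, steering.conj())
psd_noise = 0.5 * np.eye(3)

noise_mu1 = residual_noise_power(
    sdw_mwf_weights(psd_speech, psd_noise, mwf_mu=1.0), psd_noise
)
noise_mu5 = residual_noise_power(
    sdw_mwf_weights(psd_speech, psd_noise, mwf_mu=5.0), psd_noise
)
```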

btaps

Number of taps for WPD beamformer.

Type: int

bdelay

Delay for WPD beamformer.

Type: int
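The two WPD settings can be read as describing which past frames are stacked with the current one. The sketch below assumes the common convention of the current frame plus `btaps` delayed frames starting `bdelay` frames back, which gives the stacked dimension `(btaps+1)*C` quoted for `psd_n` on this page. Both helper names are hypothetical.

```python
from typing import List


def wpd_frame_offsets(btaps: int, bdelay: int) -> List[int]:
    """Frames (counted backwards from the current one) that get stacked.

    Assumed convention: offset 0 is the current frame, followed by
    `btaps` consecutive frames starting `bdelay` frames in the past.
    """
    return [0] + [bdelay + k for k in range(btaps)]


def wpd_stacked_dim(num_channels: int, btaps: int) -> int:
    """Size of the stacked observation vector: (btaps + 1) * C."""
    return (btaps + 1) * num_channels


offsets = wpd_frame_offsets(btaps=5, bdelay=3)
stacked_dim = wpd_stacked_dim(num_channels=2, btaps=5)
```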

eps

Small value to avoid division by zero.

Type: float

diagonal_loading

Flag for applying diagonal loading.

Type: bool

diag_eps

Small value for diagonal loading.

Type: float
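Diagonal loading adds a small multiple of the identity to a covariance matrix so that it can be inverted reliably. The following NumPy sketch scales the loading by the matrix trace and adds `eps` on top. That particular scaling is an assumption chosen for illustration, and the function name is hypothetical.

```python
import numpy as np


def diagonal_loading(psd, diag_eps=1e-7, eps=1e-6):
    """Return psd + (trace(psd) * diag_eps + eps) * I for each (C, C) matrix."""
    num_channels = psd.shape[-1]
    trace = np.trace(psd, axis1=-2, axis2=-1).real   # (...,)
    loading = trace[..., None, None] * diag_eps + eps  # (..., 1, 1)
    return psd + loading * np.eye(num_channels)


# A batch of rank-1 (singular) matrices with shape (B, F, C, C).
singular = np.ones((2, 3, 2, 2), dtype=complex)
loaded = diagonal_loading(singular)
```

Only the diagonal entries change, so the off-diagonal (cross-channel) statistics are preserved.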

mask_flooring

Flag for applying mask flooring.

Type: bool

flooring_thres

Threshold for mask flooring.

Type: float
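A minimal sketch of mask flooring, assuming it means clamping every mask value below `flooring_thres` up to that threshold. The function `floor_mask` is hypothetical and operates on nested lists to stay dependency-free.

```python
from typing import List


def floor_mask(mask: List[List[float]], flooring_thres: float = 1e-6) -> List[List[float]]:
    """Clamp each mask value from below; the input is left unmodified."""
    return [[max(value, flooring_thres) for value in row] for row in mask]


mask = [[0.0, 0.5], [1e-9, 1.0]]
floored = floor_mask(mask)
```

Keeping mask values strictly positive avoids all-zero masks, which would otherwise produce zero-valued covariance estimates.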

use_torch_solver

Flag indicating whether to use Torch solver.

Type: bool

Parameters:
- bidim (int) – Input feature dimension.
- btype (str) – Type of DNN architecture (default: “blstmp”).
- blayers (int) – Number of layers in the DNN (default: 3).
- bunits (int) – Number of units in each DNN layer (default: 300).
- bprojs (int) – Number of projections (default: 320).
- num_spk (int) – Number of speakers (default: 1).
- use_noise_mask (bool) – Whether to use noise mask (default: True).
- nonlinear (str) – Nonlinear activation function (default: “sigmoid”).
- dropout_rate (float) – Dropout rate (default: 0.0).
- badim (int) – Dimension for attention reference (default: 320).
- ref_channel (int) – Index of reference channel (default: -1).
- beamformer_type (str) – Type of beamformer (default: “mvdr_souden”).
- rtf_iterations (int) – Number of iterations for RTF estimation (default: 2).
- mwf_mu (float) – Noise suppression weight for SDW-MWF (default: 1.0).
- eps (float) – Small constant to prevent division by zero (default: 1e-6).
- diagonal_loading (bool) – Flag for diagonal loading (default: True).
- diag_eps (float) – Small value for diagonal loading (default: 1e-7).
- mask_flooring (bool) – Flag for applying mask flooring (default: False).
- flooring_thres (float) – Threshold for mask flooring (default: 1e-6).
- use_torch_solver (bool) – Use Torch solver (default: True).
- use_torchaudio_api (bool) – Use torchaudio API (default: False).
- btaps (int) – Number of taps for WPD (default: 5).
- bdelay (int) – Delay for WPD (default: 3).
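The `rtf_iterations` parameter suggests an iterative estimate of the relative transfer function (RTF). One common choice is the power method applied to Φ_n⁻¹ Φ_s. The sketch below shows that technique with NumPy for a single frequency bin. It is not necessarily the procedure this class uses, and the function name is hypothetical.

```python
import numpy as np


def estimate_rtf_power_method(psd_speech, psd_noise, ref_channel=0, iterations=2):
    """Power-method RTF estimate for one frequency bin.

    Iterates v <- Phi_n^{-1} Phi_s v starting from the reference-channel
    unit vector, maps the result back with Phi_n, and normalizes so the
    reference channel has unit gain.
    """
    num_channels = psd_speech.shape[-1]
    phi = np.linalg.solve(psd_noise, psd_speech)  # Phi_n^{-1} Phi_s
    rtf = np.zeros(num_channels, dtype=complex)
    rtf[ref_channel] = 1.0
    for _ in range(iterations):
        rtf = phi @ rtf
        rtf = rtf / np.linalg.norm(rtf)  # keep the iterate well scaled
    rtf = psd_noise @ rtf
    return rtf / rtf[ref_channel]


# Toy statistics: rank-1 speech covariance built from a known steering vector.
steering = np.array([1.0, 0.5 + 0.5j, -0.3j])
psd_speech = np.outer(steering, steering.conj())
psd_noise = 0.5 * np.eye(3) + 0.1 * np.ones((3, 3))
rtf = estimate_rtf_power_method(psd_speech, psd_noise)
```

With a rank-1 speech covariance the estimate recovers the steering vector relative to the reference channel.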

########### Examples

>>> # Initialize the DNN_Beamformer
>>> beamformer = DNN_Beamformer(bidim=128, num_spk=2)
>>> # Forward pass through the beamformer
>>> enhanced, ilens, masks = beamformer(data, ilens)

Raises: ValueError – If an unsupported beamformer type is provided or if the number of speakers is less than 1.

Initialize internal Module state, shared by both nn.Module and ScriptModule.

apply_beamforming(data, ilens, psd_n, psd_speech, psd_distortion=None, rtf_mat=None, spk=0)

Beamforming with the provided statistics.

This method applies beamforming techniques using the provided noise and speech covariance matrices, as well as other optional parameters. The implementation varies based on the type of beamformer being used, such as MVDR, MPDR, WPD, and others.

Parameters:
- data (torch.complex64/ComplexTensor) – Input tensor of shape (B, F, C, T), where B is the batch size, F is the number of frequency bins, C is the number of channels, and T is the time dimension.
- ilens (torch.Tensor) – A tensor of shape (B,) representing the lengths of the input sequences.
- psd_n (torch.complex64/ComplexTensor) – Noise covariance matrix for MVDR (shape: (B, F, C, C)), observation covariance matrix for MPDR/wMPDR, or stacked observation covariance for WPD (shape: (B, F, (btaps+1)*C, (btaps+1)*C)).
- psd_speech (torch.complex64/ComplexTensor) – Speech covariance matrix (shape: (B, F, C, C)).
- psd_distortion (torch.complex64/ComplexTensor , optional) – Distortion covariance matrix (shape: (B, F, C, C)).
- rtf_mat (torch.complex64/ComplexTensor , optional) – RTF matrix (shape: (B, F, C, num_spk)).
- spk (int , optional) – Speaker index. Default is 0.
Returns:
- enhanced (torch.complex64/ComplexTensor) – Enhanced output tensor of shape (B, F, T).
- ws (torch.complex64/ComplexTensor) – Weight vectors of shape (B, F) or (B, F, (btaps+1)*C).
Raises: ValueError – If the beamformer type is not supported.

########### Examples

>>> beamformer = DNN_Beamformer(...)
>>> enhanced, weights = beamformer.apply_beamforming(data, ilens, psd_n,
... psd_speech)

####### NOTE The computation depends on the configured beamformer_type, which also determines how psd_n is interpreted: noise covariance for MVDR, observation covariance for MPDR/wMPDR, or stacked observation covariance for WPD.
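For the default `mvdr_souden` type, the weights follow the well-known Souden formulation, w = (Φ_n⁻¹ Φ_s) u / trace(Φ_n⁻¹ Φ_s). The NumPy sketch below applies it per frequency bin to data laid out as (F, C, T), without a batch axis. It illustrates the formula only. The function names are hypothetical and the class's own code path may differ, for example in how it regularizes or solves the system.

```python
import numpy as np


def mvdr_souden_weights(psd_speech, psd_noise, ref_channel=0, eps=1e-6):
    """MVDR (Souden) weights per frequency bin; inputs have shape (F, C, C)."""
    numerator = np.linalg.solve(psd_noise, psd_speech)          # (F, C, C)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[..., None]  # (F, 1)
    # Selecting a column is the same as multiplying by the one-hot vector u.
    return numerator[..., ref_channel] / (trace + eps)          # (F, C)


def apply_weights(weights, data):
    """enhanced[f, t] = sum_c conj(w[f, c]) * x[f, c, t]."""
    return np.einsum("fc,fct->ft", weights.conj(), data)


# Toy noise-free scene: one source observed through a known steering vector.
rng = np.random.default_rng(0)
steering = np.array([[1.0, 0.5j, -0.5], [1.0, 0.2 + 0.3j, 0.7j]])        # (F=2, C=3)
source = rng.standard_normal((2, 5)) + 1j * rng.standard_normal((2, 5))  # (F, T=5)
data = steering[:, :, None] * source[:, None, :]                         # (F, C, T)

psd_speech = np.einsum("fc,fe->fce", steering, steering.conj())
psd_noise = np.broadcast_to(np.eye(3), (2, 3, 3)).copy()

weights = mvdr_souden_weights(psd_speech, psd_noise)
enhanced = apply_weights(weights, data)
```

In this noise-free toy case the output closely matches the reference-channel signal, which is the distortionless property the MVDR constraint is meant to enforce.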

forward(data: Tensor | ComplexTensor, ilens: LongTensor, powers: List[Tensor] | None = None, oracle_masks: List[Tensor] | None = None) → Tuple[Tensor | ComplexTensor, LongTensor, Tensor]

DNN_Beamformer forward function.

This method performs the forward pass for the DNN-based beamformer, applying the beamforming process to the input data. It takes in the data, input lengths, optional power spectra, and oracle masks, and produces enhanced signals, updated input lengths, and estimated masks.

Notation:
- B: Batch
- C: Channel
- T: Time or sequence length
- F: Frequency

Parameters:
- data (torch.complex64/ComplexTensor) – Input tensor of shape (B, T, C, F).
- ilens (torch.Tensor) – Input lengths of shape (B,).
- powers (List[torch.Tensor] or None) – Optional power spectra used for wMPDR or WPD, each of shape (B, F, T).
- oracle_masks (List[torch.Tensor] or None) – Optional oracle masks of shape (B, F, C, T). If provided, these masks are used instead of the estimated masks.
Returns:
- enhanced (torch.complex64/ComplexTensor) – Enhanced output of shape (B, T, F).
- ilens (torch.Tensor) – Updated input lengths of shape (B,).
- masks (torch.Tensor) – Estimated masks of shape (B, T, C, F).

########### Examples

>>> model = DNN_Beamformer(...)
>>> data = torch.randn(4, 160, 2, 64, dtype=torch.complex64)
>>> ilens = torch.tensor([160, 160, 160, 160])
>>> enhanced, ilens, masks = model.forward(data, ilens)

####### NOTE The forward method assumes that the input data is complex and that the beamforming statistics are properly initialized in the DNN_Beamformer class. If oracle masks are provided, they should have the correct shape to match the number of channels in the input data.

Raises: ValueError – If the specified beamformer type is not supported.
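Mask-based beamformers typically turn the estimated masks into covariance (PSD) matrices before computing weights, using a mask-weighted average of outer products: Φ(f) = Σ_t m(f, t) x(f, t) x(f, t)ᴴ / Σ_t m(f, t). The NumPy sketch below illustrates that step. As a simplification it assumes a single mask shared by all channels, with shape (F, T), and no batch axis. The function name is hypothetical.

```python
import numpy as np


def masked_psd(data, mask, eps=1e-15):
    """Mask-weighted covariance per frequency bin.

    data: complex array of shape (F, C, T)
    mask: real array of shape (F, T), values in [0, 1]
    returns: complex array of shape (F, C, C)
    """
    norm_mask = mask / (mask.sum(axis=-1, keepdims=True) + eps)
    return np.einsum("fct,ft,fet->fce", data, norm_mask, data.conj())


rng = np.random.default_rng(0)
data = rng.standard_normal((2, 3, 10)) + 1j * rng.standard_normal((2, 3, 10))
speech_mask = rng.uniform(0.0, 1.0, size=(2, 10))

psd_speech = masked_psd(data, speech_mask)
psd_noise = masked_psd(data, 1.0 - speech_mask)
```

Each resulting matrix is Hermitian, and an all-ones mask reduces the estimate to the plain time-averaged covariance.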

predict_mask(data: Tensor | ComplexTensor, ilens: LongTensor) → Tuple[Tuple[Tensor, ...], LongTensor]

Predict masks for beamforming.

This method takes input data and estimates the masks required for beamforming based on the learned model parameters. The input data should be a complex tensor representing the signals to be processed.

Parameters:
- data (torch.complex64/ComplexTensor) – Input data of shape (B, T, C, F) where:
  - B: Batch size
  - T: Time or sequence length
  - C: Number of channels
  - F: Number of frequency bins
- ilens (torch.Tensor) – Tensor of shape (B,) representing the actual lengths of each input sequence in the batch.
Returns: A tuple containing:
- masks (Tuple[torch.Tensor, …]) – Estimated masks, each of shape (B, T, C, F), used for beamforming.
- ilens (torch.LongTensor) – The input lengths tensor of shape (B,).
Return type: Tuple[Tuple[torch.Tensor, …], torch.LongTensor]

########### Examples

>>> model = DNN_Beamformer(...)
>>> data = torch.randn(8, 100, 2, 256, dtype=torch.complex64)
>>> ilens = torch.tensor([100] * 8)
>>> masks, lengths = model.predict_mask(data, ilens)
>>> print(masks[0].shape)  # Expected: torch.Size([8, 100, 2, 256]), i.e. (B, T, C, F)

####### NOTE The input data should be preprocessed and in double precision for optimal performance. The output masks can then be used for enhancing the input signals using various beamforming techniques.

Raises: ValueError – If the input data dimensions do not match the expected shape.
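This page documents two tensor layouts: `forward` and `predict_mask` take (B, T, C, F), while `apply_beamforming` takes (B, F, C, T). Converting between them is a swap of the time and frequency axes. The sketch below shows this with NumPy; with PyTorch tensors, `permute(0, 3, 2, 1)` performs the same reordering. The helper name `to_freq_major` and its dimensionality check are illustrative and not part of the class.

```python
import numpy as np


def to_freq_major(x):
    """(B, T, C, F) -> (B, F, C, T) by swapping the time and frequency axes."""
    if x.ndim != 4:
        raise ValueError(f"expected a 4-D array (B, T, C, F), got {x.ndim}-D")
    return np.transpose(x, (0, 3, 2, 1))


x_btcf = np.zeros((2, 5, 3, 4), dtype=complex)  # B=2, T=5, C=3, F=4
x_btcf[1, 2, 0, 3] = 1.0 + 2.0j                 # batch 1, frame 2, channel 0, bin 3
x_bfct = to_freq_major(x_btcf)
```

Applying the same permutation twice restores the original layout, so one helper covers both directions.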