espnet2.enh.layers.tcn.TemporalConvNetInformed
class espnet2.enh.layers.tcn.TemporalConvNetInformed(N, B, H, P, X, R, Sc=None, out_channel=None, norm_type='gLN', causal=False, pre_mask_nonlinear='prelu', mask_nonlinear='relu', i_adapt_layer: int = 7, adapt_layer_type: str = 'mul', adapt_enroll_dim: int = 128, **adapt_layer_kwargs)
Bases: TemporalConvNet
Basic Module of TasNet with adaptation layers.
This class extends the basic TemporalConvNet to include adaptation layers that modify the output based on the speaker embedding provided. It is designed for speech separation tasks.
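For intuition, a "mul"-type adaptation layer can be pictured as element-wise scaling of intermediate TCN activations by a (projected) speaker embedding. The snippet below is a conceptual sketch only, not the ESPnet implementation; the tensor names are illustrative, and the actual layer variants live in espnet2.enh.layers.adapt_layers:

>>> import torch
>>> feats = torch.randn(4, 128, 50)      # hypothetical [M, C, K] activations
>>> emb = torch.randn(4, 128)            # speaker embedding projected to C channels
>>> adapted = feats * emb.unsqueeze(-1)  # broadcast over time and scale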
Args:
- N: Number of filters in autoencoder.
- B: Number of channels in bottleneck 1x1-conv block.
- H: Number of channels in convolutional blocks.
- P: Kernel size in convolutional blocks.
- X: Number of convolutional blocks in each repeat.
- R: Number of repeats.
- Sc: Number of channels in skip-connection paths' 1x1-conv blocks.
- out_channel: Number of output channels. If None, N will be used.
- norm_type: Normalization type; options are BN, gLN, cLN.
- causal: If True, the model will be causal.
- pre_mask_nonlinear: Non-linear function before masknet.
- mask_nonlinear: Non-linear function used to generate the mask.
- i_adapt_layer: Index of the adaptation layer.
- adapt_layer_type: Type of adaptation layer; see espnet2.enh.layers.adapt_layers for options.
- adapt_enroll_dim: Dimensionality of the speaker embedding.
Raises: ValueError: If an unsupported mask non-linear function is specified.
Examples:
>>> import torch
>>> # Create an instance of TemporalConvNetInformed
>>> model = TemporalConvNetInformed(
...     N=256, B=128, H=256, P=3, X=8, R=3, Sc=64, out_channel=256,
...     norm_type="gLN", causal=True, pre_mask_nonlinear="prelu",
...     mask_nonlinear="relu", i_adapt_layer=7, adapt_layer_type="mul",
...     adapt_enroll_dim=128,
... )
>>> # Forward pass with mixture and enrollment embedding
>>> mixture = torch.randn(32, 256, 100)  # [M, N, K]
>>> enroll_emb = torch.randn(32, 256)    # [M, 2*adapt_enroll_dim] since Sc is set
>>> output_mask = model(mixture, enroll_emb)
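The estimated mask is not the separated signal itself; in a TasNet-style pipeline it is applied to the encoder output before decoding. A minimal continuation of the example above (the decoder step is only indicated, since it belongs to the surrounding TasNet model rather than this class):

>>> target_w = mixture * output_mask  # masked mixture representation, [M, N, K]
>>> # target_w would then be passed to the TasNet decoder to reconstruct the waveform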
forward(mixture_w, enroll_emb)
TasNet forward with adaptation layers.
This method processes the mixture representation mixture_w (the TasNet encoder output) with the temporal convolutional network and produces an estimated mask. The adaptation layer conditions the network on the enrollment speaker embedding, so the mask targets the enrolled speaker.
- Parameters:
- mixture_w – A tensor of shape [M, N, K], where M is the batch size, N is the number of input channels, and K is the sequence length.
- enroll_emb – A tensor that represents the speaker embedding, with shape [M, 2*adapt_enroll_dim] if skip connections are used, or [M, adapt_enroll_dim] if not.
- Returns: A tensor of shape [M, N, K] representing the estimated mask for the input mixture.
- Return type: est_mask
- Raises: ValueError – If an unsupported mask non-linear function is specified.
Examples
>>> import torch
>>> model = TemporalConvNetInformed(N=64, B=32, H=16, P=3, X=4, R=2)
>>> mixture_w = torch.randn(8, 64, 100) # Example input
>>> enroll_emb = torch.randn(8, 128)  # [M, adapt_enroll_dim] since Sc=None
>>> estimated_mask = model(mixture_w, enroll_emb)
>>> print(estimated_mask.shape) # Output: torch.Size([8, 64, 100])
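The expected embedding width follows the shape convention documented above: [M, adapt_enroll_dim] without skip connections (Sc=None, as in this example) and [M, 2*adapt_enroll_dim] when Sc is set. A sketch of the skip-connection case, assuming otherwise identical constructor arguments:

>>> model_skip = TemporalConvNetInformed(N=64, B=32, H=16, P=3, X=4, R=2,
...                                      Sc=32, adapt_enroll_dim=128)
>>> enroll_emb = torch.randn(8, 256)  # [M, 2*adapt_enroll_dim]
>>> est_mask = model_skip(torch.randn(8, 64, 100), enroll_emb)  # [M, N, K]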