espnet2.asr.preencoder.sinc.LightweightSincConvs


class espnet2.asr.preencoder.sinc.LightweightSincConvs(fs: int | str | float = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')

Bases: AbsPreEncoder

Lightweight Sinc Convolutions for end-to-end speech recognition.

This class implements lightweight Sinc convolutions to process raw audio input directly for speech recognition, as described in the paper “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. (https://arxiv.org/abs/2010.07597).
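As a rough illustration of what a Sinc filter is, the sketch below builds a textbook windowed-sinc band-pass kernel in plain Python: the difference of two ideal low-pass sinc responses, tapered by a Hamming window. The function names, the 101-tap length, and the 300 to 800 Hz band are assumptions chosen for illustration; this is not ESPnet's implementation, which learns the band edges and may parameterize them differently.

```python
import math


def sinc(x: float) -> float:
    """Unnormalized sinc, sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def hamming(n: int, length: int) -> float:
    """Hamming window value at tap n of a window with `length` taps."""
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))


def sinc_bandpass_kernel(f_low_hz: float, f_high_hz: float, fs: float,
                         length: int = 101) -> list:
    """Band-pass FIR kernel: high-cutoff low-pass minus low-cutoff low-pass."""
    f1 = f_low_hz / fs   # normalized cutoffs in cycles per sample
    f2 = f_high_hz / fs
    half = (length - 1) // 2
    kernel = []
    for n in range(length):
        t = n - half  # tap index centered on zero
        ideal = (2.0 * f2 * sinc(2.0 * math.pi * f2 * t)
                 - 2.0 * f1 * sinc(2.0 * math.pi * f1 * t))
        kernel.append(ideal * hamming(n, length))
    return kernel


# Hypothetical band chosen only for the example.
kernel = sinc_bandpass_kernel(300.0, 800.0, fs=16000)
```

Each filter is fully described by its two cutoff frequencies, so a layer of such filters has far fewer trainable parameters than a free-form convolution with the same kernel length.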

The architecture processes audio through a series of convolutional blocks that utilize Sinc filters, followed by normalization and pooling layers. To integrate this pre-encoder in your model, specify preencoder: sinc and use frontend: sliding_window in your YAML configuration file. The data flow is as follows:

Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder

With this setup, data augmentation is performed in the time domain, in contrast to the spectral-domain augmentation of the default frontend. To visualize the learned Sinc filters, use plot_sinc_filters.py.
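To make the role of the sliding-window frontend concrete, here is a minimal pure-Python sketch of how raw audio might be cut into overlapping frames before reaching the pre-encoder. The helper name `frame_signal` and the values `win_length=400` and `hop_length=160` (25 ms and 10 ms at 16 kHz) are assumptions for illustration; check the actual frontend configuration before relying on them.

```python
def frame_signal(samples, win_length=400, hop_length=160):
    """Split a 1-D sequence of samples into overlapping frames.

    Returns a list of T frames with win_length samples each. Trailing
    samples that do not fill a whole window are dropped.
    """
    if len(samples) < win_length:
        return []
    num_frames = (len(samples) - win_length) // hop_length + 1
    return [
        samples[i * hop_length : i * hop_length + win_length]
        for i in range(num_frames)
    ]


# One second of silence at 16 kHz, used only to show the resulting shape.
one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # T frames, each with D_in samples
```

Under these assumed settings every frame carries 400 samples, which matches the D_in=400 input dimension that the forward method expects.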

fs

Sample rate of the input audio.

Type: int

in_channels

Number of input channels.

Type: int

out_channels

Number of output channels per input channel.

Type: int

activation_type

Type of activation function to use.

Type: str

dropout_type

Type of dropout function to use.

Type: str

windowing_type

Type of windowing function to use.

Type: str

scale_type

Type of filter-bank initialization scale.

Type: str
Parameters:
- fs (Union[int, str, float]) – Sample rate. Defaults to 16000.
- in_channels (int) – Number of input channels. Defaults to 1.
- out_channels (int) – Number of output channels. Defaults to 256.
- activation_type (str) – Activation function type. Defaults to “leakyrelu”.
- dropout_type (str) – Dropout function type. Defaults to “dropout”.
- windowing_type (str) – Windowing function type. Defaults to “hamming”.
- scale_type (str) – Filter-bank initialization scale type. Defaults to “mel”.
Raises:
- NotImplementedError – If the specified dropout or activation type is not supported.

############# Examples

Initialize the pre-encoder with default parameters:

>>> import torch
>>> from espnet2.asr.preencoder.sinc import LightweightSincConvs
>>> sinc_preencoder = LightweightSincConvs()

Initialize with custom parameters:

>>> sinc_preencoder = LightweightSincConvs(
...     fs=16000,
...     in_channels=2,
...     out_channels=128,
...     activation_type='relu',
...     dropout_type='spatial',
... )

Forward pass with an input tensor (C_in must match in_channels):

>>> input_tensor = torch.randn(32, 100, 2, 400)  # (B, T, C_in, D_in)
>>> input_lengths = torch.tensor([100] * 32)
>>> output_tensor, lengths = sinc_preencoder(input_tensor, input_lengths)

######## NOTE This class relies on PyTorch and is designed to be compatible with ESPnet’s architecture for speech processing.

Initialize the module.

Parameters:
- fs – Sample rate.
- in_channels – Number of input channels.
- out_channels – Number of output channels (for each input channel).
- activation_type – Choice of activation function.
- dropout_type – Choice of dropout function.
- windowing_type – Choice of windowing function.
- scale_type – Choice of filter-bank initialization scale.

espnet_initialization_fn()

Initialize sinc filters with filterbank values.

This function initializes the sinc filters used in the Lightweight Sinc Convolutions by setting their values based on the filterbank initialization. It also sets the weights and biases of all BatchNorm layers in the model to ensure that they start with a neutral effect during training.

The initialization process involves the following steps:

- Call the init_filters() method on the filters attribute to initialize the sinc filters.
- Iterate through all the blocks of the model. For each block, check if it contains a BatchNorm layer with affine parameters enabled. If so, set the layer’s weight to 1.0 and bias to 0.0.
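As a hedged illustration of what a mel-scale filterbank initialization can look like, the sketch below spaces band edges evenly on the mel scale and converts them back to Hz. It uses one common form of the mel formula (2595 * log10(1 + f / 700)); the constants, the helper names, and the 30 Hz to 8000 Hz range are assumptions and may differ from what init_filters() actually does.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel using one common variant of the formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_filters: int, f_min_hz: float, f_max_hz: float) -> list:
    """Return (low, high) band edges in Hz, equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_max - m_min) / (num_filters + 1)
    points = [mel_to_hz(m_min + i * step) for i in range(num_filters + 2)]
    # Filter k spans points[k] .. points[k + 2], so neighbors overlap.
    return [(points[k], points[k + 2]) for k in range(num_filters)]


# Hypothetical settings: 8 filters between 30 Hz and the Nyquist frequency.
edges = mel_band_edges(8, 30.0, 8000.0)
for low, high in edges:
    print(f"{low:8.1f} Hz .. {high:8.1f} Hz")
```

Bands initialized this way are narrow at low frequencies and progressively wider toward the Nyquist frequency, which is the usual motivation for a mel-scale starting point.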

######## NOTE This method should be called after creating the sinc convolutions and before using the model for forward propagation.

############# Examples

>>> model = LightweightSincConvs()
>>> model.espnet_initialization_fn()  # Initialize filters

Raises:
- NotImplementedError – If the filterbank initialization fails.

forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]

Apply Lightweight Sinc Convolutions.

This method processes the input tensor using lightweight Sinc convolutions, transforming the input audio features into output features suitable for subsequent layers in the neural network.

The input tensor should be formatted as (B, T, C_in, D_in), where:

B: Batch size
T: Time dimension
C_in: Number of input channels
D_in: Feature dimension (should be 400 for current implementation)

The output tensor will be shaped as (B, T, C_out * D_out), where:

C_out: Number of output channels, as specified during initialization
D_out: Output feature dimension, which is 1 in this case.

######## NOTE The current implementation only supports D_in=400, leading to D_out=1. For multichannel input, C_out will be the product of the initialized out_channels and C_in.
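The shape bookkeeping described above can be sketched without running the model. The helper below is hypothetical (it is not part of ESPnet) and only encodes the rules stated in this section: D_in must be 400, D_out is 1, and C_out is out_channels multiplied by C_in.

```python
def lsc_output_shape(input_shape, out_channels=256):
    """Predict the output shape of the pre-encoder from the documented rules.

    input_shape is (B, T, C_in, D_in). The D_in check mirrors the documented
    restriction; the exact error raised by the real module may differ.
    """
    batch, time, c_in, d_in = input_shape
    if d_in != 400:
        raise ValueError(f"expected D_in=400, got D_in={d_in}")
    d_out = 1
    c_out = out_channels * c_in
    return (batch, time, c_out * d_out)


print(lsc_output_shape((8, 100, 1, 400)))                    # single channel
print(lsc_output_shape((8, 100, 2, 400), out_channels=128))  # two channels
```

Both calls give a last dimension of 256: one input channel with 256 output channels, or two input channels with 128 output channels each.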

Parameters:
- input (torch.Tensor) – Input tensor of shape (B, T, C_in, D_in).
- input_lengths (torch.Tensor) – Lengths of the input sequences.
Returns: A tuple containing:
- Output tensor of shape (B, T, C_out * D_out).
- Input lengths tensor.
Return type: Tuple[torch.Tensor, torch.Tensor]

############# Examples

>>> import torch
>>> from espnet2.asr.preencoder.sinc import LightweightSincConvs
>>> model = LightweightSincConvs()
>>> input_tensor = torch.randn(8, 100, 1, 400)  # (B, T, C_in, D_in)
>>> input_lengths = torch.tensor([100] * 8)  # One length per batch item
>>> output, lengths = model.forward(input_tensor, input_lengths)
>>> output.shape
torch.Size([8, 100, 256])

Raises:
- ValueError – If the input tensor does not have the expected shape.

gen_lsc_block(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)

Generate a convolutional block for Lightweight Sinc convolutions.

Each block consists of either a depthwise or a depthwise-separable convolution along with dropout, (batch-)normalization layer, and an optional average-pooling layer. This structure is designed to efficiently process audio data while maintaining the integrity of the signal through various transformations.

Parameters:
- in_channels (int) – Number of input channels.
- out_channels (int) – Number of output channels.
- depthwise_kernel_size (int , optional) – Kernel size of the depthwise convolution. Default is 9.
- depthwise_stride (int , optional) – Stride of the depthwise convolution. Default is 1.
- depthwise_groups (int , optional) – Number of groups for the depthwise convolution. If None, it is set to the greatest common divisor (GCD) of in_channels and out_channels.
- pointwise_groups (int , optional) – Number of groups for the pointwise convolution. Default is 0 (no grouping).
- dropout_probability (float , optional) – Dropout probability in the block. Default is 0.15.
- avgpool (bool , optional) – If True, an AvgPool layer is inserted. Default is False.
Returns: A sequential block containing the defined layers, ready to be used in a neural network architecture.
Return type: torch.nn.Sequential
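The default for depthwise_groups can be reproduced with the standard library. The helper name below is hypothetical; it only mirrors the rule stated in the parameter list (use the GCD of in_channels and out_channels when no value is given).

```python
import math


def default_depthwise_groups(in_channels, out_channels, depthwise_groups=None):
    """Resolve the group count as described for the depthwise_groups parameter."""
    if depthwise_groups is None:
        # The GCD always divides both channel counts, which a grouped
        # convolution requires.
        return math.gcd(in_channels, out_channels)
    return depthwise_groups


print(default_depthwise_groups(64, 128))    # channel counts share a factor of 64
print(default_depthwise_groups(96, 128))    # largest common factor is 32
print(default_depthwise_groups(64, 128, 8)) # an explicit value is passed through
```

Choosing the GCD gives the largest group count that is valid for both channel counts, so the depthwise layer is as sparse as the shapes allow.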

############# Examples

>>> model = LightweightSincConvs()
>>> lsc_block = model.gen_lsc_block(
...     in_channels=64,
...     out_channels=128,
...     depthwise_kernel_size=5,
...     avgpool=True
... )
>>> print(lsc_block)
Sequential(
  (depthwise): Conv1d(64, 128, kernel_size=(5,), stride=(1,), groups=64)
  (activation): LeakyReLU(negative_slope=0.01)
  (batchnorm): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (avgpool): AvgPool1d(kernel_size=2, stride=2, padding=0)
  (dropout): Dropout(p=0.15, inplace=False)
)

######## NOTE The use of depthwise separable convolutions allows for a more efficient network structure by reducing the number of parameters and computational cost compared to standard convolutions.
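The saving can be checked with simple arithmetic. The sketch below counts the parameters of a 1-D convolution from its weight shape (out_channels, in_channels / groups, kernel_size) plus one bias per output channel. The layer sizes are taken from the example above; the pointwise layer mapping 128 to 128 channels is an assumption about how a separable block would be arranged.

```python
def conv1d_params(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Count the trainable parameters of a grouped 1-D convolution."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("groups must divide both channel counts")
    weights = out_channels * (in_channels // groups) * kernel_size
    return weights + (out_channels if bias else 0)


# Standard convolution, 64 -> 128 channels, kernel size 5.
standard = conv1d_params(64, 128, 5)

# Depthwise convolution with groups=64, followed by an assumed 1x1 pointwise layer.
depthwise = conv1d_params(64, 128, 5, groups=64)
pointwise = conv1d_params(128, 128, 1)
separable = depthwise + pointwise

print(standard, depthwise, separable)
```

For these sizes the depthwise layer alone needs under 2% of the parameters of the standard convolution, and the depthwise-plus-pointwise pair needs well under half.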

Raises:
- NotImplementedError – If the specified depthwise or pointwise groups do not meet the required conditions.

output_size() → int

Get the output size of the Lightweight Sinc Convolutions.

This method calculates the output size based on the number of output channels and input channels defined during initialization. The output size is determined by the formula:

output_size = out_channels * in_channels

This output size represents the number of features produced by the Lightweight Sinc Convolutions for each input sample.

Returns: The computed output size.
Return type: int

############# Examples

>>> sinc_convs = LightweightSincConvs(in_channels=2, out_channels=256)
>>> output_size = sinc_convs.output_size()
>>> print(output_size)
512  # Since 2 (in_channels) * 256 (out_channels) = 512