espnet2.asr.frontend.asteroid_frontend.AsteroidFrontend
espnet2.asr.frontend.asteroid_frontend.AsteroidFrontend
class espnet2.asr.frontend.asteroid_frontend.AsteroidFrontend(sinc_filters: int = 256, sinc_kernel_size: int = 251, sinc_stride: int = 16, preemph_coef: float = 0.97, log_term: float = 1e-06)
Bases: AbsFrontend
AsteroidFrontend class for audio feature extraction using Sinc-convolution.
This class implements a Sinc-convolutional-based audio feature extractor designed for tasks such as sentence-level classification. It utilizes a parameterized analytic filterbank layer to process raw audio input data and extract meaningful features.
The functionality of this class can also be achieved by combining a sliding window frontend with a Sinc preencoder.
sinc_filters
Number of filters for Sinc convolution.
- Type: int
sinc_kernel_size
Kernel size for Sinc convolution.
- Type: int
sinc_stride
Stride size for the first Sinc convolution layer.
- Type: int
preemph_coef
Coefficient for preemphasis applied to the input.
- Type: float
log_term
A small constant added to prevent log of zero.
Type: float
Parameters:
- sinc_filters (int) – The number of Sinc filters. Default is 256.
- sinc_kernel_size (int) – The kernel size for Sinc convolution. Default is 251.
- sinc_stride (int) – The stride size for the Sinc convolution layer. Default is 16.
- preemph_coef (float) – The coefficient for preemphasis. Default is 0.97.
- log_term (float) – The log term to prevent infinity. Default is 1e-6.
forward(input
torch.Tensor, input_length: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: Applies the Asteroid filterbank frontend to the input tensor and returns the frame-wise output along with the adjusted input lengths.
output_size() → int
Returns the output length of the feature dimension.
######### Examples
>>> frontend = AsteroidFrontend(sinc_filters=128, sinc_kernel_size=251)
>>> input_tensor = torch.randn(10, 16000) # (B, T)
>>> input_length = torch.tensor([16000] * 10) # (B,)
>>> output, output_length = frontend(input_tensor, input_length)
NOTE
This class is primarily used for tasks related to speech separation and classification. Other applications are not thoroughly examined.
- Raises:AssertionError – If the input tensor does not have 2 dimensions.
Initialize.
- Parameters:
- sinc_filters – the filter numbers for sinc.
- sinc_kernel_size – the kernel size for sinc.
- sinc_stride – the sincstride size of the first sinc-conv layer where it decides the compression rate (Hz).
- preemph_coef – the coeifficient for preempahsis.
- log_term – the log term to prevent infinity.
#
forward(input
Apply the Asteroid filterbank frontend to the input audio data.
This method processes the input audio tensor using the Asteroid filterbank to extract frame-wise features suitable for downstream tasks. It includes preemphasis, normalization, and feature extraction through a Sinc-based convolutional layer.
- Parameters:
- input – A tensor of shape (B, T) representing the audio input, where B is the batch size and T is the length of the audio sequence.
- input_length – A tensor of shape (B,) containing the lengths of each audio sequence in the batch.
- Returns:
- Tensor: Frame-wise output of shape (B, T’, D), where T’ is the number of frames after processing and D is the output feature dimension.
- Tensor: Updated input lengths after processing, of shape (B,).
- Return type: A tuple containing
- Raises:AssertionError – If the input tensor does not have 2 dimensions.
######### Examples
>>> frontend = AsteroidFrontend()
>>> audio_input = torch.randn(4, 16000) # Batch of 4, 1 second audio
>>> input_length = torch.tensor([16000, 16000, 16000, 16000])
>>> output, new_lengths = frontend(audio_input, input_length)
>>> output.shape
torch.Size([4, T', 256]) # Example output shape
>>> new_lengths
tensor([T', T', T', T']) # Updated lengths after processing
NOTE
This function is primarily used in sentence-level classification tasks such as speaker recognition. Other use cases may not be fully explored.
output_size() → int
Return the output size of the feature dimension.
This method provides the size of the output features generated by the Asteroid filterbank frontend. The output size corresponds to the number of sinc filters specified during the initialization of the AsteroidFrontend class.
- Returns: The number of sinc filters used in the feature extraction.
- Return type: int
######### Examples
>>> frontend = AsteroidFrontend(sinc_filters=256)
>>> output_size = frontend.output_size()
>>> print(output_size)
256