espnet2.asr_transducer.frontend.online_audio_processor.OnlineAudioProcessor
class espnet2.asr_transducer.frontend.online_audio_processor.OnlineAudioProcessor(feature_extractor: Module, normalization_module: Module, decoding_window: int, encoder_sub_factor: int, frontend_conf: Dict, device: device, audio_sampling_rate: int = 16000)
Bases: object
Online processor for Transducer models chunk-by-chunk streaming decoding.
This class provides an online audio processing module designed to handle streaming audio input for Transducer models. It processes audio samples chunk-by-chunk and computes features required for speech recognition.
n_fft
Number of FFT components.
- Type: int
hop_sz
Hop size for feature extraction.
- Type: int
win_sz
Window size for feature extraction.
- Type: int
win_hop_sz
Window hop size for feature extraction.
- Type: int
trim_val
Trim value for features.
- Type: int
decoding_samples
Number of samples in the decoding window.
- Type: int
offset_frames
Number of frames to offset for feature extraction.
- Type: int
feature_extractor
Module to extract features from audio.
- Type: torch.nn.Module
normalization_module
Module to normalize extracted features.
- Type: torch.nn.Module
device
Device for tensor operations (CPU or GPU).
- Type: torch.device
samples
Cached audio samples for processing.
- Type: torch.Tensor
samples_length
Length of the cached audio samples.
- Type: torch.Tensor
feats
Cached features for processing.
- Type: torch.Tensor
Parameters:
- feature_extractor (torch.nn.Module) – Feature extractor module.
- normalization_module (torch.nn.Module) – Normalization module.
- decoding_window (int) – Size of the decoding window (in ms).
- encoder_sub_factor (int) – Encoder subsampling factor.
- frontend_conf (Dict) – Frontend configuration dictionary.
- device (torch.device) – Device to pin module tensors on.
- audio_sampling_rate (int, optional) – Input sampling rate (default: 16000).
############# Examples
Initialize the OnlineAudioProcessor
processor = OnlineAudioProcessor(
    feature_extractor=my_feature_extractor,
    normalization_module=my_normalization_module,
    decoding_window=25,
    encoder_sub_factor=4,
    frontend_conf={"n_fft": 512, "hop_length": 128, "win_length": 512},
    device=torch.device("cuda"),
    audio_sampling_rate=16000,
)
Reset cache parameters
processor.reset_cache()
Process audio samples
audio_samples = torch.randn(32000)  # Simulated audio samples
is_final_chunk = False
features, features_length = processor.compute_features(audio_samples, is_final_chunk)
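A full chunk-by-chunk session is a loop over fixed-size chunks with is_final set on the last one. The sketch below reuses the processor above; the 500 ms chunk size is arbitrary, and decoding_samples is presumably decoding_window * audio_sampling_rate / 1000 (25 ms at 16 kHz gives 400 samples):
waveform = torch.randn(64000)  # 4 seconds of simulated 16 kHz audio
processor.reset_cache()
chunks = waveform.split(8000)  # arbitrary 500 ms chunks
for i, chunk in enumerate(chunks):
    feats, feats_length = processor.compute_features(
        chunk, is_final=(i == len(chunks) - 1)
    )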
Notes
The input audio samples should be a 1D tensor of shape (S), where S is the number of audio samples.
- Raises: ValueError – If any of the input arguments are invalid.
Construct an OnlineAudioProcessor.
compute_features(samples: Tensor, is_final: bool) → Tuple[Tensor, Tensor]
Compute features from input samples.
This method processes the input speech samples to extract features using the feature extractor module. It also handles normalization if a normalization module is provided. The function maintains state between calls, allowing it to work with streaming audio data.
- Parameters:
- samples – Speech data. (S)
- is_final – Whether speech corresponds to the final chunk of data.
- Returns:
  - feats – Features sequence. (1, chunk_sz_bs, D_feats)
  - feats_length – Features length sequence. (1,)
- Return type: Tuple[Tensor, Tensor]
############# Examples
>>> processor = OnlineAudioProcessor(feature_extractor,
... normalization_module,
... decoding_window=20,
... encoder_sub_factor=4,
... frontend_conf=frontend_config,
... device=torch.device('cpu'))
>>> samples = torch.randn(16000) # 1 second of audio
>>> feats, feats_length = processor.compute_features(samples,
... is_final=False)
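Because the method caches state between calls, a streaming session is just repeated calls with is_final set on the last chunk (a minimal sketch; the chunk size is an assumption):
>>> chunks = torch.randn(48000).split(16000)  # three 1-second chunks
>>> for i, chunk in enumerate(chunks):
...     feats, feats_length = processor.compute_features(
...         chunk, is_final=(i == len(chunks) - 1))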
######## NOTE The method assumes that the feature extractor and normalization module are already defined and compatible with the expected input dimensions.
- Raises: ValueError – If the input samples are not of the expected dimensions or type.
get_current_feats(feats: Tensor, feats_length: Tensor, is_final: bool) → Tuple[Tensor, Tensor]
Get features for current decoding window.
This method processes the computed features sequence to prepare the features for the current decoding window. It handles both final and non-final chunks of data, adjusting the features accordingly.
- Parameters:
- feats – Computed features sequence. (1, F, D_feats)
- feats_length – Computed features sequence length. (1,)
- is_final – Whether feats corresponds to the final chunk of data.
- Returns:
  - feats – Decoding window features sequence. (1, chunk_sz_bs, D_feats)
  - feats_length – Decoding window features length sequence. (1,)
- Return type: Tuple[Tensor, Tensor]
############# Examples
>>> feats = torch.randn(1, 10, 64) # Example feature tensor
>>> feats_length = torch.tensor([10]) # Example length tensor
>>> is_final = False
>>> feats_out, feats_length_out = processor.get_current_feats(feats, feats_length, is_final)
######## NOTE If is_final is set to True, the method adjusts the features by trimming them based on the trim_val attribute. For non-final chunks, the features are processed to exclude the trimmed sections.
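As a picture of that trimming (illustrative only, not the actual implementation; trim_val=2 is a made-up value):
>>> trim_val = 2  # hypothetical value of processor.trim_val
>>> feats[:, trim_val:-trim_val, :].shape  # edge frames excluded (non-final case)
torch.Size([1, 6, 64])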
get_current_samples(samples: Tensor, is_final: bool) → Tensor
Get samples for feature computation.
This method processes the incoming audio samples to prepare them for feature extraction. It handles both final and intermediate chunks of audio data by ensuring the appropriate padding and reshaping.
- Parameters:
- samples – A tensor containing the speech data. Shape (S,) where S is the number of samples.
- is_final – A boolean indicating whether the provided samples correspond to the final chunk of data.
- Returns: A tensor containing the new speech data reshaped to (1, decoding_samples), where decoding_samples is the size of the decoding window in samples.
############# Examples
>>> processor = OnlineAudioProcessor(...)
>>> audio_chunk = torch.randn(3000) # Simulated audio samples
>>> final_chunk = processor.get_current_samples(audio_chunk, is_final=True)
>>> final_chunk.shape
torch.Size([1, 1600]) # Assuming decoding_samples is 1600
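The zero-padding described in the note below can be sketched with torch.nn.functional.pad (an illustration, not necessarily what the module does internally):
>>> import torch.nn.functional as F
>>> short = torch.randn(1000)  # fewer samples than decoding_samples
>>> padded = F.pad(short, (0, processor.decoding_samples - short.size(0)))
>>> padded.shape  # right-padded with zeros, again assuming decoding_samples is 1600
torch.Size([1600])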
######## NOTE If is_final is set to True and the number of incoming samples is less than the required decoding_samples, the method will pad the samples with zeros to meet the required length.
- Raises: ValueError – If the input tensor samples is empty.
reset_cache() → None
Reset cache parameters.
This method clears the internal cache of samples and features used during audio processing. It is typically called when starting a new processing session or when the existing cache needs to be refreshed.
samples
A tensor that holds the current audio samples.
samples_length
A tensor that tracks the length of the current samples.
feats
A tensor that holds the current features extracted from the audio samples.
- Parameters: None
- Returns: None
############# Examples
Create an instance of OnlineAudioProcessor
processor = OnlineAudioProcessor(feature_extractor, normalization_module,
                                 decoding_window, encoder_sub_factor,
                                 frontend_conf, device)
Reset the cache before processing new audio data
processor.reset_cache()
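A typical use is between utterances in a streaming session (sketch; the chunk variables are placeholders):
feats, feats_length = processor.compute_features(last_chunk_utt1, is_final=True)
processor.reset_cache()  # clear cached samples/features before the next utterance
feats, feats_length = processor.compute_features(first_chunk_utt2, is_final=False)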
######## NOTE This method does not take any parameters and does not return anything. It is primarily for internal use within the OnlineAudioProcessor class.