espnet2.asr_transducer.frontend.online_audio_processor.OnlineAudioProcessor
class espnet2.asr_transducer.frontend.online_audio_processor.OnlineAudioProcessor(feature_extractor: Module, normalization_module: Module, decoding_window: int, encoder_sub_factor: int, frontend_conf: Dict, device: device, audio_sampling_rate: int = 16000)
Bases: object
Online processor for Transducer models chunk-by-chunk streaming decoding.
This class provides an online audio processing module designed to handle streaming audio input for Transducer models. It processes audio samples chunk-by-chunk and computes features required for speech recognition.
n_fft
Number of FFT components.
- Type: int
hop_sz
Hop size for feature extraction.
- Type: int
win_sz
Window size for feature extraction.
- Type: int
win_hop_sz
Window hop size for feature extraction.
- Type: int
trim_val
Trim value for features.
- Type: int
decoding_samples
Number of samples in the decoding window.
- Type: int
offset_frames
Number of frames to offset for feature extraction.
- Type: int
feature_extractor
Module to extract features from audio.
- Type: torch.nn.Module
normalization_module
Module to normalize extracted features.
- Type: torch.nn.Module
device
Device for tensor operations (CPU or GPU).
- Type: torch.device
samples
Cached audio samples for processing.
- Type: torch.Tensor
samples_length
Length of the cached audio samples.
- Type: torch.Tensor
feats
Cached features for processing.
- Type: torch.Tensor
Parameters:
- feature_extractor (torch.nn.Module) – Feature extractor module.
- normalization_module (torch.nn.Module) – Normalization module.
- decoding_window (int) – Size of the decoding window (in ms).
- encoder_sub_factor (int) – Encoder subsampling factor.
- frontend_conf (Dict) – Frontend configuration dictionary.
- device (torch.device) – Device to pin module tensors on.
- audio_sampling_rate (int, optional) – Input sampling rate (default: 16000).
############# Examples
Initialize the OnlineAudioProcessor
processor = OnlineAudioProcessor(
    feature_extractor=my_feature_extractor,
    normalization_module=my_normalization_module,
    decoding_window=25,
    encoder_sub_factor=4,
    frontend_conf={"n_fft": 512, "hop_length": 128, "win_length": 512},
    device=torch.device("cuda"),
    audio_sampling_rate=16000,
)
Reset cache parameters
processor.reset_cache()
Process audio samples
audio_samples = torch.randn(32000)  # Simulated audio samples
is_final_chunk = False
features, features_length = processor.compute_features(audio_samples, is_final_chunk)
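A full chunk-by-chunk session is a loop over fixed-size chunks with is_final set on the last one. The sketch below reuses the processor above; the 500 ms chunk size is arbitrary, and decoding_samples is presumably decoding_window * audio_sampling_rate / 1000 (25 ms at 16 kHz gives 400 samples):
waveform = torch.randn(64000)  # 4 seconds of simulated 16 kHz audio
processor.reset_cache()
chunks = waveform.split(8000)  # arbitrary 500 ms chunks
for i, chunk in enumerate(chunks):
    feats, feats_length = processor.compute_features(
        chunk, is_final=(i == len(chunks) - 1)
    )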
Notes
The input audio samples should be a 1D tensor of shape (S), where S is the number of audio samples.
- Raises: ValueError – If any of the input arguments are invalid.
Construct an OnlineAudioProcessor.
compute_features(samples: Tensor, is_final: bool) → Tuple[Tensor, Tensor]
Compute features from input samples.
This method processes the input speech samples to extract features using the feature extractor module. It also handles normalization if a normalization module is provided. The function maintains state between calls, allowing it to work with streaming audio data.
- Parameters:
- samples – Speech data. (S)
- is_final – Whether speech corresponds to the final chunk of data.
- Returns:
  - feats – Features sequence. (1, chunk_sz_bs, D_feats)
  - feats_length – Features length sequence. (1,)
- Return type: Tuple[Tensor, Tensor]
############# Examples
>>> processor = OnlineAudioProcessor(feature_extractor,
... normalization_module,
... decoding_window=20,
... encoder_sub_factor=4,
... frontend_conf=frontend_config,
... device=torch.device('cpu'))
>>> samples = torch.randn(16000) # 1 second of audio
>>> feats, feats_length = processor.compute_features(samples,
... is_final=False)
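Because the method caches state between calls, a streaming session is just repeated calls with is_final set on the last chunk (a minimal sketch; the chunk size is an assumption):
>>> chunks = torch.randn(48000).split(16000)  # three 1-second chunks
>>> for i, chunk in enumerate(chunks):
...     feats, feats_length = processor.compute_features(
...         chunk, is_final=(i == len(chunks) - 1))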
######## NOTE The method assumes that the feature extractor and normalization module are already defined and compatible with the expected input dimensions.
- Raises: ValueError – If the input samples are not of the expected dimensions or type.
get_current_feats(feats: Tensor, feats_length: Tensor, is_final: bool) → Tuple[Tensor, Tensor]
Get features for current decoding window.
This method processes the computed features sequence to prepare the features for the current decoding window. It handles both final and non-final chunks of data, adjusting the features accordingly.
- Parameters:
- feats – Computed features sequence. (1, F, D_feats)
- feats_length – Computed features sequence length. (1,)
- is_final – Whether feats corresponds to the final chunk of data.
- Returns:
  - feats – Decoding window features sequence. (1, chunk_sz_bs, D_feats)
  - feats_length – Decoding window features length sequence. (1,)
- Return type: Tuple[Tensor, Tensor]
############# Examples
>>> feats = torch.randn(1, 10, 64) # Example feature tensor
>>> feats_length = torch.tensor([10]) # Example length tensor
>>> is_final = False
>>> feats_out, feats_length_out = processor.get_current_feats(feats, feats_length, is_final)
######## NOTE If is_final is set to True, the method adjusts the features by trimming them based on the trim_val attribute. For non-final chunks, the features are processed to exclude the trimmed sections.
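As a picture of that trimming (illustrative only, not the actual implementation; trim_val=2 is a made-up value):
>>> trim_val = 2  # hypothetical value of processor.trim_val
>>> feats[:, trim_val:-trim_val, :].shape  # edge frames excluded (non-final case)
torch.Size([1, 6, 64])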
get_current_samples(samples: Tensor, is_final: bool) → Tensor
Get samples for feature computation.
This method processes the incoming audio samples to prepare them for feature extraction. It handles both final and intermediate chunks of audio data by ensuring the appropriate padding and reshaping.
- Parameters:
- samples – A tensor containing the speech data. Shape (S,) where S is the number of samples.
- is_final – A boolean indicating whether the provided samples correspond to the final chunk of data.
- Returns: A tensor containing the new speech data reshaped to (1, decoding_samples), where decoding_samples is the size of the decoding window in samples.
############# Examples
>>> processor = OnlineAudioProcessor(...)
>>> audio_chunk = torch.randn(3000) # Simulated audio samples
>>> final_chunk = processor.get_current_samples(audio_chunk, is_final=True)
>>> final_chunk.shape
torch.Size([1, 1600]) # Assuming decoding_samples is 1600
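The zero-padding described in the note below can be sketched with torch.nn.functional.pad (an illustration, not necessarily what the module does internally):
>>> import torch.nn.functional as F
>>> short = torch.randn(1000)  # fewer samples than decoding_samples
>>> padded = F.pad(short, (0, processor.decoding_samples - short.size(0)))
>>> padded.shape  # right-padded with zeros, again assuming decoding_samples is 1600
torch.Size([1600])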
######## NOTE If is_final is set to True and the number of incoming samples is less than the required decoding_samples, the method will pad the samples with zeros to meet the required length.
- Raises: ValueError – If the input tensor samples is empty.
reset_cache() → None
Reset cache parameters.
This method clears the internal cache of samples and features used during audio processing. It is typically called when starting a new processing session or when the existing cache needs to be refreshed.
samples
A tensor that holds the current audio samples.
samples_length
A tensor that tracks the length of the current samples.
feats
A tensor that holds the current features extracted from the audio samples.
- Parameters: None
- Returns: None
############# Examples
Create an instance of OnlineAudioProcessor
processor = OnlineAudioProcessor(feature_extractor, normalization_module,
                                 decoding_window, encoder_sub_factor,
                                 frontend_conf, device)
Reset the cache before processing new audio data
processor.reset_cache()
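A typical use is between utterances in a streaming session (sketch; the chunk variables are placeholders):
feats, feats_length = processor.compute_features(last_chunk_utt1, is_final=True)
processor.reset_cache()  # clear cached samples/features before the next utterance
feats, feats_length = processor.compute_features(first_chunk_utt2, is_final=False)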
######## NOTE This method does not take any parameters and does not return anything. It is primarily for internal use within the OnlineAudioProcessor class.