espnet2.train.preprocessor.EnhPreprocessor
class espnet2.train.preprocessor.EnhPreprocessor(train: bool, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech_mix', speech_ref_name_prefix: str = 'speech_ref', noise_ref_name_prefix: str = 'noise_ref', dereverb_ref_name_prefix: str = 'dereverb_ref', use_reverberant_ref: bool = False, num_spk: int = 1, num_noise_type: int = 1, sample_rate: int = 8000, force_single_channel: bool = False, channel_reordering: bool = False, categories: List | None = None, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, speech_segment: int | None = None, avoid_allzero_segment: bool = True, flexible_numspk: bool = False)
Bases: CommonPreprocessor
Preprocessor for Speech Enhancement (Enh) task.
This class is responsible for processing audio data for speech enhancement tasks. It applies techniques such as room impulse response (RIR) convolution and noise addition to the input audio signals, enabling the training of models to enhance speech quality.
rir_scp
Path to the RIR scp file.
- Type: Optional[str]
rir_apply_prob
Probability of applying RIR.
- Type: float
noise_scp
Path to the noise scp file.
- Type: Optional[str]
noise_apply_prob
Probability of applying noise.
- Type: float
noise_db_range
Range of noise levels in dB.
- Type: str
short_noise_thres
Threshold of short noise.
- Type: float
speech_volume_normalize
Volume normalization factor.
- Type: Optional[float]
speech_name
Key for speech data in input.
- Type: str
speech_ref_name_prefix
Prefix for reference speech keys.
- Type: str
noise_ref_name_prefix
Prefix for noise reference keys.
- Type: str
dereverb_ref_name_prefix
Prefix for dereverberated reference keys.
- Type: str
use_reverberant_ref
Flag to use reverberant reference.
- Type: bool
num_spk
Number of speakers.
- Type: int
num_noise_type
Number of noise types.
- Type: int
sample_rate
Sampling rate for audio processing.
- Type: int
force_single_channel
Flag to convert to single-channel audio.
- Type: bool
channel_reordering
Flag to randomly reorder channels.
- Type: bool
categories
Mapping of categories to unique integers.
- Type: Optional[List]
data_aug_effects
Data augmentation effects to apply.
- Type: Optional[List]
data_aug_num
Number of augmentations to apply.
- Type: List[int]
data_aug_prob
Probability of applying data augmentation.
- Type: float
speech_segment
Length of speech segments for processing.
- Type: Optional[int]
avoid_allzero_segment
Flag to avoid all-zero segments.
- Type: bool
flexible_numspk
Flag to allow variable number of speakers.
- Type: bool
Parameters:
- train (bool) – Whether to use in training mode.
- rir_scp (Optional[str]) – Path to the RIR scp file.
- rir_apply_prob (float) – Probability of applying RIR.
- noise_scp (Optional[str]) – Path to the noise scp file.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB.
- short_noise_thres (float) – Threshold of short noise.
- speech_volume_normalize (Optional[float]) – Volume normalization factor.
- speech_name (str) – Key for speech data in input.
- speech_ref_name_prefix (str) – Prefix for reference speech keys.
- noise_ref_name_prefix (str) – Prefix for noise reference keys.
- dereverb_ref_name_prefix (str) – Prefix for dereverberated reference keys.
- use_reverberant_ref (bool) – Flag to use reverberant reference.
- num_spk (int) – Number of speakers.
- num_noise_type (int) – Number of noise types.
- sample_rate (int) – Sampling rate for audio processing.
- force_single_channel (bool) – Flag to convert to single-channel audio.
- channel_reordering (bool) – Flag to randomly reorder channels.
- categories (Optional[List]) – Mapping of categories to unique integers.
- data_aug_effects (Optional[List]) – Data augmentation effects to apply.
- data_aug_num (List[int]) – Number of augmentations to apply.
- data_aug_prob (float) – Probability of applying data augmentation.
- speech_segment (Optional[int]) – Length of speech segments for processing.
- avoid_allzero_segment (bool) – Flag to avoid all-zero segments.
- flexible_numspk (bool) – Flag to allow variable number of speakers.
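The default `noise_db_range` of `'3_10'` suggests a lower and an upper bound in dB joined by an underscore. The block below is a minimal sketch of how such a string could be parsed. `parse_db_range` is a hypothetical helper written for illustration, not part of the class's API, and treating a single value as a fixed level is an assumption.

```python
from typing import Tuple


def parse_db_range(noise_db_range: str) -> Tuple[float, float]:
    """Parse a 'low_high' string such as '3_10' into (low, high) in dB."""
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        # Assumption: a single value means a fixed level (low == high).
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(
            f"Format error: '{noise_db_range}' (expected e.g. '3_10')"
        )
    return low, high


print(parse_db_range("3_10"))  # (3.0, 10.0)
```

Negative bounds parse the same way, so `"-3_4"` yields `(-3.0, 4.0)` in this sketch.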
Examples
>>> preprocessor = EnhPreprocessor(train=True, rir_scp="path/to/rir.scp",
... noise_scp="path/to/noise.scp")
>>> processed_data = preprocessor(uid="sample_uid", data={"speech_mix": audio_data})
- Raises: ValueError – If any of the input parameters are invalid or inconsistent.
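Besides the mixture stored under `speech_name`, the input dictionary can carry reference signals whose keys are built from the name prefixes. The sketch below lists the keys one might expect for a given configuration. It assumes that indices are 1-based and appended directly to the prefix (for example `speech_ref1`), and that dereverberated references follow the speaker count. `expected_keys` is a hypothetical helper, not part of the class.

```python
from typing import List


def expected_keys(
    num_spk: int = 1,
    num_noise_type: int = 1,
    speech_name: str = "speech_mix",
    speech_ref_name_prefix: str = "speech_ref",
    noise_ref_name_prefix: str = "noise_ref",
    dereverb_ref_name_prefix: str = "dereverb_ref",
) -> List[str]:
    """List the data keys implied by the prefixes (assumed 1-based suffixes)."""
    keys = [speech_name]
    # One reference per speaker: speech_ref1, speech_ref2, ...
    keys += [f"{speech_ref_name_prefix}{i}" for i in range(1, num_spk + 1)]
    # One reference per noise type: noise_ref1, ...
    keys += [f"{noise_ref_name_prefix}{i}" for i in range(1, num_noise_type + 1)]
    # Assumption: one dereverberated reference per speaker.
    keys += [f"{dereverb_ref_name_prefix}{i}" for i in range(1, num_spk + 1)]
    return keys


print(expected_keys(num_spk=2))
```

A dataset need not provide every key in this list; the sketch only shows the naming pattern.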
NOTE
Ensure that the sampling rates of all audio data and RIRs are consistent to avoid processing errors.
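One way to act on this note is to compare the sampling rates of all inputs before building the preprocessor. The following is a minimal sketch under the assumption that the rates have already been read from the audio files by some other means. `check_sample_rates` and the dictionary layout are hypothetical and not part of ESPnet.

```python
from typing import Dict


def check_sample_rates(rates: Dict[str, int], expected: int = 8000) -> None:
    """Raise ValueError if any source deviates from the expected rate (Hz)."""
    mismatched = {name: sr for name, sr in rates.items() if sr != expected}
    if mismatched:
        raise ValueError(
            f"Sampling rate mismatch (expected {expected} Hz): {mismatched}"
        )


# All sources agree with the default sample_rate of 8000, so nothing is raised.
check_sample_rates({"speech_mix": 8000, "rir": 8000, "noise": 8000})
```

The `expected` default mirrors the `sample_rate` default in the signature; pass the value used for the preprocessor when it differs.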