espnet2.train.preprocessor.TSEPreprocessor
class espnet2.train.preprocessor.TSEPreprocessor(train: bool, train_spk2enroll: str | None = None, enroll_segment: int | None = None, load_spk_embedding: bool = False, load_all_speakers: bool = False, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech_mix', speech_ref_name_prefix: str = 'speech_ref', noise_ref_name_prefix: str = 'noise_ref', dereverb_ref_name_prefix: str = 'dereverb_ref', use_reverberant_ref: bool = False, num_spk: int = 1, num_noise_type: int = 1, sample_rate: int = 8000, force_single_channel: bool = False, channel_reordering: bool = False, categories: List | None = None, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, speech_segment: int | None = None, avoid_allzero_segment: bool = True, flexible_numspk: bool = False)
Bases: EnhPreprocessor
Preprocessor for Target Speaker Extraction.
This class processes audio data for the target speaker extraction task. It handles enrollment audio, applies noise and reverberation effects, and manages speaker embeddings based on the provided configuration. The preprocessor is designed to work in both training and evaluation modes.
train
Indicates whether the preprocessor is in training mode.
- Type: bool
train_spk2enroll
Path to the speaker-to-enrollment mapping.
- Type: Optional[str]
enroll_segment
Length of the enrollment audio segment.
- Type: Optional[int]
load_spk_embedding
Flag to load speaker embeddings instead of enrollment audios.
- Type: bool
load_all_speakers
Flag to load all speakers in each mixture sample.
- Type: bool
rir_scp
Path to the RIR (Room Impulse Response) scp file.
- Type: Optional[str]
rir_apply_prob
Probability of applying RIR effects.
- Type: float
noise_scp
Path to the noise scp file.
- Type: Optional[str]
noise_apply_prob
Probability of applying noise.
- Type: float
noise_db_range
Range of noise levels in dB, given as a "low_high" string (for example '3_10').
- Type: str
short_noise_thres
Threshold for short noise segments.
- Type: float
speech_volume_normalize
Factor to normalize speech volume.
- Type: Optional[float]
speech_name
Key for accessing the speech data.
- Type: str
speech_ref_name_prefix
Prefix for accessing reference speech data.
- Type: str
noise_ref_name_prefix
Prefix for accessing noise reference data.
- Type: str
dereverb_ref_name_prefix
Prefix for accessing dereverberated reference data.
- Type: str
use_reverberant_ref
Flag to use reverberant reference signals.
- Type: bool
num_spk
Number of speakers involved in the extraction.
- Type: int
num_noise_type
Number of types of noise.
- Type: int
sample_rate
Sampling rate for the audio.
- Type: int
force_single_channel
Flag to force single-channel audio.
- Type: bool
channel_reordering
Flag to reorder audio channels.
- Type: bool
categories
List of categories for classification.
- Type: Optional[List]
data_aug_effects
List of data augmentation effects.
- Type: Optional[List]
data_aug_num
Number of augmentations to apply.
- Type: List[int]
data_aug_prob
Probability of applying data augmentation.
- Type: float
speech_segment
Length of speech segments for processing.
- Type: Optional[int]
avoid_allzero_segment
Flag to avoid all-zero audio segments.
- Type: bool
flexible_numspk
Flag to allow variable number of speakers.
- Type: bool
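The interaction between noise_db_range and noise_apply_prob can be illustrated with a minimal sketch. It assumes the range string encodes "low_high" values in dB (as the default '3_10' suggests) and that noise is applied when a uniform random draw falls below noise_apply_prob. The helper names parse_db_range and maybe_sample_noise_db are hypothetical and are not part of the ESPnet API.

```python
import random
from typing import Optional, Tuple


def parse_db_range(noise_db_range: str) -> Tuple[float, float]:
    """Split a 'low_high' (or single-value) string into a (low, high) pair in dB.

    Hypothetical helper: the parsing convention is an assumption based on
    the default value '3_10'.
    """
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"Expected 'low_high' format, got: {noise_db_range!r}")
    return low, high


def maybe_sample_noise_db(
    noise_db_range: str, noise_apply_prob: float, rng: random.Random
) -> Optional[float]:
    """Return a noise level in dB, or None when noise is skipped for this sample."""
    # rng.random() lies in [0, 1), so a probability of 1.0 always applies noise
    # and a probability of 0.0 never does.
    if rng.random() >= noise_apply_prob:
        return None
    low, high = parse_db_range(noise_db_range)
    return rng.uniform(low, high)
```

For example, maybe_sample_noise_db("3_10", 1.0, random.Random(0)) always returns a value between 3 and 10, while a probability of 0.0 always returns None.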
Parameters:
- train (bool) – Whether to run in training mode.
- train_spk2enroll (Optional[str]) – Path to the speaker-to-enrollment mapping.
- enroll_segment (Optional[int]) – Length of the enrollment audio segment.
- load_spk_embedding (bool) – Flag to load speaker embeddings instead of enrollment audios.
- load_all_speakers (bool) – Flag to load all speakers in each mixture sample.
- rir_scp (Optional[str]) – Path to the RIR scp file.
- rir_apply_prob (float) – Probability of applying RIR effects.
- noise_scp (Optional[str]) – Path to the noise scp file.
- noise_apply_prob (float) – Probability of applying noise.
- noise_db_range (str) – Range of noise levels in dB, given as a "low_high" string (for example '3_10').
- short_noise_thres (float) – Threshold for short noise segments.
- speech_volume_normalize (Optional[float]) – Factor to normalize speech volume.
- speech_name (str) – Key for accessing the speech data.
- speech_ref_name_prefix (str) – Prefix for accessing reference speech data.
- noise_ref_name_prefix (str) – Prefix for accessing noise reference data.
- dereverb_ref_name_prefix (str) – Prefix for accessing dereverberated reference data.
- use_reverberant_ref (bool) – Flag to use reverberant reference signals.
- num_spk (int) – Number of speakers involved in the extraction.
- num_noise_type (int) – Number of types of noise.
- sample_rate (int) – Sampling rate for the audio.
- force_single_channel (bool) – Flag to force single-channel audio.
- channel_reordering (bool) – Flag to reorder audio channels.
- categories (Optional[List]) – List of categories for classification.
- data_aug_effects (Optional[List]) – List of data augmentation effects.
- data_aug_num (List[int]) – Number of augmentations to apply.
- data_aug_prob (float) – Probability of applying data augmentation.
- speech_segment (Optional[int]) – Length of speech segments for processing.
- avoid_allzero_segment (bool) – Flag to avoid all-zero audio segments.
- flexible_numspk (bool) – Flag to allow variable number of speakers.
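To make the role of enroll_segment concrete, here is a minimal sketch of cropping an enrollment signal to a fixed length. It assumes that the length is counted in samples, that cropping happens only in training mode, and that the start offset is chosen uniformly at random. None of these details are confirmed by this page, and crop_enrollment is a hypothetical helper, not an ESPnet function.

```python
import random
from typing import Optional, Sequence


def crop_enrollment(
    audio: Sequence[float],
    enroll_segment: Optional[int],
    train: bool,
    rng: random.Random,
) -> Sequence[float]:
    """Return a random fixed-length window of the enrollment audio.

    Assumptions (not taken from the ESPnet source): the audio is returned
    unchanged outside training mode, when enroll_segment is None, or when
    the audio is already no longer than enroll_segment samples.
    """
    if not train or enroll_segment is None or len(audio) <= enroll_segment:
        return audio
    # Pick a start index so that the window fits entirely inside the audio.
    start = rng.randint(0, len(audio) - enroll_segment)
    return audio[start:start + enroll_segment]
```

Because the function only relies on len() and slicing, it works equally on a Python list or a one-dimensional NumPy array.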
Examples
Initialize the preprocessor for training mode:
preprocessor = TSEPreprocessor(
    train=True,
    train_spk2enroll="path/to/spk2enroll.json",
    enroll_segment=3000,
    load_spk_embedding=False,
    load_all_speakers=True,
    rir_scp="path/to/rir.scp",
    noise_scp="path/to/noise.scp",
)
Process an audio sample:
processed_data = preprocessor(uid="sample_id", data={"speech_mix": audio_data})
Raises:
- ValueError – If the input data is not in the expected format.