espnet2.train.preprocessor.TSEPreprocessor

class espnet2.train.preprocessor.TSEPreprocessor(train: bool, train_spk2enroll: str | None = None, enroll_segment: int | None = None, load_spk_embedding: bool = False, load_all_speakers: bool = False, rir_scp: str | None = None, rir_apply_prob: float = 1.0, noise_scp: str | None = None, noise_apply_prob: float = 1.0, noise_db_range: str = '3_10', short_noise_thres: float = 0.5, speech_volume_normalize: float | None = None, speech_name: str = 'speech_mix', speech_ref_name_prefix: str = 'speech_ref', noise_ref_name_prefix: str = 'noise_ref', dereverb_ref_name_prefix: str = 'dereverb_ref', use_reverberant_ref: bool = False, num_spk: int = 1, num_noise_type: int = 1, sample_rate: int = 8000, force_single_channel: bool = False, channel_reordering: bool = False, categories: List | None = None, data_aug_effects: List | None = None, data_aug_num: List[int] = [1, 1], data_aug_prob: float = 0.0, speech_segment: int | None = None, avoid_allzero_segment: bool = True, flexible_numspk: bool = False)

Bases: EnhPreprocessor

Preprocessor for Target Speaker Extraction.

This class processes audio data for the target speaker extraction task. It handles enrollment audio, applies noise and reverberation effects, and manages speaker embeddings based on the provided configuration. The preprocessor is designed to work in both training and evaluation modes.

train

Indicates whether the preprocessor is in training mode.

Type: bool

train_spk2enroll

Path to the speaker-to-enrollment mapping.

Type: Optional[str]

enroll_segment

Length of the enrollment audio segment.

Type: Optional[int]
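To make the role of enroll_segment concrete, here is a minimal sketch of cropping a fixed-length window from an enrollment utterance. The helper crop_enrollment is hypothetical and not part of ESPnet. It assumes the length is counted in samples and that the window position is drawn at random. How the actual preprocessor treats utterances shorter than the segment (for example, by padding) is not stated here, so the sketch simply returns them unchanged.

```python
import numpy as np


def crop_enrollment(audio, enroll_segment, rng):
    """Hypothetical helper: pick a random window of `enroll_segment` samples."""
    if enroll_segment is None or len(audio) <= enroll_segment:
        # No cropping requested, or the utterance is already short enough.
        return audio
    # Draw a start index so that the window fits inside the utterance.
    start = int(rng.integers(0, len(audio) - enroll_segment + 1))
    return audio[start:start + enroll_segment]


rng = np.random.default_rng(0)
enroll = np.arange(8000, dtype=np.float32)  # stand-in for 1 s of audio at 8 kHz
segment = crop_enrollment(enroll, 3000, rng)
```

With the default sample_rate of 8000, a value of 3000 would correspond to 0.375 s of enrollment audio under this assumption.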

load_spk_embedding

Flag to load speaker embeddings instead of enrollment audios.

Type: bool

load_all_speakers

Flag to load all speakers in each mixture sample.

Type: bool

rir_scp

Path to the RIR (Room Impulse Response) scp file.

Type: Optional[str]

rir_apply_prob

Probability of applying RIR effects.

Type: float

noise_scp

Path to the noise scp file.

Type: Optional[str]

noise_apply_prob

Probability of applying noise.

Type: float
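The two probability attributes, rir_apply_prob and noise_apply_prob, can be read as per-sample gates. The sketch below shows one way such a gate might work. The helper should_apply is hypothetical, and the assumption that augmentation is decided independently for each sample is not confirmed by this page.

```python
import random


def should_apply(prob, rng):
    """Hypothetical helper: return True with probability `prob`."""
    # rng.random() lies in [0, 1), so prob=1.0 always applies
    # and prob=0.0 never does.
    return rng.random() < prob


rng = random.Random(0)
draws = [should_apply(0.3, rng) for _ in range(10_000)]
applied_fraction = sum(draws) / len(draws)  # close to 0.3
```

With the default value of 1.0, every training sample would receive the effect whenever the corresponding scp file is provided.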

noise_db_range

Range of noise levels in dB.

Type: str
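Because noise_db_range is a string such as '3_10', it has to be split into numeric bounds before use. The following sketch shows one plausible way to do that. The function parse_db_range is hypothetical, and both the underscore-separated "low_high" reading and the uniform sampling between the bounds are assumptions based on the default value, not documented behavior.

```python
import random


def parse_db_range(noise_db_range):
    """Hypothetical helper: turn 'low_high' (or a single value) into two floats."""
    parts = noise_db_range.split("_")
    if len(parts) == 1:
        low = high = float(parts[0])
    elif len(parts) == 2:
        low, high = float(parts[0]), float(parts[1])
    else:
        raise ValueError(f"Expected 'low_high' or a single value: {noise_db_range}")
    return low, high


low, high = parse_db_range("3_10")
# Assumed usage: draw one noise level per sample from the parsed range.
noise_db = random.Random(0).uniform(low, high)
```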

short_noise_thres

Threshold for short noise segments.

Type: float

speech_volume_normalize

Factor to normalize speech volume.

Type: Optional[float]
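One common interpretation of a volume normalization factor is peak normalization: the signal is rescaled so that its largest absolute sample equals the given value. The sketch below illustrates that interpretation only. The helper normalize_volume is hypothetical, and whether the preprocessor scales the mixture and its references jointly is not covered here.

```python
import numpy as np


def normalize_volume(speech, target_peak):
    """Hypothetical helper: rescale so that max(|speech|) equals `target_peak`."""
    if target_peak is None:
        return speech  # normalization disabled
    peak = np.max(np.abs(speech))
    if peak == 0:
        return speech  # avoid dividing by zero on silent input
    return speech * target_peak / peak


speech = np.array([0.1, -0.5, 0.25])
normalized = normalize_volume(speech, 1.0)  # peak is now 1.0
```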

speech_name

Key for accessing the speech data.

Type: str

speech_ref_name_prefix

Prefix for accessing reference speech data.

Type: str

noise_ref_name_prefix

Prefix for accessing noise reference data.

Type: str

dereverb_ref_name_prefix

Prefix for accessing dereverberated reference data.

Type: str

use_reverberant_ref

Flag to use reverberant reference signals.

Type: bool

num_spk

Number of speakers involved in the extraction.

Type: int
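The name prefixes and num_spk together determine which keys are looked up in each sample dictionary. As a rough illustration, the sketch below builds per-speaker key names by appending a 1-based index to a prefix. The helper reference_keys is hypothetical, and the 1-based suffix (for example, speech_ref1) is an assumption about the naming convention.

```python
def reference_keys(prefix, num_spk):
    """Hypothetical helper: build per-speaker keys such as 'speech_ref1'."""
    return [f"{prefix}{i + 1}" for i in range(num_spk)]


speech_keys = reference_keys("speech_ref", 2)
dereverb_keys = reference_keys("dereverb_ref", 2)
```

Under this convention, a two-speaker sample would carry speech_ref1 and speech_ref2 alongside the mixture stored under speech_name.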

num_noise_type

Number of types of noise.

Type: int

sample_rate

Sampling rate for the audio.

Type: int

force_single_channel

Flag to force single-channel audio.

Type: bool
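Forcing single-channel audio usually means keeping one channel of a multichannel array. The sketch below keeps the first channel. The helper to_single_channel is hypothetical, and both the (time, channels) array layout and the choice of the first channel are assumptions.

```python
import numpy as np


def to_single_channel(audio):
    """Hypothetical helper: keep only the first channel of (time, channels) audio."""
    # One-dimensional input is already single-channel and is returned as-is.
    return audio if audio.ndim == 1 else audio[:, 0]


stereo = np.stack([np.ones(4), np.zeros(4)], axis=1)  # shape (4, 2)
mono = to_single_channel(stereo)                       # shape (4,)
```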

channel_reordering

Flag to reorder audio channels.

Type: bool

Examples

Initialize the preprocessor in training mode:

preprocessor = TSEPreprocessor(
    train=True,
    train_spk2enroll="path/to/spk2enroll.json",
    enroll_segment=3000,
    load_spk_embedding=False,
    load_all_speakers=True,
    rir_scp="path/to/rir.scp",
    noise_scp="path/to/noise.scp",
)

Process an audio sample:

processed_data = preprocessor(uid="sample_id", data={"speech_mix": audio_data})

Raises: ValueError – If the input data is not in the expected format.