# ESPnet3 Data Loading System Documentation
This document provides a comprehensive overview of the dataset system used in ESPnet3, specifically covering:
- DataOrganizer
- DatasetConfig
- CombinedDataset
- DatasetWithTransform
- ShardedDataset (extension point)
- ESPnet-specific preprocessor behavior
## System Overview
```text
dataset.yaml
     │
Hydra (config construction)
     │
DataOrganizer
 ├── CombinedDataset (train / valid)
 │     └── (transform, preprocessor) per dataset
 └── DatasetWithTransform (test)
```

## DataOrganizer
### Purpose
DataOrganizer constructs and organizes train, validation, and test datasets using Hydra configuration dictionaries. It wraps datasets into unified interfaces for data loading and transformation.
### Behavior
- Train / Valid → combined into CombinedDataset
- Test → wrapped in individual DatasetWithTransform instances

Each sample flows through:

1. transform(sample)
2. preprocessor(sample) or preprocessor(uid, sample)
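As a rough usage sketch, the organizer can be indexed per split once it has been constructed from the Hydra-loaded config. The import path and constructor keywords below are assumptions for illustration, not the exact ESPnet3 API.

```python
from omegaconf import OmegaConf

# NOTE: import path and constructor signature are assumptions for this sketch.
from espnet3.data import DataOrganizer

cfg = OmegaConf.load("dataset.yaml")

# Hypothetical construction from per-split lists of dataset configs.
organizer = DataOrganizer(train=cfg.train, valid=cfg.valid, test=cfg.test)

sample = organizer.train[0]  # transform and preprocessor already applied
```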
### Automatic Preprocessor Handling
In ESPnet3, the type of preprocessor is automatically inferred:
- If it is an instance of AbsPreprocessor, the call uses (uid, sample)
- Otherwise, it is called with a single argument: sample
This means users do not need to worry about manually providing uid.
```python
# Internally handled:
sample = transform(raw_sample)
sample = preprocessor(sample)       # or
sample = preprocessor(uid, sample)  # if it's an AbsPreprocessor
```

### ESPnet-Specific Note
When training, ESPnet's CommonCollator expects (uid, sample) pairs. To support this:
```python
organizer.train.use_espnet_collator = True
sample = organizer.train[0]  # Returns (uid, sample)
```

But end users do not need to set this manually. The system handles it internally.
## DatasetConfig
A dataclass representing a single dataset's configuration.
```yaml
- name: dev-clean
  dataset:
    _target_: my_project.datasets.MyDataset
    split: dev-clean
  transform:
    _target_: my_project.transforms.to_upper
```

### Fields
| Field | Description |
|---|---|
| name | Dataset name |
| dataset | Hydra config for dataset instantiation |
| transform | Hydra config for transform instantiation |
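For illustration, an entry like the one above can be turned into live objects with Hydra's instantiate utility, which is roughly what DataOrganizer does internally. The my_project.* targets are placeholders from the example config.

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

# Load the list of dataset entries shown above.
cfg = OmegaConf.load("dataset.yaml")
entry = cfg[0]  # the dev-clean entry

# Hydra builds the dataset from its _target_ and keyword arguments.
dataset = instantiate(entry.dataset)

# Function-style transforms (such as to_upper) typically need `_partial_: true`
# in the config so Hydra returns the callable instead of invoking it here.
transform = instantiate(entry.transform)
```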
## CombinedDataset
Combines multiple datasets into a single __getitem__-compatible interface.
### Features
- Applies a (transform, preprocessor) pair per dataset
- Supports ESPnet-style UID processing if applicable
- Ensures consistent sample keys across datasets
- Optional sharding support (if all datasets subclass ShardedDataset)
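Conceptually, the combined dataset maps a global index to a (dataset, local index) pair and then applies that dataset's own transform and preprocessor. The class below is an illustrative sketch only, not the actual ESPnet3 implementation.

```python
import bisect


class SimpleCombinedDataset:
    """Illustrative only: route a global index to the underlying dataset."""

    def __init__(self, datasets, transforms, preprocessors):
        self.datasets = datasets
        self.transforms = transforms
        self.preprocessors = preprocessors
        # Cumulative lengths, e.g. [1000, 1800, 2500]
        self.cumsum = []
        total = 0
        for d in datasets:
            total += len(d)
            self.cumsum.append(total)

    def __len__(self):
        return self.cumsum[-1]

    def __getitem__(self, idx):
        ds_idx = bisect.bisect_right(self.cumsum, idx)
        local = idx - (self.cumsum[ds_idx - 1] if ds_idx else 0)
        sample = self.datasets[ds_idx][local]
        sample = self.transforms[ds_idx](sample)
        sample = self.preprocessors[ds_idx](sample)  # or (uid, sample) for AbsPreprocessor
        return sample
```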
## DatasetWithTransform
A lightweight wrapper for applying a single (transform → preprocessor) pipeline to a dataset. Used primarily for test sets.
```python
wrapped = DatasetWithTransform(
    dataset,
    transform,
    preprocessor,
    use_espnet_preprocessor=True,
)
```

## ShardedDataset
An abstract class representing sharding capability for distributed training.
```python
from torch.utils.data import Subset


class MyDataset(ShardedDataset):
    def shard(self, idx):
        # Return the samples assigned to shard `idx`
        # (some_index_subset is a placeholder for your own index selection).
        return Subset(self, some_index_subset)
```

## Preprocessor Behavior in ESPnet3
### Auto-Type Detection
ESPnet3 automatically determines how to call the preprocessor:
| Type | Call Signature | Use Case |
|---|---|---|
| Regular callable | preprocessor(sample) | Custom/simple processing |
| Instance of AbsPreprocessor | preprocessor(uid, sample) | ESPnet's internal processors |
No need for users to explicitly handle this distinction: it's handled internally.
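The selection can be pictured roughly as follows. This is a simplified sketch of the dispatch, not the exact ESPnet3 code.

```python
from espnet2.train.preprocessor import AbsPreprocessor


def apply_preprocessor(preprocessor, uid, sample):
    """Simplified sketch of the automatic call-signature selection."""
    if isinstance(preprocessor, AbsPreprocessor):
        # ESPnet2-style preprocessors also expect the utterance ID.
        return preprocessor(uid, sample)
    # Plain callables only receive the sample.
    return preprocessor(sample)
```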
### Intended Responsibilities
| Component | Role | User Expectations |
|---|---|---|
| Dataset | Load raw data only | Implement __getitem__ returning dict |
| Transform | Lightweight online modifications | e.g., normalization, text cleaning |
| Preprocessor | Mostly for ESPnet2's CommonPreprocessor | Follows ESPnet2-supported types only |
Note: the only officially supported preprocessors are those implemented in espnet2/train/preprocessor.py.
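To make the division of labor concrete, here is a minimal, purely illustrative dataset and transform matching the earlier YAML example. The data inside is fabricated placeholder text.

```python
class MyDataset:
    """Load raw data only: __getitem__ returns a plain dict."""

    def __init__(self, split):
        # Illustrative stand-in for real manifest/shard loading.
        self.items = [f"utterance {i} from {split}" for i in range(100)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return {"text": self.items[idx]}


def to_upper(sample):
    """Lightweight online modification (text cleaning)."""
    sample["text"] = sample["text"].upper()
    return sample
```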
## Summary
- Users only need to implement Dataset (data loading) and Transform (modification, optional).
- Preprocessor support is automatic, with UID handling taken care of internally.
