# 📦 ESPnet3 DataOrganizer and Dataset Pipeline
This document provides an overview of the dataset pipeline used in ESPnet3, covering:
- `DataOrganizer`
- `DatasetConfig`
- `CombinedDataset`
- `DatasetWithTransform`
- `ShardedDataset` (extension point)
- ESPnet-specific preprocessor behavior
## 🧭 System Overview
```text
dataset.yaml
    ↓
Hydra (config construction)
    ↓
DataOrganizer
 ├── CombinedDataset (train / valid)
 │     └─ (transform, preprocessor) per dataset
 └── DatasetWithTransform (test)
```

## 📂 DataOrganizer
### Purpose
DataOrganizer constructs and organizes train, validation, and test datasets using Hydra configuration dictionaries. It wraps datasets into unified interfaces for data loading and transformation.
### Behavior
- Train / Valid → combined into a single `CombinedDataset`
- Test → wrapped in individual `DatasetWithTransform` instances
- Each sample flows through: `transform(sample)` → `preprocessor(sample)` or `preprocessor(uid, sample)`
### Automatic Preprocessor Handling
In ESPnet3, the type of preprocessor is automatically inferred:
- If it is an instance of `AbsPreprocessor`, the call will use `(uid, sample)`
- Otherwise, it is called with a single argument: `sample`
This means users do not need to worry about manually providing uid.
```python
# Internally handled:
sample = transform(raw_sample)
sample = preprocessor(sample)       # regular callable
sample = preprocessor(uid, sample)  # if it's an AbsPreprocessor
```

### ESPnet-Specific Note
When training, ESPnet's `CommonCollator` expects `(uid, sample)` pairs. To support this:

```python
organizer.train.use_espnet_collator = True
sample = organizer.train[0]  # Returns (uid, sample)
```

✅ But end users do not need to set this manually. The system handles it internally.
## 🧾 Minimal YAML example
DataOrganizer is typically configured under the dataset section of your stage YAML (e.g., train.yaml, infer.yaml, measure.yaml).
Minimal example that defines train, valid, and test splits:
```yaml
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  train:
    - name: train
      dataset:
        _target_: src.dataset.MyDataset
        split: train
      transform:
        _target_: src.transforms.my_transform
  valid:
    - name: valid
      dataset:
        _target_: src.dataset.MyDataset
        split: valid
  test:
    - name: test
      dataset:
        _target_: src.dataset.MyDataset
        split: test
```

Notes:
- Provide both `train` and `valid` (or omit both). Providing only one of them is invalid.
- `test[*].name` becomes the key used by inference/measurement to select a split (via `config.test_set` / `dataset.test` enumeration).
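For illustration, here is roughly how a downstream stage can resolve splits by name (a hypothetical sketch, assuming the dict-like `test` mapping supports iteration; `cfg` is the loaded stage config):

```python
from hydra.utils import instantiate

organizer = instantiate(cfg.dataset)  # builds the DataOrganizer from the YAML above

for name, test_ds in organizer.test.items():
    # `name` is the `test[*].name` value from the YAML (e.g., "test")
    print(f"split '{name}' has {len(test_ds)} items")
```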
## ⚙️ DatasetConfig
A dataclass representing a single dataset's configuration.
```yaml
- name: dev-clean
  dataset:
    _target_: my_project.datasets.MyDataset
    split: dev-clean
  transform:
    _target_: my_project.transforms.to_upper
```

### Fields
| Field | Description |
|---|---|
| `name` | Dataset name |
| `dataset` | Hydra config for dataset instantiation |
| `transform` | Hydra config for transformation instantiation |
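Conceptually, the dataclass looks roughly like the sketch below (a simplified illustration; the actual definition lives in `espnet3.components.data.data_organizer`):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class DatasetConfig:
    # Simplified sketch of the fields listed above
    name: str                        # split name, e.g., "dev-clean"
    dataset: Any                     # Hydra config for dataset instantiation
    transform: Optional[Any] = None  # Hydra config for the transform, if any
```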
## 🔗 CombinedDataset
Combines multiple datasets into a single `__getitem__`-compatible interface.
### Features
- Applies a `(transform, preprocessor)` pair per dataset (index routing is sketched after this list)
- Supports ESPnet-style UID processing if applicable
- Ensures consistent sample keys across datasets
- Optional sharding support (if all datasets subclass `ShardedDataset`)
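The core of the combination is plain index routing across the underlying datasets. A minimal, illustrative sketch (not the actual `CombinedDataset` implementation):

```python
import bisect

class TinyCombined:
    """Illustrative index routing across datasets (not the real class)."""

    def __init__(self, datasets, transforms):
        self.datasets = datasets
        self.transforms = transforms
        # Cumulative sizes, e.g., [1000, 1500] for datasets of size 1000 and 500
        self.offsets = []
        total = 0
        for ds in datasets:
            total += len(ds)
            self.offsets.append(total)

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        # Locate the dataset that owns `idx`, then localize the index
        ds_idx = bisect.bisect_right(self.offsets, idx)
        local = idx - (self.offsets[ds_idx - 1] if ds_idx > 0 else 0)
        sample = self.datasets[ds_idx][local]
        return self.transforms[ds_idx](sample)
```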
## 🎁 DatasetWithTransform
A lightweight wrapper for applying a single `(transform → preprocessor)` pair to a dataset. Used primarily for test sets.
```python
wrapped = DatasetWithTransform(
    dataset,
    transform,
    preprocessor,
    use_espnet_preprocessor=True,
)
```

## 🧩 ShardedDataset
An abstract class representing sharding capability for distributed training.
```python
from torch.utils.data import Subset

class MyDataset(ShardedDataset):
    def shard(self, idx):
        # Return the subset of samples assigned to shard `idx`,
        # e.g., one shard per data-parallel worker
        return Subset(self, some_index_subset)
```

## ⚙️ Preprocessor Behavior in ESPnet3
### Auto-Type Detection
ESPnet3 automatically determines how to call the preprocessor:
| Type | Call Signature | Use Case |
|---|---|---|
| Regular callable | `preprocessor(sample)` | Custom/simple processing |
| Instance of `AbsPreprocessor` | `preprocessor(uid, sample)` | ESPnet's internal processors |
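In code, the dispatch amounts to roughly the following (an illustrative sketch, not the exact ESPnet3 source; `AbsPreprocessor` lives in `espnet2.train.preprocessor`):

```python
from espnet2.train.preprocessor import AbsPreprocessor

def apply_preprocessor(preprocessor, uid: str, sample: dict) -> dict:
    # ESPnet2 preprocessors take (uid, sample); plain callables take sample only
    if isinstance(preprocessor, AbsPreprocessor):
        return preprocessor(uid, sample)
    return preprocessor(sample)
```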
### Intended Responsibilities
| Component | Role | User Expectations |
|---|---|---|
| Dataset | Load raw data only | Implement `__getitem__` returning a dict |
| Transform | Lightweight online modifications | e.g., normalization, text cleaning |
| Preprocessor | Mostly for ESPnet2's `CommonPreprocessor` | Follows ESPnet2-supported types only |
> 📌 The only officially supported preprocessors are those implemented in `espnet2/train/preprocessor.py`.
## 🧩 Writing a custom DataOrganizer
Most users will not need a custom organizer (the built-in DataOrganizer is config-driven), but if you want to implement your own organizer class, the important part is matching the expectations of the stages that consume it.
### Where it is used (what must work)
Think of this as "what code will run against your organizer".
#### Training / collect_stats
Your organizer is expected to expose train and valid datasets that behave like PyTorch datasets:
```python
organizer = instantiate(cfg.dataset)  # _target_: ...DataOrganizer or your custom class
train_ds = organizer.train
valid_ds = organizer.valid

assert len(train_ds) > 0
sample0 = train_ds[0]  # usually a dict, or (uid, dict) in ESPnet-collator mode
```

If you support ESPnet2-style training that expects `(uid, sample)` pairs, you need a switch that changes the return type:
```python
# Example pattern used by CombinedDataset
train_ds.use_espnet_collator = True
uid, sample = train_ds[0]
assert isinstance(uid, str)
assert isinstance(sample, dict)
```

#### Inference
Inference selects a named test set via config.test_set and expects your organizer to expose a test mapping:
```python
organizer = instantiate(cfg.dataset)
test_name = cfg.test_set             # set by the inference loop for each dataset.test entry
test_ds = organizer.test[test_name]  # dict-like lookup
item = test_ds[0]                    # dict-like dataset item
```

If `organizer.test[test_name]` fails, inference cannot route to the requested split.
### Input/output contracts to keep in mind
#### Dataset item (input)
- Dataset items should be dict-like (a plain `dict` is ideal); see the sketch after this list.
- Keys must be consistent across datasets when you combine them (the default `CombinedDataset` enforces this by checking sample keys).
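A minimal dataset honoring this contract might look like the following (illustrative; `load_manifest` is a hypothetical helper):

```python
import numpy as np

class MyDataset:
    def __init__(self, split: str):
        self.items = load_manifest(split)  # hypothetical: returns [(wav, text), ...]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx) -> dict:
        wav, text = self.items[idx]
        # Same keys for every sample, and across every dataset you combine
        return {"speech": np.asarray(wav, dtype=np.float32), "text": text}
```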
#### Transform and preprocessor (mid-pipeline)
- `transform(sample)` should return a dict (example below).
- `preprocessor(sample)` or `preprocessor(uid, sample)` should return a dict.
- Avoid mutating shared objects in-place unless you know your dataset is not reused across workers/processes.
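For example, a text-cleaning transform that follows these rules (a simple illustration):

```python
def my_transform(sample: dict) -> dict:
    # Copy first: return a new dict instead of mutating the input in place
    out = dict(sample)
    out["text"] = out["text"].lower().strip()
    return out
```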
#### Train-time collator mode (output)
If you support an ESPnet-style collator mode, the dataset output becomes:
`(uid: str, sample: dict)`, where `uid` is stable and unique within the dataset.
### Minimal skeleton (conceptual)
```python
class MyOrganizer:
    def __init__(self, *, train=None, valid=None, test=None, preprocessor=None):
        self.train = build_train_dataset(train, preprocessor=preprocessor)
        self.valid = build_valid_dataset(valid, preprocessor=preprocessor)
        self._test_sets = build_test_sets(test, preprocessor=preprocessor)

    @property
    def test(self):
        return self._test_sets  # dict-like: name -> dataset
```

### If you plan to upstream this (PR guidance)
If you are planning to submit a PR that changes DataOrganizer behavior or adds a new organizer abstraction, implement it while running the existing unit tests as a spec:
`test/espnet3/components/data/test_data_organizer.py`
That test suite is the most concrete description of the current expected behavior (train/valid consistency, test set mapping, key consistency, and collator mode expectations).
### Common pitfalls
- Providing only `train` or only `valid`: training assumes both exist (or neither); the sanity check sketched below catches this.
- Inconsistent sample keys across datasets: collate functions and models will break.
- Missing test name mapping: inference cannot select a test set if `organizer.test[name]` fails.
- Unstable or missing `uttid`: downstream `.scp` writing and measurement alignment become painful.
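A quick way to catch these pitfalls early is a small sanity check against an instantiated organizer (a hedged sketch: it assumes dict samples with collator mode off and a dict-like, iterable `test` mapping):

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.load("train.yaml")  # your stage config
organizer = instantiate(cfg.dataset)

# Consistent sample keys across the combined training data
keys = set(organizer.train[0].keys())
for i in range(min(5, len(organizer.train))):
    assert set(organizer.train[i].keys()) == keys, "inconsistent sample keys"

# Every configured test split must be reachable by name
for name in organizer.test:
    _ = organizer.test[name][0]
```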