Data pipeline
Data pipeline
The data pipeline changes a lot when moving from ESPnet2 to ESPnet3.
In ESPnet2, data preparation is usually centered on:
local/data.sh- one or more numbered shell stages in
asr.sh - Kaldi-style files under
data/<split>/
In ESPnet3, the center of gravity moves into Python:
dataset/builder.pydataset/dataset.pydataset:entries in YAMLDataOrganizerandDataLoaderBuilder
The main mental shift
In ESPnet2, data flow is often:
- shell downloads or copies data
- shell rewrites metadata
- shell creates
wav.scp,text,utt2spk, and friends - ESPnet reads those prepared files later
In ESPnet3, data flow is often:
builder.pychecks and prepares source databuilder.pybuilds manifests or converted assetsdataset.pyexposes a normal PyTorch dataset- YAML points
DataOrganizerat that dataset
So the migration is not only "rewrite shell in Python". It is also "make the dataset a real Python object".
ESPnet3: builder.py owns preparation
In ESPnet3, the closest replacement is dataset/builder.py.
Typical layout:
egs3/<recipe>/<task>/dataset/
__init__.py
builder.py
dataset.py
config.yamlThe builder lifecycle is:
is_source_prepared()prepare_source()is_built()build()
This is the main replacement for old shell-side data stages.
How to map old shell steps
| ESPnet2 | ESPnet3 |
|---|---|
local/data.sh download stage | prepare_source() |
| shell extraction / copy step | prepare_source() |
| shell manifest generation | build() |
| shell train/dev split logic | build() |
| shell validation checks | is_source_prepared() / is_built() |
Example: offline preparation in ESPnet3
MiniAn4Builder is a good small example.
Its build() method:
- reads transcripts
- converts SPH to WAV
- splits train into train/valid
- writes TSV manifests
The shape is roughly:
from pathlib import Path
from espnet3.components.data.dataset_builder import DatasetBuilder
class MiniAn4Builder(DatasetBuilder):
def is_source_prepared(self, recipe_dir: str | Path, **kwargs) -> bool:
source_dir = Path(recipe_dir) / "source"
return (source_dir / "an4").is_dir()
def prepare_source(self, recipe_dir: str | Path, **kwargs) -> None:
source_dir = Path(recipe_dir) / "source"
source_dir.mkdir(parents=True, exist_ok=True)
extract_archive(source_dir / "an4_raw.bigendian.tar.gz", source_dir)
def is_built(self, recipe_dir: str | Path, **kwargs) -> bool:
data_dir = Path(recipe_dir) / "data"
return all(
(data_dir / split / "wav_text.tsv").is_file()
for split in ("train", "valid", "test")
)
def build(self, recipe_dir: str | Path, **kwargs) -> None:
source_dir = Path(recipe_dir) / "source" / "an4"
data_dir = Path(recipe_dir) / "data"
rows = read_transcripts(source_dir)
rows = convert_sph_to_wav(rows, data_dir / "wav")
splits = split_train_valid_test(rows)
for split, split_rows in splits.items():
write_manifest(data_dir / split / "wav_text.tsv", split_rows)is_source_prepared() and is_built() should be cheap filesystem checks. prepare_source() and build() do the actual work.
That is exactly the kind of work that would often live in local/data.sh in ESPnet2.
So if your old ESPnet2 recipe had a shell stage that:
- creates files once
- writes reusable manifests
- rewrites dataset layout on disk
the ESPnet3 home for that logic is usually builder.py.
Offline augmentation
ESPnet2 often handled augmentation as an offline preparation step.
Typical examples are:
- speed perturbation stages
- audio rewriting stages
- text normalization written back to disk
- extra derived manifest files
In ESPnet3, the same kind of offline augmentation can still exist.
The natural place is builder.py, especially in build().
Use this when the augmentation:
- should happen once
- should be cached on disk
- changes the reusable dataset assets themselves
Examples:
- resampling all audio files
- writing augmented manifests
- creating a derived dataset directory
- generating pseudo labels once
Online augmentation is easier in ESPnet3
ESPnet3 also makes online augmentation much easier.
You can implement it directly in dataset.py.
That means augmentation can happen:
- when
__getitem__()is called - without rewriting files on disk
- with ordinary PyTorch logic
This is a big practical difference from ESPnet2.
Typical online augmentation locations
In ESPnet3, online augmentation can live in:
- the dataset itself
- a per-entry
transform - a shared
preprocessor
That gives you three useful layers:
- dataset-local augmentation in
dataset.py - split-specific transform in YAML
- shared preprocessor in
DataOrganizer
Example: dataset-local online augmentation
Conceptually:
class MyDataset(torch.utils.data.Dataset):
def __getitem__(self, idx):
sample = load_sample(idx)
sample["speech"] = random_gain(sample["speech"])
return sampleThis is often the cleanest replacement for shell stages that existed only to prepare one slightly modified variant of the same data.
Dataset does not need to be an ESPnet-specific dataset anymore
This is another major change.
In ESPnet2, users often think in terms of ESPnetDataset and task-driven input constraints.
In ESPnet3, recipe-local datasets can be ordinary torch.utils.data.Dataset classes.
That means:
- custom indexing logic is easy
- custom in-memory caching is easy
- custom augmentation is easy
- custom sample dictionaries are easy
You are no longer forced into a special dataset abstraction just to plug into the recipe.
What dataset.py usually returns
ESPnet3 datasets usually return a Python dictionary per sample.
For example:
{
"speech": ...,
"text": ...,
}The exact keys depend on the rest of the pipeline.
Return-value constraints: ESPnet2 model path vs custom model path
This is an important migration detail.
If you still use an ESPnet2 task/model
If training.yaml sets task and reuses the ESPnet2 task bridge, then the data still needs to satisfy the expectations of that task/model pair.
So in that case, the old ESPnet2-style constraints still matter:
- expected key names
- expected tensor shapes
- expected label fields
In other words, the dataset becomes more flexible, but the model side still imposes structure.
If you use a custom ESPnet3 model
If you instantiate your own model directly, the rule is looser.
Then the practical requirement is:
- the dataset must return whatever your model call path consumes
For training, that usually means the batch produced by the dataloader and collate function must match what the model or Lightning wrapper expects.
For inference, it means input_key and output_fn must line up with the dataset and model.
So the migration rule is:
- ESPnet2 task path -> keep task-side conventions
- custom model path -> define your own data contract clearly
DataOrganizer replaces many old recipe-local conventions
DataOrganizer is the config-driven wrapper that turns dataset references into:
trainvalid- named
testsets
Each entry can specify:
data_srcdata_src_argstransformname
That means the old shell convention of "this directory name implies this split" is replaced by explicit config.
Split-specific transforms
ESPnet3 can attach transforms directly in the dataset config.
Conceptually:
dataset:
train:
- data_src_args:
split: train
transform:
_target_: src.transforms.augment_training_sampleThis is another place where online augmentation can live without shell stages.
Collate function support is still strong
If you came from ESPnet2, you may worry that a plain PyTorch dataset means you lose ESPnet conveniences.
You do not.
ESPnet3 can still use normal ESPnet collate functions such as:
espnet2.train.collate_fn.CommonCollateFn
So even if your dataset is a normal torch.utils.data.Dataset, you can still get:
- automatic padding
- length calculation
- batch formatting that works well with ESPnet2-style models
This is one of the nicest parts of the migration story:
- flexible dataset implementation
- still compatible with useful ESPnet batching logic
How to think about collate now
Use this split:
- dataset: sample-level logic
- collate_fn: batch-level padding and formatting
That is cleaner than pushing every data concern into shell-generated files.
A practical migration rule
When reading an old ESPnet2 data pipeline, separate it into:
- one-time source preparation
- one-time artifact building
- per-sample loading
- per-sample augmentation
- batch formatting
Then map them like this:
| Old concern | New home |
|---|---|
| download / extract | prepare_source() |
| manifest generation | build() |
| sample loading | dataset.py |
| online augmentation | dataset.py or transform |
| batch padding / lengths | collate_fn |
When to keep augmentation offline
Prefer offline augmentation in build() when:
- the result should be reused across runs
- the transform is expensive
- the output should become a stable dataset artifact
Prefer online augmentation in dataset.py or transform when:
- randomness is useful every epoch
- you do not want extra disk copies
- the transform is naturally sample-local
Related pages
Datasets
Read the high-level dataset internals and recipe-local module layout.
Dataset references
See how dataset modules, builders, and references are resolved.
DataOrganizer
See how train, valid, and named test sets are assembled from config.
Dataloader
See how collate, iter_factory, and batching are constructed.
Dataset Config
See the YAML dataset entry format used by training and inference.
Cluster and parallel
See how heavy preparation loops can move from shell to provider/runner code.
