ESPnet3 Dataset References And Builders
This page is the main reference for how ESPnet3 resolves datasets from YAML and how the create_dataset stage interacts with recipe dataset modules.
Key implementations:
- espnet3.components.data.dataset_module
- espnet3.components.data.dataset_builder.DatasetBuilder
- espnet3.systems.base.system.BaseSystem.create_dataset
Dataset resolution from YAML
Each dataset entry is a small reference object. The important keys are:
- data_src
- data_src_args
Example:

dataset:
  recipe_dir: ${recipe_dir}
  train:
    - data_src: mini_an4/asr
      data_src_args:
        split: train

You usually do not need to write _target_ here. dataset already defaults to DataOrganizer.
ESPnet3 parses that entry and then instantiates:

Dataset(**data_src_args)

from the resolved dataset module.
The three supported dataset reference forms
- Dataset tag

  data_src: mini_an4/asr

  This resolves to:

  egs3.mini_an4.asr.dataset

  Tags are convenient for reusing another recipe's dataset module.

- Explicit module path

  data_src: egs3.mini_an4.asr.dataset

  This imports the given module directly.

- Omit data_src

  dataset:
    recipe_dir: ${recipe_dir}
    train:
      - data_src_args:
          split: train

  If data_src is omitted, ESPnet3 loads:

  ${recipe_dir}/dataset/__init__.py

  This is the normal pattern for recipe-local datasets.
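The mapping from the three reference forms to a module location can be sketched as a small function. This is an illustration only, not the resolver in espnet3.components.data.dataset_module: the name resolve_data_src and the rule that a "/" marks a tag are assumptions made for this example.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_data_src(data_src: Optional[str], recipe_dir: str) -> str:
    """Return a module path or file path for a dataset reference (hypothetical helper)."""
    if data_src is None:
        # Form 3: an omitted data_src falls back to the recipe-local module file.
        return str(PurePosixPath(recipe_dir) / "dataset" / "__init__.py")
    if "/" in data_src:
        # Form 1: a tag such as "mini_an4/asr" (assumed here to be marked by "/").
        return "egs3." + data_src.replace("/", ".") + ".dataset"
    # Form 2: an explicit dotted module path is used unchanged.
    return data_src


for reference in ("mini_an4/asr", "egs3.mini_an4.asr.dataset", None):
    print(reference, "->", resolve_data_src(reference, "/path/to/recipe"))
```

The tag and the explicit module path in this example resolve to the same module, which is why tags are convenient shorthand for another recipe's dataset.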
What data_src_args means
data_src_args is passed directly to the exported Dataset class.
Example:

data_src_args:
  split: test-clean
  recipe_dir: ${recipe_dir}
  source_dir: ${dataset_dir}

becomes:

Dataset(split="test-clean", recipe_dir=recipe_dir, source_dir=dataset_dir)

Top-level dataset entry fields such as:

- name
- transform

are not part of Dataset.__init__. They are handled at the organizer level and are never passed to the dataset constructor.
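A minimal sketch of that separation, assuming a dataset entry has already been loaded into a plain dict. The helper split_entry and the example path are hypothetical and are not part of ESPnet3.

```python
from typing import Any, Dict, Tuple


def split_entry(entry: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate Dataset constructor kwargs from organizer-level fields."""
    # Only the contents of data_src_args reach Dataset.__init__.
    dataset_kwargs = dict(entry.get("data_src_args", {}))
    # Everything else (for example name, transform) stays with the organizer.
    organizer_fields = {
        key: value
        for key, value in entry.items()
        if key not in ("data_src", "data_src_args")
    }
    return dataset_kwargs, organizer_fields


entry = {
    "name": "test",
    "data_src_args": {"split": "test-clean", "source_dir": "/data/librispeech"},
}
dataset_kwargs, organizer_fields = split_entry(entry)
print(dataset_kwargs)    # passed as Dataset(**dataset_kwargs)
print(organizer_fields)  # kept at the organizer level
```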
Expected dataset module structure
A dataset module should export:

from .builder import MyBuilder as DatasetBuilder
from .dataset import MyDataset as Dataset

This lets the same module support:
- normal dataset instantiation through Dataset
- create_dataset through DatasetBuilder
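One way to sanity-check a dataset module against this contract is a small helper that reports missing exports. The function missing_exports below is a hypothetical utility written for this page, not an ESPnet3 API. It checks only the two exported names and the four builder lifecycle methods described on this page.

```python
import types

# Lifecycle methods that the create_dataset stage calls on a builder.
REQUIRED_BUILDER_METHODS = ("is_source_prepared", "prepare_source", "is_built", "build")


def missing_exports(module) -> list:
    """List the names a dataset module is expected to export but does not."""
    missing = [name for name in ("Dataset", "DatasetBuilder") if not hasattr(module, name)]
    builder = getattr(module, "DatasetBuilder", None)
    if builder is not None:
        missing += [
            f"DatasetBuilder.{method}"
            for method in REQUIRED_BUILDER_METHODS
            if not callable(getattr(builder, method, None))
        ]
    return missing


# A stand-in module that exports Dataset but forgets DatasetBuilder.
incomplete = types.ModuleType("incomplete_dataset_module")
incomplete.Dataset = object
print(missing_exports(incomplete))  # ['DatasetBuilder']
```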
create_dataset call flow
BaseSystem.create_dataset() loops over all dataset entries in:
- dataset.train
- dataset.valid
- dataset.test
For each unique dataset source, it instantiates DatasetBuilder() and runs:

if not builder.is_source_prepared(**builder_kwargs):
    builder.prepare_source(**builder_kwargs)
if not builder.is_built(**builder_kwargs):
    builder.build(**builder_kwargs)

The builder_kwargs come from:

create_dataset:
  ...

in training.yaml.
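The call flow above can be approximated in a self-contained form. This is a sketch of the described behavior, not the body of BaseSystem.create_dataset: run_create_dataset, load_builder_class, and RecordingBuilder are names invented for the example, and the config is a plain dict rather than a loaded YAML object.

```python
class RecordingBuilder:
    """Stand-in builder that records which lifecycle methods were called."""

    calls = []

    def __init__(self):
        self.source_ready = False
        self.built = False

    def is_source_prepared(self, **kwargs):
        return self.source_ready

    def prepare_source(self, **kwargs):
        RecordingBuilder.calls.append("prepare_source")
        self.source_ready = True

    def is_built(self, **kwargs):
        return self.built

    def build(self, **kwargs):
        RecordingBuilder.calls.append("build")
        self.built = True


def run_create_dataset(dataset_cfg, load_builder_class, builder_kwargs):
    """Run the builder lifecycle once per unique dataset source."""
    seen_sources = set()
    for split in ("train", "valid", "test"):
        for entry in dataset_cfg.get(split, []):
            source = entry.get("data_src")  # None stands for the recipe-local module
            if source in seen_sources:
                continue
            seen_sources.add(source)
            builder = load_builder_class(source)()
            if not builder.is_source_prepared(**builder_kwargs):
                builder.prepare_source(**builder_kwargs)
            if not builder.is_built(**builder_kwargs):
                builder.build(**builder_kwargs)


dataset_cfg = {
    "train": [{"data_src_args": {"split": "train"}}],
    "valid": [{"data_src_args": {"split": "valid"}}],
    "test": [{"name": "test", "data_src_args": {"split": "test"}}],
}
run_create_dataset(dataset_cfg, lambda source: RecordingBuilder, {"recipe_dir": "."})
print(RecordingBuilder.calls)
```

All three entries share the same recipe-local source, so the builder lifecycle runs only once even though three splits are listed.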
For a task-oriented walkthrough, start with the custom dataset guide. For lifecycle diagrams and parallel preparation patterns, use the lifecycle pages below.
Custom Dataset Guide
See the practical recipe-local Dataset and DatasetBuilder workflow.
Data Pipeline Migration
See how local/data.sh maps to builder.py, dataset.py, and config.
Parallel Data Preparation
See the DatasetBuilder lifecycle diagram and parallel download pattern.
Builder lifecycle in detail
is_source_prepared()
Cheap check for raw source availability.
Examples:
- extracted corpus exists
- source directory exists
- archive was already unpacked
Returns:
- True: raw source is available
- False: prepare_source() should run
prepare_source()
Makes raw source available.
Typical work:
- download archives
- extract corpora
- normalize incoming directory layout
- verify that required source directories exist
is_built()
Cheap check for task-ready artifacts.
Examples:
- manifest TSVs already exist
- precomputed files are already written
- for raw-directory-backed recipes, this may simply mirror source readiness
Returns:
- True: build() can be skipped
- False: build() should run
build()
Creates the files consumed by the dataset class or later stages.
Typical work:
- manifest generation
- audio conversion
- metadata generation
- derived split preparation
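The four lifecycle methods can be illustrated with a toy builder that works on a temporary directory. This is a sketch under stated assumptions: ToyBuilder, the raw/ and manifest/ directory names, and the TSV layout are all hypothetical, and the class does not inherit from the real DatasetBuilder base class.

```python
import tempfile
from pathlib import Path


class ToyBuilder:
    """Hypothetical builder: raw text files in, one manifest TSV out."""

    def is_source_prepared(self, recipe_dir, **kwargs):
        # Cheap check: does the raw source directory exist?
        return (Path(recipe_dir) / "raw").is_dir()

    def prepare_source(self, recipe_dir, **kwargs):
        # Stands in for downloading and extracting a corpus.
        raw = Path(recipe_dir) / "raw"
        raw.mkdir(parents=True, exist_ok=True)
        (raw / "utt1.txt").write_text("hello world\n")

    def is_built(self, recipe_dir, **kwargs):
        # Cheap check: does the task-ready manifest exist?
        return (Path(recipe_dir) / "manifest" / "train.tsv").is_file()

    def build(self, recipe_dir, **kwargs):
        # Writes one "<utterance id>\t<text>" row per raw file.
        manifest = Path(recipe_dir) / "manifest"
        manifest.mkdir(parents=True, exist_ok=True)
        raw_files = sorted((Path(recipe_dir) / "raw").glob("*.txt"))
        rows = [f"{p.stem}\t{p.read_text().strip()}" for p in raw_files]
        (manifest / "train.tsv").write_text("\n".join(rows) + "\n")


with tempfile.TemporaryDirectory() as recipe_dir:
    builder = ToyBuilder()
    for _ in range(2):  # the second pass finds nothing left to do
        if not builder.is_source_prepared(recipe_dir=recipe_dir):
            builder.prepare_source(recipe_dir=recipe_dir)
        if not builder.is_built(recipe_dir=recipe_dir):
            builder.build(recipe_dir=recipe_dir)
    manifest_text = (Path(recipe_dir) / "manifest" / "train.tsv").read_text()
```

Because both checks only test for the existence of a path, rerunning the stage is cheap and leaves the outputs unchanged.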
Example 1: mini_an4
Local mode config:

dataset:
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train
  valid:
    - data_src_args:
        split: valid
  test:
    - name: test
      data_src_args:
        split: test
create_dataset:
  recipe_dir: ${recipe_dir}

What happens:
- local dataset module ${recipe_dir}/dataset/__init__.py is loaded
- MiniAn4Builder is instantiated
- prepare_source() extracts AN4 if needed
- build() converts audio and writes manifest TSVs
- MiniAn4Dataset(split=...) reads those manifest files
Example 2: librispeech_100
Local mode config:

dataset:
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train-clean-100
  valid:
    - data_src_args:
        split: dev-clean
create_dataset:
  recipe_dir: ${recipe_dir}
  source_dir: ${dataset_dir}

What happens:
- local dataset module is loaded
- LibriSpeech100Builder validates source availability
- build() does not generate manifests; it only preserves the ready state
- LibriSpeech100Dataset(split=..., source_dir=...) scans the original corpus tree directly
This is the useful contrast with mini_an4:
- mini_an4: source preparation plus manifest build
- librispeech_100: source validation with direct raw-tree reading
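The raw-tree pattern can be sketched as a builder whose build step is a no-op. This is not the LibriSpeech100Builder source. RawTreeBuilder is a hypothetical class, and raising an error when the corpus is missing (rather than downloading it) is an assumption made for the example.

```python
from pathlib import Path


class RawTreeBuilder:
    """Hypothetical builder for a corpus that is read directly from its raw tree."""

    def is_source_prepared(self, source_dir, **kwargs):
        return Path(source_dir).is_dir()

    def prepare_source(self, source_dir, **kwargs):
        # Assumption: this sketch only validates and never downloads.
        if not Path(source_dir).is_dir():
            raise FileNotFoundError(f"corpus not found: {source_dir}")

    def is_built(self, source_dir, **kwargs):
        # No derived artifacts exist, so "built" mirrors source readiness.
        return self.is_source_prepared(source_dir, **kwargs)

    def build(self, source_dir, **kwargs):
        # Nothing to generate; the dataset class scans source_dir itself.
        return None


builder = RawTreeBuilder()
# Extra keys such as recipe_dir are accepted and ignored through **kwargs.
print(builder.is_built(source_dir=".", recipe_dir="."))
```

Accepting **kwargs in every method lets the same create_dataset block carry both recipe_dir and source_dir without each method having to name every key.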
Recommended practical rules
- Put recipe-local dataset code under dataset/, not src/.
- Keep DatasetBuilder idempotent.
- Keep is_source_prepared() and is_built() cheap.
- Put recipe-specific constructor args in data_src_args.
- Put builder-stage args in create_dataset.
Related pages
DataOrganizer
See how dataset references are organized into train, valid, and test splits.
Dataloader
See how datasets connect to collate functions and iterators.
Create dataset stage
Return to the stage-level dataset preparation flow.
Custom Dataset Guide
Read the end-user guide for Dataset, DatasetBuilder, and dataset config.
Data Pipeline Migration
Read the lifecycle-focused guide for porting old data preparation.
Parallel Data Preparation
See lifecycle diagrams and provider/runner use for heavy data prep.
