ESPnet3 Dataset References And Builders
This page is the main reference for how ESPnet3 resolves datasets from YAML and how the create_dataset stage interacts with recipe dataset modules.
Key implementations:
- espnet3.components.data.dataset_module
- espnet3.components.data.dataset_builder.DatasetBuilder
- espnet3.systems.base.system.BaseSystem.create_dataset
Dataset resolution from YAML
Each dataset entry is a small reference object. The important keys are:
- data_src
- data_src_args
Example:
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  recipe_dir: ${recipe_dir}
  train:
    - data_src: mini_an4/asr
      data_src_args:
        split: train
ESPnet3 parses that entry and then instantiates:
Dataset(**data_src_args)
from the resolved dataset module.
The three supported dataset reference forms
1. Dataset tag
data_src: mini_an4/asr
This resolves to:
egs3.mini_an4.asr.dataset
Tags are convenient for reusing another recipe's dataset module.
2. Explicit module path
data_src: egs3.mini_an4.asr.dataset
This imports the given module directly.
3. Omit data_src
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train
If data_src is omitted, ESPnet3 loads:
${recipe_dir}/dataset/__init__.py
This is the normal pattern for recipe-local datasets.
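The three reference forms can be summarized in a small sketch. The function `resolve_data_src` is a hypothetical helper, not ESPnet3's actual resolver, and the rule used to tell a tag from a module path (a tag contains `/`) is an assumption made for illustration. Only the tag-to-module mapping and the recipe-local fallback path come from the description above.

```python
from pathlib import Path
from typing import Optional


def resolve_data_src(data_src: Optional[str], recipe_dir: str) -> str:
    """Return where the dataset module would be loaded from.

    Hypothetical helper: the real resolver lives in ESPnet3 and may differ.
    """
    if data_src is None:
        # Form 3: data_src omitted, fall back to the recipe-local package.
        return str(Path(recipe_dir) / "dataset" / "__init__.py")
    if "/" in data_src:
        # Form 1 (assumed detection rule): a tag such as "mini_an4/asr"
        # maps to the module "egs3.mini_an4.asr.dataset".
        return "egs3." + data_src.replace("/", ".") + ".dataset"
    # Form 2: already a dotted module path, used as-is.
    return data_src


tag_target = resolve_data_src("mini_an4/asr", "/path/to/recipe")
module_target = resolve_data_src("egs3.mini_an4.asr.dataset", "/path/to/recipe")
local_target = resolve_data_src(None, "/path/to/recipe")
```

In this sketch the tag form and the explicit form resolve to the same module, which is why tags are a convenient shorthand when reusing another recipe's dataset.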
What data_src_args means
data_src_args is passed directly to the exported Dataset class.
Example:
data_src_args:
  split: test-clean
  recipe_dir: ${recipe_dir}
  source_dir: ${dataset_dir}
becomes:
Dataset(split="test-clean", recipe_dir=recipe_dir, source_dir=dataset_dir)
Top-level dataset entry fields such as:
- name
- transform
are not part of Dataset.__init__. They stay in organizer space.
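A minimal sketch of that separation, assuming the entry has already been loaded into a plain dict with interpolations resolved. The `Dataset` class here is a hypothetical stand-in with the three arguments from the example; the paths are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical dataset class; only data_src_args reach this constructor."""

    split: str
    recipe_dir: Optional[str] = None
    source_dir: Optional[str] = None


# One dataset entry as it might look after YAML interpolation (illustrative values).
entry = {
    "name": "test-clean",  # organizer-level field, never passed to Dataset
    "data_src_args": {  # constructor arguments
        "split": "test-clean",
        "recipe_dir": "/path/to/recipe",
        "source_dir": "/path/to/corpus",
    },
}

# Only the data_src_args mapping is unpacked into the constructor.
dataset = Dataset(**entry["data_src_args"])

# Everything that is not a dataset reference key stays with the organizer.
organizer_fields = {
    key: value
    for key, value in entry.items()
    if key not in ("data_src", "data_src_args")
}
```

Passing `name` or `transform` inside `data_src_args` would send them to the constructor, which a class like this one would reject with a `TypeError`.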
Expected dataset module structure
A dataset module should export:
from .builder import MyBuilder as DatasetBuilder
from .dataset import MyDataset as Dataset
This lets the same module support:
- normal dataset instantiation through Dataset
- create_dataset through DatasetBuilder
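The value of the two fixed export names is that callers can look them up without knowing the recipe's class names. The sketch below imitates this with an in-memory module; `MyDataset`, `MyBuilder`, the module name, and the two lookup helpers are all hypothetical, and ESPnet3's own lookup code may be structured differently.

```python
import types


class MyDataset:
    """Hypothetical recipe dataset."""

    def __init__(self, split):
        self.split = split


class MyBuilder:
    """Hypothetical recipe builder."""


# Stand-in for a recipe's dataset/__init__.py with the two expected exports.
module = types.ModuleType("my_recipe_dataset")
module.Dataset = MyDataset
module.DatasetBuilder = MyBuilder


def get_dataset_class(mod):
    # Used for normal dataset instantiation.
    return getattr(mod, "Dataset")


def get_builder_class(mod):
    # Used by the create_dataset stage.
    return getattr(mod, "DatasetBuilder")


train_set = get_dataset_class(module)(split="train")
builder = get_builder_class(module)()
```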
create_dataset call flow
BaseSystem.create_dataset() loops over all dataset entries in:
- dataset.train
- dataset.valid
- dataset.test
For each unique dataset source, it instantiates DatasetBuilder() and runs:
if not builder.is_source_prepared(**builder_kwargs):
    builder.prepare_source(**builder_kwargs)
if not builder.is_built(**builder_kwargs):
    builder.build(**builder_kwargs)
The builder_kwargs come from:
create_dataset:
  ...
in training.yaml.
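The call flow above can be sketched end to end with a stand-in builder. This is not the code of `BaseSystem.create_dataset()`: the function below, the `RecordingBuilder` class, the `builder_factory` argument, and the choice to key uniqueness on the `data_src` value are all assumptions made so the example runs on its own.

```python
class RecordingBuilder:
    """Stand-in builder that records which lifecycle methods actually ran."""

    def __init__(self):
        self.calls = []
        self._prepared = False
        self._built = False

    def is_source_prepared(self, **kwargs):
        return self._prepared

    def prepare_source(self, **kwargs):
        self.calls.append("prepare_source")
        self._prepared = True

    def is_built(self, **kwargs):
        return self._built

    def build(self, **kwargs):
        self.calls.append("build")
        self._built = True


def create_dataset(dataset_cfg, builder_kwargs, builder_factory):
    """Run the builder lifecycle once per unique data_src (assumed key)."""
    builders = {}
    for split in ("train", "valid", "test"):
        for entry in dataset_cfg.get(split, []):
            source = entry.get("data_src")  # None means the recipe-local module
            if source in builders:
                continue  # this source was already handled for another split
            builder = builder_factory(source)
            if not builder.is_source_prepared(**builder_kwargs):
                builder.prepare_source(**builder_kwargs)
            if not builder.is_built(**builder_kwargs):
                builder.build(**builder_kwargs)
            builders[source] = builder
    return builders


cfg = {
    "train": [{"data_src_args": {"split": "train"}}],
    "valid": [{"data_src_args": {"split": "valid"}}],
    "test": [{"name": "test", "data_src_args": {"split": "test"}}],
}
builders = create_dataset(
    cfg, {"recipe_dir": "/path/to/recipe"}, lambda source: RecordingBuilder()
)
```

All three entries omit `data_src`, so they share one source and the lifecycle runs once, not three times.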
Builder lifecycle in detail
is_source_prepared()
Cheap check for raw source availability.
Examples:
- extracted corpus exists
- source directory exists
- archive was already unpacked
Returns:
- True: raw source is available
- False: prepare_source() should run
prepare_source()
Makes raw source available.
Typical work:
- download archives
- extract corpora
- normalize incoming directory layout
- verify that required source directories exist
is_built()
Cheap check for task-ready artifacts.
Examples:
- manifest TSVs already exist
- precomputed files are already written
- for raw-directory-backed recipes, this may simply mirror source readiness
Returns:
- True: build() can be skipped
- False: build() should run
build()
Creates the files consumed by the dataset class or later stages.
Typical work:
- manifest generation
- audio conversion
- metadata generation
- derived split preparation
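The four lifecycle methods can be illustrated with a toy builder that works on a temporary directory. Everything here is hypothetical: `ToyBuilder`, the `source_dir` and `manifest_path` keyword names, and the text-file "corpus" are stand-ins chosen so the sketch is self-contained. A real builder would subclass ESPnet3's `DatasetBuilder` and do actual downloads or audio conversion.

```python
import tempfile
from pathlib import Path


class ToyBuilder:
    """Hypothetical builder: the 'raw source' is a directory of .txt files."""

    def is_source_prepared(self, source_dir, **kwargs):
        # Cheap check: only test that the directory exists.
        return Path(source_dir).is_dir()

    def prepare_source(self, source_dir, **kwargs):
        # A real builder would download or extract here; this writes two files.
        src = Path(source_dir)
        src.mkdir(parents=True, exist_ok=True)
        (src / "utt1.txt").write_text("hello world\n")
        (src / "utt2.txt").write_text("good morning\n")

    def is_built(self, manifest_path, **kwargs):
        # Cheap check: only test that the manifest file exists.
        return Path(manifest_path).is_file()

    def build(self, source_dir, manifest_path, **kwargs):
        rows = [
            f"{path.stem}\t{path.read_text().strip()}"
            for path in sorted(Path(source_dir).glob("*.txt"))
        ]
        # Write to a temporary name first so a crash leaves no partial manifest.
        tmp = Path(manifest_path).with_suffix(".tmp")
        tmp.write_text("\n".join(rows) + "\n")
        tmp.replace(manifest_path)


def run_lifecycle(builder, **kwargs):
    """Apply the check-then-act pattern and report which steps ran."""
    ran = []
    if not builder.is_source_prepared(**kwargs):
        builder.prepare_source(**kwargs)
        ran.append("prepare_source")
    if not builder.is_built(**kwargs):
        builder.build(**kwargs)
        ran.append("build")
    return ran


with tempfile.TemporaryDirectory() as root:
    kwargs = {"source_dir": f"{root}/raw", "manifest_path": f"{root}/train.tsv"}
    first = run_lifecycle(ToyBuilder(), **kwargs)
    second = run_lifecycle(ToyBuilder(), **kwargs)  # nothing left to do
    manifest = Path(kwargs["manifest_path"]).read_text()
```

The second run performs no work because both checks already return True, which is the idempotent behavior the lifecycle is designed around.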
Example 1: mini_an4
Local mode config:
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train
  valid:
    - data_src_args:
        split: valid
  test:
    - name: test
      data_src_args:
        split: test
create_dataset:
  recipe_dir: ${recipe_dir}
What happens:
- local dataset module ${recipe_dir}/dataset/__init__.py is loaded
- MiniAn4Builder is instantiated
- prepare_source() extracts AN4 if needed
- build() converts audio and writes manifest TSVs
- MiniAn4Dataset(split=...) reads those manifest files
Example 2: librispeech_100
Local mode config:
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train-clean-100
  valid:
    - data_src_args:
        split: dev-clean
create_dataset:
  recipe_dir: ${recipe_dir}
  source_dir: ${dataset_dir}
What happens:
- local dataset module is loaded
- LibriSpeech100Builder validates source availability
- build() does not generate manifests; it only preserves the ready state
- LibriSpeech100Dataset(split=..., source_dir=...) scans the original corpus tree directly
This is the useful contrast with mini_an4:
- mini_an4: source preparation plus manifest build
- librispeech_100: source validation with direct raw-tree reading
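For the raw-tree pattern, a builder can be very small. The sketch below is not the actual `LibriSpeech100Builder`. `RawTreeBuilder` is a hypothetical class, and raising `FileNotFoundError` in `prepare_source()` is an assumed way to express "validation only". It shows `is_built()` mirroring source readiness and `build()` doing nothing.

```python
import tempfile
from pathlib import Path


class RawTreeBuilder:
    """Hypothetical builder for a corpus that the dataset reads in place."""

    def is_source_prepared(self, source_dir, **kwargs):
        return Path(source_dir).is_dir()

    def prepare_source(self, source_dir, **kwargs):
        # Validation only (assumption): this sketch does not download anything.
        if not Path(source_dir).is_dir():
            raise FileNotFoundError(f"source_dir not found: {source_dir}")

    def is_built(self, **kwargs):
        # No derived artifacts exist, so "built" mirrors source readiness.
        return self.is_source_prepared(**kwargs)

    def build(self, **kwargs):
        # Nothing to generate; the dataset class scans the raw tree itself.
        return None


with tempfile.TemporaryDirectory() as root:
    builder = RawTreeBuilder()
    ready = builder.is_built(source_dir=root, recipe_dir="/path/to/recipe")
    missing = builder.is_built(source_dir=str(Path(root) / "absent"))
```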
Recommended practical rules
- Put recipe-local dataset code under dataset/, not src/.
- Keep DatasetBuilder idempotent.
- Keep is_source_prepared() and is_built() cheap.
- Put recipe-specific constructor args in data_src_args.
- Put builder-stage args in create_dataset.
