ESPnet3 Train Dataset
This page focuses on how training configs describe datasets today.
The full dataset resolution and builder story is documented in:
Current pattern
Training configs use DataOrganizer plus dataset reference entries.
    dataset:
      _target_: espnet3.components.data.data_organizer.DataOrganizer
      recipe_dir: ${recipe_dir}
      train:
        - data_src: mini_an4/asr
          data_src_args:
            split: train
      valid:
        - data_src: mini_an4/asr
          data_src_args:
            split: valid
      test:
        - name: test
          data_src: mini_an4/asr
          data_src_args:
            split: test
Each item can specify the dataset in three ways.
1. Omit data_src and use the local recipe dataset
If data_src is omitted, ESPnet3 loads:
    ${recipe_dir}/dataset/__init__.py
This is what mini_an4 and librispeech_100 do in their training configs.
    train:
      - data_src_args:
          split: train
    valid:
      - data_src_args:
          split: valid
2. Use a dataset tag
    train:
      - data_src: mini_an4/asr
        data_src_args:
          split: train
This resolves to:
    egs3.mini_an4.asr.dataset
3. Use an explicit module path
    train:
      - data_src: egs3.mini_an4.asr.dataset
        data_src_args:
          split: train
What is forwarded to the dataset constructor
Only data_src_args is passed to the exported Dataset class:
    Dataset(**data_src_args)
So a config like:
    - name: train-clean
      data_src: librispeech_100/asr
      data_src_args:
        split: train-clean-100
        recipe_dir: ${recipe_dir}
becomes:
    Dataset(split="train-clean-100", recipe_dir=recipe_dir)
name and transform are handled by DataOrganizer, not by Dataset.
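The three reference forms and the forwarding rule described above can be summarized in a small Python sketch. This is an illustration only, not ESPnet3's actual implementation: `resolve_data_src` and `split_entry` are hypothetical helper names, the `recipe_dir` value is a placeholder, and the tag-to-module rule (prefix `egs3.`, replace `/` with `.`, append `.dataset`) is inferred from the single `mini_an4/asr` example.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_data_src(data_src: Optional[str], recipe_dir: str) -> str:
    """Return the location a dataset reference points to (hypothetical helper)."""
    if data_src is None:
        # Form 1: data_src omitted, so the recipe-local dataset package is used.
        return f"{recipe_dir}/dataset/__init__.py"
    if "/" in data_src:
        # Form 2: dataset tag such as "mini_an4/asr".
        # Assumed rule, inferred from the one example in the text.
        return "egs3." + data_src.replace("/", ".") + ".dataset"
    # Form 3: already an explicit module path.
    return data_src


def split_entry(entry: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate Dataset kwargs from the keys the organizer keeps for itself."""
    dataset_kwargs = dict(entry.get("data_src_args", {}))
    organizer_keys = {k: v for k, v in entry.items() if k in ("name", "transform")}
    return dataset_kwargs, organizer_keys


# Placeholder recipe directory, used only for this illustration.
recipe_dir = "egs3/librispeech_100/asr"
entry = {
    "name": "train-clean",
    "data_src": "librispeech_100/asr",
    "data_src_args": {"split": "train-clean-100", "recipe_dir": recipe_dir},
}

module = resolve_data_src(entry.get("data_src"), recipe_dir)
dataset_kwargs, organizer_keys = split_entry(entry)
# Dataset(**dataset_kwargs) would receive only split and recipe_dir;
# name stays with the organizer.
```

Keeping `name` and `transform` out of the constructor arguments means a dataset class only needs to accept the keys listed under data_src_args.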
Current recipe examples
mini_an4
mini_an4 exports:
- dataset/__init__.py
- dataset/dataset.py
- dataset/builder.py
Its local dataset mode is:
    dataset:
      recipe_dir: ${recipe_dir}
      train:
        - data_src: mini_an4/asr
          data_src_args:
            split: train
      valid:
        - data_src: mini_an4/asr
          data_src_args:
            split: valid
librispeech_100
librispeech_100 uses the same local-mode pattern, but the dataset reads the raw LibriSpeech directory directly instead of manifest TSVs.
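To make "reads the raw LibriSpeech directory directly" concrete, here is a minimal sketch of a dataset that indexes a split from the standard LibriSpeech layout (`<split>/<speaker>/<chapter>/<speaker>-<chapter>.trans.txt` next to one `.flac` file per utterance). The class name `RawLibriSpeechDataset`, the `root` argument, and the returned dictionary keys are assumptions made for illustration; the actual recipe's Dataset class may differ, and audio decoding is left out.

```python
from pathlib import Path
from typing import Dict, List


class RawLibriSpeechDataset:
    """Hypothetical dataset that indexes a raw LibriSpeech split directory."""

    def __init__(self, root: str, split: str) -> None:
        self.items: List[Dict[str, str]] = []
        split_dir = Path(root) / split
        # Each chapter directory holds one "<speaker>-<chapter>.trans.txt" file
        # whose lines look like "<utt_id> <TRANSCRIPT>".
        for trans in sorted(split_dir.glob("*/*/*.trans.txt")):
            for line in trans.read_text(encoding="utf-8").splitlines():
                if not line.strip():
                    continue
                utt_id, _, text = line.partition(" ")
                self.items.append(
                    {
                        "utt_id": utt_id,
                        "audio_path": str(trans.parent / f"{utt_id}.flac"),
                        "text": text,
                    }
                )

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> Dict[str, str]:
        # Returns paths and text only; loading audio is out of scope here.
        return self.items[idx]
```

Because the index is built from the transcript files, no manifest TSV has to be generated before training.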
Train and valid requirements
DataOrganizer currently requires one of:
- both train and valid, or
- neither
If training is the goal, define both.
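The both-or-neither rule can be expressed as a short check. This is a sketch under assumptions: `check_train_valid` is a hypothetical name, the config is treated as a plain dictionary, and the exception type and message are not taken from ESPnet3.

```python
from typing import Any, Dict


def check_train_valid(dataset_cfg: Dict[str, Any]) -> None:
    """Raise if exactly one of train/valid is defined (hypothetical check)."""
    has_train = bool(dataset_cfg.get("train"))
    has_valid = bool(dataset_cfg.get("valid"))
    if has_train != has_valid:
        # ValueError is an assumption; the real error type may differ.
        raise ValueError("Define both train and valid, or neither.")


# Both present: accepted.
check_train_valid(
    {
        "train": [{"data_src_args": {"split": "train"}}],
        "valid": [{"data_src_args": {"split": "valid"}}],
    }
)
# Neither present (for example a test-only config): also accepted.
check_train_valid({"test": [{"name": "test", "data_src_args": {"split": "test"}}]})
```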
Test entries
test entries are optional for the train stage itself, but adding them in the same config is useful because:
- inference can reuse the same dataset definition
- measurement can reuse test-set names
Each test entry should define name, because that becomes the test-set name used in inference_dir/<test_name>/.
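As an illustration of how test-set names map to output directories, the sketch below builds one path per test entry. `inference_output_dirs` is a hypothetical helper and `exp/inference` is a placeholder for inference_dir; raising on a missing name is stricter than the "should" above and is used here only to make the dependency visible.

```python
from pathlib import Path
from typing import Any, Dict, List


def inference_output_dirs(
    test_entries: List[Dict[str, Any]], inference_dir: str
) -> Dict[str, Path]:
    """Map each test-set name to inference_dir/<test_name> (hypothetical helper)."""
    dirs: Dict[str, Path] = {}
    for index, entry in enumerate(test_entries):
        name = entry.get("name")
        if not name:
            # Without a name there is no directory to write results into.
            raise ValueError(f"test entry {index} has no name")
        dirs[name] = Path(inference_dir) / name
    return dirs


test_entries = [
    {"name": "test", "data_src": "mini_an4/asr", "data_src_args": {"split": "test"}},
]
output_dirs = inference_output_dirs(test_entries, "exp/inference")
```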
