Data and dataloader
If you already know PyTorch, the dataset part of ESPnet3 is not very exotic.
The short version is:
Important: Your recipe-local dataset can usually just be a normal torch.utils.data.Dataset.
ESPnet3 mainly adds:
- config-driven dataset resolution
- DataOrganizer for train/valid/test wiring
- builder logic for dataset preparation
- a dataloader config layer
The part that stays familiar
If you know PyTorch, this still looks normal:

    from torch.utils.data import Dataset


    class MyDataset(Dataset):
        def __init__(self, split: str):
            self.split = split
            # load_manifest is a placeholder for your own manifest reader.
            self.samples = load_manifest(split)

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            sample = self.samples[idx]
            return {
                # load_audio is a placeholder for your own audio loader.
                "speech": load_audio(sample["wav"]),
                "text": sample["text"],
            }

That is already a valid starting point for many ESPnet3 recipes.
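The snippet leaves load_manifest and load_audio undefined. As a rough sketch of what they could look like, here is a stdlib-only version. Everything in it is an assumption for illustration: the JSON-lines manifest at <root>/<split>.jsonl, the root argument, and the 16-bit mono WAV restriction are not ESPnet3 requirements.

```python
import json
import struct
import wave
from pathlib import Path


def load_manifest(split: str, root: str = "data") -> list[dict]:
    """Read one JSON object per line from <root>/<split>.jsonl (assumed layout)."""
    path = Path(root) / f"{split}.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_audio(path: str) -> list[int]:
    """Return the samples of a 16-bit mono PCM WAV file as Python ints."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        frames = wav.readframes(wav.getnframes())
    # Two bytes per sample, little-endian signed shorts.
    return list(struct.unpack(f"<{len(frames) // 2}h", frames))
```

In a real recipe you would more likely return a NumPy array or tensor from an audio library; the point here is only that both helpers are ordinary Python you write yourself.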
Where the dataset lives
Typical recipe-local layout:

    egs3/<recipe>/<system>/
      dataset/
        __init__.py
        dataset.py
        builder.py

__init__.py usually exports:

    from .dataset import MyDataset as Dataset

That is how DataOrganizer finds the local dataset by default.
What builder.py is for
builder.py is the standard preparation hook used by create_dataset.
Use it for one-time preparation such as:
- download
- extraction
- manifest generation
- offline augmentation
So the simple rule is:
- sample loading -> dataset.py
- one-time preparation -> builder.py
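To make the "one-time preparation" side concrete, here is a minimal sketch of manifest generation that is safe to run twice. The function name build_manifest, the raw layout (each .wav next to a same-named .txt transcript), and the JSON-lines output are all hypothetical; this is not the ESPnet3 DatasetBuilder interface, only the kind of work that belongs in builder.py.

```python
import json
from pathlib import Path


def build_manifest(raw_dir: str, out_path: str) -> bool:
    """Write one JSON line per .wav/.txt pair; return False if already built."""
    out = Path(out_path)
    if out.exists():
        return False  # preparation already ran, so do nothing
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for wav in sorted(Path(raw_dir).glob("*.wav")):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                continue  # skip audio that has no transcript
            text = txt.read_text(encoding="utf-8").strip()
            f.write(json.dumps({"wav": str(wav), "text": text}) + "\n")
    return True
```

The dataset class then only reads the finished manifest, so nothing expensive happens per sample or per epoch.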
See also:
- Dataset References: See the recipe-local Dataset and DatasetBuilder module contract.
- Create Dataset Stage: See how dataset preparation is launched from a stage.
- Data Pipeline: See how recipe data prep maps to builder.py and dataset.py.
What ESPnet3 adds on top of a plain dataset
The main extra layer is DataOrganizer.
It wires config into:
- train
- valid
- named test sets
So instead of manually constructing several datasets in Python code, ESPnet3 lets YAML define the split layout.
Minimal example:

    dataset:
      _target_: espnet3.components.data.data_organizer.DataOrganizer
      recipe_dir: ${recipe_dir}
      train:
        - data_src_args:
            split: train
      valid:
        - data_src_args:
            split: valid
      test:
        - name: test
          data_src_args:
            split: test

If data_src is omitted, ESPnet3 loads the local recipe dataset module.
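Conceptually, the YAML above replaces a few lines of hand-written construction code. The sketch below shows that idea in plain Python. It is not the DataOrganizer implementation: organize and ToyDataset are made-up names, and how ESPnet3 combines several train entries is not covered here, so the sketch simply keeps them in a list.

```python
from typing import Any, Callable


def organize(config: dict[str, Any], dataset_cls: Callable[..., Any]) -> dict[str, Any]:
    """Build datasets from a dict shaped like the YAML `dataset:` block."""

    def build(entry: dict[str, Any]) -> Any:
        # data_src_args become keyword arguments of the dataset constructor.
        return dataset_cls(**entry.get("data_src_args", {}))

    return {
        "train": [build(e) for e in config.get("train", [])],
        "valid": [build(e) for e in config.get("valid", [])],
        # Test sets are keyed by name so each one can be addressed separately.
        "test": {e["name"]: build(e) for e in config.get("test", [])},
    }


class ToyDataset:
    """Stand-in for the recipe-local Dataset; it only records its split."""

    def __init__(self, split: str):
        self.split = split


config = {
    "train": [{"data_src_args": {"split": "train"}}],
    "valid": [{"data_src_args": {"split": "valid"}}],
    "test": [{"name": "test", "data_src_args": {"split": "test"}}],
}
splits = organize(config, ToyDataset)
print(splits["test"]["test"].split)  # prints "test"
```

Note how data_src_args: {split: train} lines up with the split argument of MyDataset.__init__ shown earlier.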
See also:
- Dataset Config: See the train, valid, test, data_src, and data_src_args format.
- DataOrganizer: See how config entries become concrete dataset objects.
- Training Config: See where dataset config sits in training.yaml.
Dataloader: slightly more ESPnet-specific
This is the part where ESPnet3 adds a bit more structure than plain PyTorch.
You still have normal concepts such as:
- batch size
- shuffle
- collate function
But they are usually expressed in config.
Minimal example:

    dataloader:
      collate_fn:
        _target_: espnet2.train.collate_fn.CommonCollateFn
        int_pad_value: -1
      train:
        iter_factory:
          _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
          shuffle: true
          collate_fn: ${dataloader.collate_fn}
          batches:
            type: unsorted
            batch_size: 4
      valid:
        iter_factory:
          _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
          shuffle: false
          collate_fn: ${dataloader.collate_fn}
          batches:
            type: unsorted
            batch_size: 4

See also:
- Dataloader: See iterator, batch sampler, collate, and logging details.
- Stats Collection: See how batch shapes and stats interact with dataloader behavior.
- Train Stage: See how training consumes dataset and dataloader config.
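The _target_ keys in the dataloader config look like the convention used by Hydra's instantiate: a dotted path names the class, and the remaining keys become constructor arguments. Whether ESPnet3 calls Hydra directly is an assumption here. The stdlib-only sketch below shows the basic mechanism and ignores nested nodes and ${...} interpolation; the fractions.Fraction target is used only so the example runs without ESPnet installed.

```python
import importlib
from typing import Any


def instantiate(node: dict[str, Any]) -> Any:
    """Import the class named by `_target_` and call it with the other keys."""
    kwargs = {k: v for k, v in node.items() if k != "_target_"}
    module_name, _, attr = node["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    return target(**kwargs)


# Stand-in target; a real config would name an ESPnet class instead.
value = instantiate({"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4})
print(value)  # prints "3/4"
```

Read this way, the collate_fn block is just "construct CommonCollateFn with int_pad_value=-1", written in YAML instead of Python.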
Why this is still nice for PyTorch users
Even though the loader is more config-driven, the split is still clean:
- dataset handles sample loading
- collate handles batch formatting
- config decides which dataset goes to which stage
So the mental model is still close to plain PyTorch.
About CommonCollateFn
If your dataset returns ordinary sample dicts, CommonCollateFn is often still a very useful default.
It can help with:
- automatic padding
- sequence length handling
- batch formatting that matches existing ESPnet model conventions
So a normal Dataset plus ESPnet collate is a good default combination.
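If "automatic padding" and "sequence length handling" sound abstract, here is a toy version of the idea using plain lists. It is not how CommonCollateFn is implemented (the real one works on arrays and tensors and has its own input format), and the <key>_lengths naming is an assumption based on the usual ESPnet convention.

```python
def pad_collate(samples: list[dict[str, list[int]]], pad_value: int = -1) -> dict[str, list]:
    """Pad every key to the longest sequence in the batch and record true lengths."""
    batch: dict[str, list] = {}
    for key in samples[0]:
        seqs = [s[key] for s in samples]
        longest = max(len(seq) for seq in seqs)
        # Right-pad shorter sequences so the batch is rectangular.
        batch[key] = [seq + [pad_value] * (longest - len(seq)) for seq in seqs]
        # Keep the original lengths so padding can be masked out later.
        batch[f"{key}_lengths"] = [len(seq) for seq in seqs]
    return batch


batch = pad_collate([{"text": [5, 6, 7]}, {"text": [8]}])
print(batch["text"])          # prints [[5, 6, 7], [8, -1, -1]]
print(batch["text_lengths"])  # prints [3, 1]
```

The pad value of -1 mirrors int_pad_value: -1 in the config example.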
When you need the detailed docs
You usually do not need to learn the whole dataloader stack up front.
Start simple:
- write a plain Dataset
- make training.yaml point to it
- keep CommonCollateFn if it already works
- only then study the iterator/batching details
Related pages
- Custom dataset: Read the finetuning-oriented overview of recipe-local datasets and builders.
- Dataloader: Read the detailed loader, collate, and iterator behavior.
- DataOrganizer: See how train, valid, and test datasets are assembled from YAML.
- Dataset references: See how recipe-local dataset modules and builders are resolved.
- Dataset Config: See the YAML dataset format used by training and inference.
- Training Config: See where the dataloader and trainer config actually live.
