Data and dataloader

If you already know PyTorch, the dataset part of ESPnet3 will look familiar.

The short version is:

Important

Your recipe-local dataset can usually just be a normal torch.utils.data.Dataset.

ESPnet3 mainly adds:

config-driven dataset resolution
DataOrganizer for train/valid/test wiring
builder logic for dataset preparation
a dataloader config layer

The part that stays familiar

A recipe-local dataset is written the same way as any PyTorch dataset:

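from torch.utils.data import Dataset


class MyDataset(Dataset):
    """Recipe-local dataset that yields one sample dict per utterance."""

    def __init__(self, split: str):
        self.split = split
        # load_manifest is a placeholder for your own helper; it should
        # return a list of dicts with "wav" and "text" entries.
        self.samples = load_manifest(split)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        return {
            # load_audio is also a placeholder: read the waveform at this path.
            "speech": load_audio(sample["wav"]),
            "text": sample["text"],
        }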

That is already a valid starting point for many ESPnet3 recipes.
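The snippet above leaves load_manifest undefined. As a minimal sketch of what such a helper might look like, the example below assumes one JSON-lines file per split, with one object per line carrying the "wav" and "text" keys the dataset reads. The file layout, the root argument, and the demo file names are assumptions made for illustration, not an ESPnet3 convention.

```python
import json
import tempfile
from pathlib import Path


def load_manifest(split: str, root: str = "data") -> list:
    """Read <root>/<split>.jsonl into a list of {"wav": ..., "text": ...} dicts."""
    path = Path(root) / f"{split}.jsonl"
    samples = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(json.loads(line))
    return samples


# Tiny demo: write a two-line manifest to a temporary directory, then read it back.
demo_root = tempfile.mkdtemp()
demo_rows = [
    {"wav": "utt1.wav", "text": "hello"},
    {"wav": "utt2.wav", "text": "world"},
]
with open(Path(demo_root) / "train.jsonl", "w", encoding="utf-8") as f:
    for row in demo_rows:
        f.write(json.dumps(row) + "\n")

samples = load_manifest("train", root=demo_root)
```

Any format works as long as each entry exposes the fields that __getitem__ expects.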

Where the dataset lives

Typical recipe-local layout:

egs3/<recipe>/<system>/
  dataset/
    __init__.py
    dataset.py
    builder.py

__init__.py usually exports:

from .dataset import MyDataset as Dataset

Exporting the class under the name Dataset is how DataOrganizer finds the local dataset by default.

What builder.py is for

builder.py is the standard preparation hook used by create_dataset.

Use it for one-time preparation such as:

download
extraction
manifest generation
offline augmentation

So the simple rule is:

sample loading -> dataset.py
one-time preparation -> builder.py
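To make the "one-time" part concrete, here is a hedged sketch of an idempotent preparation step. It does not follow the real builder interface that create_dataset expects (see the Dataset References page for that contract). The prepare function, the .prepared marker file, and the JSON-lines manifest format are all hypothetical names chosen for this example.

```python
import json
import tempfile
from pathlib import Path


def prepare(data_dir: Path, utterances: dict) -> bool:
    """Write one JSONL manifest per split; return False if already prepared."""
    marker = data_dir / ".prepared"
    if marker.exists():
        return False  # one-time work already done, nothing to redo
    data_dir.mkdir(parents=True, exist_ok=True)
    for split, rows in utterances.items():
        with (data_dir / f"{split}.jsonl").open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    marker.touch()  # written last, so an interrupted run is retried next time
    return True


# Demo: the second call is a no-op because the marker file already exists.
demo_dir = Path(tempfile.mkdtemp()) / "data"
demo_utts = {
    "train": [{"wav": "a.wav", "text": "hello"}],
    "valid": [{"wav": "b.wav", "text": "world"}],
}
first_run = prepare(demo_dir, demo_utts)
second_run = prepare(demo_dir, demo_utts)
```

Keeping this work out of dataset.py means the dataset only ever reads files that already exist.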

Dataset References

See the recipe-local Dataset and DatasetBuilder module contract.

Create Dataset Stage

See how dataset preparation is launched from a stage.

Data Pipeline

See how recipe data prep maps to builder.py and dataset.py.

What ESPnet3 adds on top of a plain dataset

The main extra layer is DataOrganizer.

It wires config into:

train
valid
named test sets

So instead of manually constructing several datasets in Python code, ESPnet3 lets YAML define the split layout.

Minimal example:

dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train
  valid:
    - data_src_args:
        split: valid
  test:
    - name: test
      data_src_args:
        split: test

If data_src is omitted, ESPnet3 loads the local recipe dataset module.
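Conceptually, the wiring amounts to "instantiate one dataset per config entry, and keep test sets addressable by name". The sketch below illustrates that idea in plain Python. It is not DataOrganizer's implementation: ToyDataset, build_splits, and the returned dict layout are hypothetical, and the config dict only mirrors the fields from the YAML above that the sketch uses.

```python
class ToyDataset:
    """Stand-in for the recipe-local dataset; takes the same `split` argument."""

    def __init__(self, split: str):
        self.split = split
        self.samples = [f"{split}-{i}" for i in range(3)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# Python mirror of the YAML above (only the parts this sketch uses).
config = {
    "train": [{"data_src_args": {"split": "train"}}],
    "valid": [{"data_src_args": {"split": "valid"}}],
    "test": [{"name": "test", "data_src_args": {"split": "test"}}],
}


def build_splits(cfg: dict, dataset_cls) -> dict:
    """Instantiate one dataset per config entry."""
    return {
        "train": [dataset_cls(**e["data_src_args"]) for e in cfg.get("train", [])],
        "valid": [dataset_cls(**e["data_src_args"]) for e in cfg.get("valid", [])],
        # Test sets are kept separately and addressed by name.
        "test": {
            e["name"]: dataset_cls(**e["data_src_args"])
            for e in cfg.get("test", [])
        },
    }


splits = build_splits(config, ToyDataset)
```

The point of the config layer is that adding a second test set becomes one more list entry in YAML rather than more Python code.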

Dataset Config

See the train, valid, test, data_src, and data_src_args format.

DataOrganizer

See how config entries become concrete dataset objects.

Training Config

See where dataset config sits in training.yaml.

Dataloader: slightly more ESPnet-specific

This is the part where ESPnet3 adds a bit more structure than plain PyTorch.

You still have normal concepts such as:

batch size
shuffle
collate function

But they are usually set in YAML config rather than in Python code.

Minimal example:

dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1

  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: true
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: unsorted
        batch_size: 4

  valid:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: false
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: unsorted
        batch_size: 4
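As a rough mental model of the batches and shuffle settings, the sketch below chunks utterance keys into fixed-size batches without sorting by length, then optionally shuffles the order of whole batches per epoch. This is an illustrative assumption about how such an iterator factory behaves, not the SequenceIterFactory code. make_batches, epoch_batches, and the seed handling are hypothetical.

```python
import random


def make_batches(keys, batch_size):
    """Chunk keys into consecutive batches, in the order given (no length sorting)."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def epoch_batches(batches, epoch, shuffle, seed=0):
    """Return the batch order for one epoch; shuffling reorders whole batches."""
    order = list(batches)
    if shuffle:
        # Seeding with the epoch number keeps each epoch's order reproducible.
        random.Random(seed + epoch).shuffle(order)
    return order


keys = [f"utt{i}" for i in range(10)]
batches = make_batches(keys, batch_size=4)  # batch sizes: 4, 4, 2
train_epoch1 = epoch_batches(batches, epoch=1, shuffle=True)
valid_epoch1 = epoch_batches(batches, epoch=1, shuffle=False)
```

With shuffle set to false, as in the valid block, the batches are visited in the same order every epoch.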

Dataloader

See iterator, batch sampler, collate, and logging details.

Stats Collection

See how batch shapes and stats interact with dataloader behavior.

Train Stage

See how training consumes dataset and dataloader config.

Why this is still nice for PyTorch users

Even though the loader is more config-driven, the division of responsibilities is still clean:

dataset handles sample loading
collate handles batch formatting
config decides which dataset goes to which stage

So the mental model is still close to plain PyTorch.

About CommonCollateFn

If your dataset returns ordinary sample dicts, CommonCollateFn is usually worth keeping as the collate function.

It can help with:

automatic padding
sequence length handling
batch formatting that matches existing ESPnet model conventions

So a normal Dataset plus ESPnet collate is a good default combination.
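To show what "automatic padding" and "sequence length handling" mean in practice, here is a dependency-free sketch of the idea using plain lists. It is not CommonCollateFn itself, which works on arrays and tensors. pad_collate is a hypothetical helper, and the "_lengths" suffix is used here as an assumed naming convention.

```python
def pad_collate(samples, key, pad_value=-1):
    """Pad the variable-length lists under `key` and record original lengths."""
    seqs = [s[key] for s in samples]
    lengths = [len(seq) for seq in seqs]
    max_len = max(lengths)
    # Right-pad every sequence to the longest one in the batch.
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in seqs]
    return {key: padded, f"{key}_lengths": lengths}


samples = [
    {"text": [5, 9, 2]},
    {"text": [7]},
    {"text": [4, 4]},
]
batch = pad_collate(samples, "text", pad_value=-1)
```

Here batch["text"] is [[5, 9, 2], [7, -1, -1], [4, 4, -1]] and batch["text_lengths"] is [3, 1, 2]. The lengths let a model ignore the padded positions.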

When you need the detailed docs

You usually do not need to learn the whole dataloader stack up front.

Start simple:

write a plain Dataset
make training.yaml point to it
keep CommonCollateFn if it already works
only then study the iterator/batching details

Custom dataset

Read the finetuning-oriented overview of recipe-local datasets and builders.

Dataloader

Read the detailed loader, collate, and iterator behavior.

DataOrganizer

See how train, valid, and test datasets are assembled from YAML.

Dataset references

See how recipe-local dataset modules and builders are resolved.

Dataset Config

See the YAML dataset format used by training and inference.

Training Config

See where the dataloader and trainer config actually live.