ESPnet3 Dataloader Configuration

Masao Someki

This page explains how the dataloader section in training.yaml controls batch construction and iteration during training.

The sections below cover the major paths: pipeline overview, iter_factory on/off, batch strategy comparison, chunk splitting, and category-balanced sampling.

Pipeline overview

Data flow in the ESPnet3 training pipeline

The collect_stats stage generates all files consumed by the subsequent train stage. Both stages share the same dataloader: block in training.yaml.

collect_stats (model.collect_feats()) → writes feats_shape and feats_stats.npz (for GlobalMVN) under stats_dir/train/ → train (ESPnetLightningModule)

① feats_shape — for batching

A text file recording the number of frames per sample.
Passed to batches.shape_files in SequenceIterFactory, enabling length-aware batching (reduced padding, OOM prevention).

# stats_dir/train/feats_shape
utt_001 312
utt_002 489
utt_003 156
utt_004 701
...
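
As a rough illustration of how such a file can be consumed, the sketch below parses shape lines into a dictionary. The helper name parse_shape_lines is hypothetical, and the handling of comma-separated extra dimensions (for example "312,80") is an assumption rather than something this page specifies.

```python
from typing import Dict, List


def parse_shape_lines(lines: List[str]) -> Dict[str, List[int]]:
    """Map utterance ID -> shape, e.g. 'utt_001 312' -> {'utt_001': [312]}."""
    shapes: Dict[str, List[int]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        utt_id, shape = line.split(maxsplit=1)
        # Assumption: extra dimensions, if present, are comma-separated.
        shapes[utt_id] = [int(x) for x in shape.split(",")]
    return shapes


example = ["utt_001 312", "utt_002 489", "utt_003 156", "utt_004 701"]
shapes = parse_shape_lines(example)
# The first entry of each shape is the frame count used for length-aware batching.
longest = max(shapes, key=lambda utt_id: shapes[utt_id][0])
```

With the four sample lines above, longest is the utterance with 701 frames.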

② feats_stats.npz — for normalization

A compressed numpy archive storing dataset-wide mean / variance.
Referenced by model.normalize: global_mvn. Written by collect_stats, read during train.

model:
  normalize: global_mvn
  normalize_conf:
    stats_file: ${stats_dir}/train/feats_stats.npz
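
The idea behind global mean/variance normalization can be sketched in plain Python. This is a minimal illustration, not the GlobalMVN implementation: it assumes the statistics are accumulated as a frame count, a per-dimension sum, and a per-dimension sum of squares, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Frames = List[List[float]]  # one utterance: list of frames, each a feature vector


def accumulate_stats(utterances: List[Frames]) -> Tuple[int, List[float], List[float]]:
    """Return (frame count, per-dimension sum, per-dimension sum of squares)."""
    dim = len(utterances[0][0])
    count, total, total_sq = 0, [0.0] * dim, [0.0] * dim
    for frames in utterances:
        for frame in frames:
            count += 1
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
    return count, total, total_sq


def global_mvn(frames: Frames, count: int, total: List[float],
               total_sq: List[float], eps: float = 1e-20) -> Frames:
    """Normalize every frame with the dataset-wide mean and standard deviation."""
    mean = [s / count for s in total]
    # Var[x] = E[x^2] - E[x]^2, floored at eps to avoid division by zero.
    var = [max(sq / count - m * m, eps) for sq, m in zip(total_sq, mean)]
    std = [math.sqrt(v) for v in var]
    return [[(x - m) / s for x, m, s in zip(frame, mean, std)] for frame in frames]


utterances = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0]]]
count, total, total_sq = accumulate_stats(utterances)
normalized = global_mvn(utterances[0], count, total, total_sq)
```

The first pass corresponds to what collect_stats computes once; the second function corresponds to what is applied to every batch during train.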

Shared structure of training.yaml

The dataloader: block is read by both collect_stats and train. Variable interpolation (${stats_dir}/...) ensures that collect_stats output paths and train input paths always match automatically.

stats_dir: ${exp_dir}/stats

dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  train: [...]
  valid: [...]

dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      batches:
        type: numel
        shape_files:
          - ${stats_dir}/train/feats_shape  # written by collect_stats
        batch_bins: 1200000

model:
  normalize_conf:
    stats_file: ${stats_dir}/train/feats_stats.npz  # written by collect_stats
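
To make the interpolation behaviour concrete, here is a small stand-alone resolver for ${...} references. It is only a sketch of the idea and not the configuration library ESPnet3 uses; the functions lookup and resolve and the exp_dir value are hypothetical.

```python
import re
from typing import Any, Dict

_PATTERN = re.compile(r"\$\{([^}]+)\}")


def lookup(config: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'model.normalize_conf.stats_file'."""
    node: Any = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(config: Dict[str, Any], value: str) -> str:
    """Substitute ${a.b} references until none remain (no cycle detection)."""
    while True:
        match = _PATTERN.search(value)
        if match is None:
            return value
        replacement = str(lookup(config, match.group(1)))
        value = value[: match.start()] + replacement + value[match.end():]


config = {
    "exp_dir": "exp/asr",  # hypothetical value for illustration
    "stats_dir": "${exp_dir}/stats",
    "model": {"normalize_conf": {"stats_file": "${stats_dir}/train/feats_stats.npz"}},
}
stats_file = resolve(config, lookup(config, "model.normalize_conf.stats_file"))
shape_file = resolve(config, "${stats_dir}/train/feats_shape")
```

Both paths resolve through the same stats_dir key, which is why changing exp_dir moves the collect_stats outputs and the train inputs together.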

with vs without iter_factory

iter_factory: null (standard PyTorch DataLoader)

Setting iter_factory to null falls back to a plain PyTorch DataLoader. This is simple to configure, but padding grows quickly when sample lengths vary. Use it when you do not need to run collect_stats.

dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory: null   # ESPnet iterator disabled
    batch_size: 8
    num_workers: 4
    shuffle: true
  valid:
    iter_factory: null
    batch_size: ${dataloader.train.batch_size}
    num_workers: ${dataloader.train.num_workers}
    shuffle: false

⚠ Fixed batch size of 8. When long and short samples are mixed, padding on short samples grows, hurting both GPU memory and throughput.
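
The padding cost of fixed-size batches can be estimated with a few lines of Python. The sketch below is illustrative only: the helper names are hypothetical, and the last four lengths are made-up values added to the frame counts from the feats_shape example.

```python
from typing import List


def fixed_size_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Cut the list into consecutive batches of batch_size samples."""
    return [lengths[i : i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_rate(batches: List[List[int]]) -> float:
    """Fraction of padded frames when each batch is padded to its longest sample."""
    real = sum(sum(batch) for batch in batches)
    padded = sum(max(batch) * len(batch) for batch in batches)
    return 1.0 - real / padded


lengths = [312, 489, 156, 701, 95, 640, 220, 530]
mixed = padding_rate(fixed_size_batches(lengths, 4))            # arbitrary order
grouped = padding_rate(fixed_size_batches(sorted(lengths), 4))  # similar lengths together
```

For these eight samples, grouping similar lengths lowers the padding rate compared with the arbitrary order, which is the effect the length-aware batch types aim for.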

iter_factory enabled (ESPnet IteratorFactory)

SequenceIterFactory, combined with a batch sampler, builds length-aware batches from feats_shape. The shuffle seed for each epoch is derived as seed + epoch, so the same batch order is reproduced when training restarts.

dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: true
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: numel        # ← choose strategy here
        batch_bins: 1200000
        shape_files:
          - ${stats_dir}/train/feats_shape
  valid:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: false
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: ${dataloader.train.iter_factory.batches.type}
        batch_bins: ${dataloader.train.iter_factory.batches.batch_bins}
        shape_files:
          - ${stats_dir}/valid/feats_shape

✓ Requires collect_stats. Once feats_shape is produced, batches are formed with variable sizes based on sequence length.

build_iter() call flow

ESPnetLightningModule calls iter_factory.build_iter(epoch) at the start of each epoch. Because the seed is fixed as seed + epoch, the same batch order is reproduced when training is resumed.

# How it's used internally (espnet3/components/modeling/lightning_module.py)
for epoch in range(max_epoch):
    # A fresh iterator is built each epoch; its seed is base_seed + epoch.
    iterator = iter_factory.build_iter(epoch)
    for uids, batch in iterator:
        # batch is unpacked as keyword arguments: speech, speech_lengths, text, ...
        model(**batch)
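
The seed + epoch scheme can be demonstrated without ESPnet. The following sketch uses Python's random module as a stand-in for whatever generator the factory uses; epoch_batch_order is a hypothetical helper.

```python
import random
from typing import List


def epoch_batch_order(batches: List[List[str]], base_seed: int, epoch: int) -> List[List[str]]:
    """Shuffle a copy of the batch list with an RNG seeded by base_seed + epoch."""
    order = list(batches)
    random.Random(base_seed + epoch).shuffle(order)
    return order


batches = [["utt_001", "utt_003"], ["utt_002"], ["utt_004"], ["utt_005", "utt_006"]]
first_run = epoch_batch_order(batches, base_seed=0, epoch=3)
resumed_run = epoch_batch_order(batches, base_seed=0, epoch=3)  # e.g. after a restart
```

Because the generator depends only on base_seed and epoch, a run resumed at epoch 3 sees the same order as the original run did at epoch 3.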

Batch strategy simulator

50 audio samples (1–15 s) visualised in real time. Same colour = same batch.

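Simulator controls: batch type and batch_bins (2500 in the demo). Reported metrics: number of batches, average batch size, average padding rate, and maximum length difference within a batch (in frames).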
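
A length-based strategy can be approximated by a greedy packer that sorts samples by length and closes a batch before its total frame count would exceed batch_bins. This is a simplified sketch under that assumption, not the algorithm of any ESPnet sampler; length_batches is a hypothetical name.

```python
from typing import List, Tuple


def length_batches(samples: List[Tuple[str, int]], batch_bins: int) -> List[List[str]]:
    """Greedy batching: sort by length, keep the frame total of each batch <= batch_bins."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bins = 0
    for utt_id, n_frames in sorted(samples, key=lambda s: s[1]):
        # Close the current batch if adding this sample would exceed the budget.
        if current and current_bins + n_frames > batch_bins:
            batches.append(current)
            current, current_bins = [], 0
        current.append(utt_id)
        current_bins += n_frames
    if current:
        batches.append(current)
    return batches


samples = [("utt_001", 312), ("utt_002", 489), ("utt_003", 156), ("utt_004", 701)]
packed = length_batches(samples, batch_bins=1000)
# packed == [["utt_003", "utt_001", "utt_002"], ["utt_004"]]
```

Short utterances end up in larger batches and long utterances in smaller ones, so batch size varies while the frame budget stays roughly constant.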

ChunkIterFactory — splitting long sequences into fixed-length chunks

Concept

Long audio is split into fixed-length windows (chunk_length) and batches are formed from those windows. Used in SpeechLM and long-form audio models. Because the model always sees fixed-length input, feats_shape is not required.

Demo controls: chunk_length = 300, shift_ratio = 0.5, batch_size, pool shuffle = off.

Utterance → chunk splitting

Each utterance is sliced at chunk_length. Same colour = same utterance.

Chunk pool → batch assembly

All chunks are pooled then grouped batch_size at a time. Every batch is the same length — padding = 0.

Demo readout: total chunks, total batches, chunk_length = 300 frames, shift = 150 frames (chunk_length × 0.5), padding = 0, feats_shape not needed.
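
The two steps described above, slicing utterances into windows and grouping the pooled windows, can be sketched as follows. The function names are hypothetical, and two behaviours are assumptions of this sketch rather than documented behaviour: utterances shorter than chunk_length are skipped, and an incomplete final batch is dropped.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, int, int]  # (utterance ID, start frame, end frame)


def split_into_chunks(lengths: Dict[str, int], chunk_length: int,
                      shift_ratio: float) -> List[Chunk]:
    """Slide a window of chunk_length frames with a hop of chunk_length * shift_ratio."""
    shift = max(1, int(chunk_length * shift_ratio))
    chunks: List[Chunk] = []
    for utt_id, n_frames in lengths.items():
        # range() is empty when the utterance is shorter than chunk_length.
        for start in range(0, n_frames - chunk_length + 1, shift):
            chunks.append((utt_id, start, start + chunk_length))
    return chunks


def group_chunks(chunks: List[Chunk], batch_size: int) -> List[List[Chunk]]:
    """Group pooled chunks batch_size at a time, dropping the incomplete tail."""
    full = len(chunks) - len(chunks) % batch_size
    return [chunks[i : i + batch_size] for i in range(0, full, batch_size)]


lengths = {"utt_001": 312, "utt_002": 489, "utt_003": 156, "utt_004": 701}
chunks = split_into_chunks(lengths, chunk_length=300, shift_ratio=0.5)
batches = group_chunks(chunks, batch_size=4)
```

Every chunk spans exactly chunk_length frames, so no padding is needed inside a batch.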

YAML config

CategoryChunkIterFactory (category-aware)

dataloader:
  train:
    iter_factory:
      _target_: espnet2.iterators.category_chunk_iter_factory.CategoryChunkIterFactory
      batch_size: 8
      chunk_length: 800
      batch_type: catbel
      sampler_args:
        category2utt_file: ${stats_dir}/train/utt2category
        batch_size: 8

Use when you need category-balanced batches on long sequences (e.g. speaker-balanced SpeechLM training).

Category iterators — balancing by speaker, language, dataset, etc.

catbel (CategoryBalanced)

Round-robin sampling so each batch contains an even mix of categories (speakers / languages / etc.). Best when you have a single dataset with class imbalance.

catpow (CategoryPower)

Power-law upsampling with upsampling_factor to boost under-resourced categories. Use within a single dataset when per-category data volumes differ (e.g. uneven speaker recording hours). Does not account for cross-dataset imbalance.

catpow_balance_dataset (DatasetPower)

Two-stage balancing for multi-dataset training: ① dataset_upsampling_factor corrects the volume gap between datasets, ② category_upsampling_factor further corrects within-category skew. Example: LibriSpeech (960 h) + CommonVoice (30 h) in a multilingual ASR recipe. Requires utt2dataset / dataset2utt.
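
Power-law upsampling is often written as a sampling probability proportional to the category size raised to an exponent. The sketch below uses that generic form; the exponent name alpha is hypothetical, and how it maps to upsampling_factor in ESPnet is not specified on this page, so treat the formula as an assumption.

```python
from typing import Dict


def power_law_probs(counts: Dict[str, int], alpha: float) -> Dict[str, float]:
    """Sampling probability per category, proportional to count ** alpha."""
    weights = {name: n ** alpha for name, n in counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


counts = {"spk_A": 20, "spk_B": 3, "spk_C": 5, "spk_D": 15, "spk_E": 7}
natural = power_law_probs(counts, alpha=1.0)    # proportional to data volume
flattened = power_law_probs(counts, alpha=0.5)  # small categories are boosted
uniform = power_law_probs(counts, alpha=0.0)    # every category equally likely
```

Lowering the exponent moves the distribution from the natural one toward a uniform one, which is how under-resourced categories get sampled more often.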

Demo controls: mode, batch_size. Example category sizes: spk_A (20), spk_B (3), spk_C (5), spk_D (15), spk_E (7).
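
The round-robin idea behind catbel can be sketched by cycling over categories while filling each batch. This is a minimal illustration, not the ESPnet sampler: round_robin_batches is a hypothetical helper, and reusing utterances of small categories once they are exhausted is an assumption of the sketch.

```python
from itertools import cycle
from typing import Dict, List


def round_robin_batches(category2utt: Dict[str, List[str]], batch_size: int,
                        n_batches: int) -> List[List[str]]:
    """Fill each batch by cycling over categories; small categories are reused."""
    per_category = {cat: cycle(utts) for cat, utts in category2utt.items()}
    category_order = cycle(sorted(category2utt))
    batches: List[List[str]] = []
    for _ in range(n_batches):
        # Each slot takes the next utterance from the next category in turn.
        batch = [next(per_category[next(category_order)]) for _ in range(batch_size)]
        batches.append(batch)
    return batches


category2utt = {
    "spk_A": ["utt_001", "utt_006"],
    "spk_B": ["utt_002"],
    "spk_C": ["utt_003", "utt_008"],
}
balanced = round_robin_batches(category2utt, batch_size=3, n_batches=2)
# balanced == [["utt_001", "utt_002", "utt_003"], ["utt_006", "utt_002", "utt_008"]]
```

Each batch contains one utterance per speaker even though spk_B has only a single utterance.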

Required file: utt2category

# stats_dir/train/utt2category  (category → utt id list)
spk_A utt_001 utt_006 utt_011 utt_016
spk_B utt_002 utt_007 utt_012
spk_C utt_003 utt_008
spk_D utt_004 utt_009 utt_014 utt_019 utt_024
spk_E utt_005 utt_010 utt_015

# utt2dataset
utt_001 librispeech
utt_002 commonvoice
utt_003 librispeech
...

# dataset2utt
librispeech utt_001 utt_003 utt_005 ...
commonvoice utt_002 utt_004 ...
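
The two dataset files carry the same information in opposite directions, so one can be derived from the other. The sketch below shows one way to do this; invert_mapping and format_lines are hypothetical helpers and not part of ESPnet.

```python
from collections import defaultdict
from typing import Dict, List


def invert_mapping(utt2label: Dict[str, str]) -> Dict[str, List[str]]:
    """Turn an utterance -> label mapping into label -> list of utterances."""
    label2utt: Dict[str, List[str]] = defaultdict(list)
    for utt_id, label in utt2label.items():
        label2utt[label].append(utt_id)
    return dict(label2utt)


def format_lines(label2utt: Dict[str, List[str]]) -> List[str]:
    """Render one line per label: the label followed by its utterance IDs."""
    return [f"{label} {' '.join(utts)}" for label, utts in label2utt.items()]


utt2dataset = {"utt_001": "librispeech", "utt_002": "commonvoice", "utt_003": "librispeech"}
dataset2utt = invert_mapping(utt2dataset)
lines = format_lines(dataset2utt)
# lines == ["librispeech utt_001 utt_003", "commonvoice utt_002"]
```

The same inversion applies to a per-utterance category mapping when a category-to-utterance list is needed.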

Required files — quick reference

Summary of files required by each batching strategy. ✓ = required, — = not needed.

sampler / mode	iterator	feats_shape	utt2category	dataset2utt / utt2dataset	primary use case
null (PyTorch DL)	—	—	—	—	—
unsorted	SequenceIter	—	—	—	Fixed batch size, random order
sorted	SequenceIter	✓	—	—	Reduced padding, fixed batch size
folded	SequenceIter	✓	—	—	Shrinks batch size for longer sequences
length	SequenceIter	✓	—	—	Bin-packing by total frames (1-D)
numel	SequenceIter	✓	—	—	Bin-packing by total elements (frames×dim)
chunk	ChunkIter	—	—	—	Fixed-length chunks, long-sequence models
catbel	CategoryIter	—	✓	—	Balanced category sampling
catpow	CategoryIter	✓	✓	—	Power-law upsampling per category
catpow_balance_dataset	CategoryIter	✓	✓	✓	Category + cross-dataset balancing
chunk + catbel	CategoryChunkIter	—	✓	—	Long sequences + category balance
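
The table can be turned into a simple pre-flight check that reports which files are still missing for a chosen mode. The dictionary below transcribes the table; the names REQUIRED_FILES and missing_files are hypothetical and not part of ESPnet.

```python
from typing import Dict, List, Set

# Required files per sampler / mode, transcribed from the table above.
REQUIRED_FILES: Dict[str, List[str]] = {
    "unsorted": [],
    "sorted": ["feats_shape"],
    "folded": ["feats_shape"],
    "length": ["feats_shape"],
    "numel": ["feats_shape"],
    "chunk": [],
    "catbel": ["utt2category"],
    "catpow": ["feats_shape", "utt2category"],
    "catpow_balance_dataset": ["feats_shape", "utt2category", "dataset2utt", "utt2dataset"],
    "chunk + catbel": ["utt2category"],
}


def missing_files(mode: str, available: Set[str]) -> List[str]:
    """Return the required files for `mode` that are not in `available`."""
    return [name for name in REQUIRED_FILES[mode] if name not in available]


todo = missing_files("catpow_balance_dataset", {"feats_shape", "utt2category"})
# todo == ["dataset2utt", "utt2dataset"]
```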

Collate function

collate_fn is defined at the top level of the dataloader block and shared across all splits. It merges a list of samples into a padded batch tensor.

dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
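
What a padding collate function does can be shown on plain Python lists. The sketch below is not CommonCollateFn: pad_collate is a hypothetical helper that handles a single integer sequence per sample and returns lists instead of tensors.

```python
from typing import List, Tuple


def pad_collate(samples: List[Tuple[str, List[int]]],
                pad_value: int = -1) -> Tuple[List[str], List[List[int]], List[int]]:
    """Return (utterance IDs, sequences padded to the longest one, original lengths)."""
    uids = [uid for uid, _ in samples]
    lengths = [len(seq) for _, seq in samples]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for _, seq in samples]
    return uids, padded, lengths


uids, padded, lengths = pad_collate([("utt_001", [5, 7, 2]), ("utt_002", [9])])
# padded == [[5, 7, 2], [9, -1, -1]], lengths == [3, 1]
```

Keeping the original lengths next to the padded data lets the model ignore positions filled with the pad value.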

When using iter_factory, the collate function is referenced via interpolation inside the factory config:

collate_fn: ${dataloader.collate_fn}

When using the standard PyTorch DataLoader path (iter_factory: null), the top-level collate_fn is picked up automatically — you do not repeat it under train or valid.

If the collate logic is recipe-specific, define it under egs3/<recipe>/<task>/src/ and reference it with _target_.

Sampler and batch_sampler overrides

For the standard DataLoader path, you can inject a custom sampler or batch_sampler at the top level of the dataloader block:

dataloader:
  sampler:
    _target_: torch.utils.data.RandomSampler
  train:
    iter_factory: null
    batch_size: 32
    num_workers: 4

sampler and batch_sampler are mutually exclusive — specifying both raises an error.
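
A check of this kind can be sketched as follows. The function name and error message are hypothetical; the page only states that specifying both options raises an error, not how ESPnet3 implements it.

```python
from typing import Any, Dict


def check_sampler_options(dataloader_conf: Dict[str, Any]) -> None:
    """Raise ValueError when both sampler and batch_sampler are configured."""
    has_sampler = dataloader_conf.get("sampler") is not None
    has_batch_sampler = dataloader_conf.get("batch_sampler") is not None
    if has_sampler and has_batch_sampler:
        raise ValueError("sampler and batch_sampler are mutually exclusive")


# A config with only one of the two options passes the check.
check_sampler_options({"sampler": {"_target_": "torch.utils.data.RandomSampler"}})
```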

Related pages

DataOrganizer

See how dataset splits are organized before batching.

Datasets

See the dataset-side builders and references used before batching.

Stats collection

See where shape files come from before the train dataloader uses them.