This page explains how the dataloader section in training.yaml controls batch construction and iteration during training.
The interactive demo below covers all major paths — pipeline overview, iter_factory on/off, batch strategy comparison, chunk splitting, and category-balanced sampling.
① Overview
② iter_factory on/off
③ Batch strategy simulator
④ ChunkIterFactory
⑤ Category iterators
⑥ Required files
Pipeline overview
Data flow in the ESPnet3 training pipeline
The collect_stats stage generates all files consumed by the subsequent train stage. Both stages share the same dataloader: block in training.yaml.
collect_stats (model.collect_feats()) → stats_dir/train/: feats_shape + feats_stats.npz (for GlobalMVN) → train (ESPnetLightningModule)
① feats_shape — for batching
A text file recording the number of frames per sample.
Passed to batches.shape_files in SequenceIterFactory, enabling length-aware batching (reduced padding, OOM prevention).
# stats_dir/train/feats_shape
utt_001 312
utt_002 489
utt_003 156
utt_004 701
...
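To make the file format concrete, here is a minimal sketch of reading such a file into a lookup table. The function name `read_shape_file` is hypothetical, and the handling of a comma-separated shape such as `312,80` is an assumption for files that also list a feature dimension; this is not the loader the pipeline itself uses.

```python
def read_shape_file(text):
    """Parse 'utt_id num_frames' lines into a dict, skipping blanks and comments."""
    lengths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        utt_id, shape = line.split(maxsplit=1)
        # Assumption: if several dimensions are listed ("312,80"), frames come first.
        lengths[utt_id] = int(shape.split(",")[0])
    return lengths


example = """\
# stats_dir/train/feats_shape
utt_001 312
utt_002 489
utt_003 156
utt_004 701
"""
lengths = read_shape_file(example)
# Utterance ids ordered from shortest to longest.
print(sorted(lengths, key=lengths.get))
```

Once lengths are available per utterance, a sampler can group utterances of similar length, which is what reduces padding.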
② feats_stats.npz — for normalization
A compressed numpy archive storing dataset-wide mean / variance.
Referenced by model.normalize: global_mvn. Written by collect_stats, read during train.
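As an illustration of what dataset-wide mean and variance normalization involves, the sketch below accumulates a frame count, a per-dimension sum, and a per-dimension sum of squares, then derives the mean and standard deviation from them. The accumulation layout is an assumption (this page does not specify what feats_stats.npz stores), and plain Python lists stand in for numpy arrays; all function names are hypothetical.

```python
import math


def accumulate(utterances):
    """Return (count, sum, sum_square) over every frame of every utterance."""
    dim = len(utterances[0][0])
    count, total, total_sq = 0, [0.0] * dim, [0.0] * dim
    for frames in utterances:
        for frame in frames:
            count += 1
            for d, x in enumerate(frame):
                total[d] += x
                total_sq[d] += x * x
    return count, total, total_sq


def mean_and_std(count, total, total_sq, eps=1e-20):
    """Turn the accumulated sums into a per-dimension mean and standard deviation."""
    mean = [s / count for s in total]
    var = [sq / count - m * m for sq, m in zip(total_sq, mean)]
    std = [math.sqrt(max(v, eps)) for v in var]
    return mean, std


def normalize(frames, mean, std):
    """Apply (x - mean) / std to every frame of one utterance."""
    return [[(x - m) / s for x, m, s in zip(frame, mean, std)] for frame in frames]


# Two toy utterances with two frames each and a feature dimension of 2.
utterances = [
    [[1.0, 10.0], [3.0, 10.0]],
    [[5.0, 40.0], [7.0, 40.0]],
]
mean, std = mean_and_std(*accumulate(utterances))
print(mean)  # [4.0, 25.0]
print(normalize(utterances[0], mean, std)[0])
```

Accumulating sums once and reusing them is what lets the statistics be written in one stage and read in another.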
model:
  normalize: global_mvn
  normalize_conf:
    stats_file: ${stats_dir}/train/feats_stats.npz
Shared structure of training.yaml
The dataloader: block is read by both collect_stats and train. Variable interpolation (${stats_dir}/...) ensures that collect_stats output paths and train input paths always match automatically.
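The effect of interpolation can be shown with a toy resolver. This is only a stand-in written for illustration, not the resolver the configuration system uses; the `exp/asr` value is a hypothetical path, and the sketch has no cycle detection.

```python
import re

_REFERENCE = re.compile(r"\$\{([^${}]+)\}")


def lookup(config, dotted_key):
    """Follow a dotted key such as 'dataloader.train.batch_size' through nested dicts."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(config, value):
    """Expand ${a.b} references in a string until none remain."""
    while isinstance(value, str):
        match = _REFERENCE.search(value)
        if match is None:
            break
        target = resolve(config, lookup(config, match.group(1)))
        if match.group(0) == value:
            value = target  # a whole-value reference keeps the target's type
        else:
            value = value[:match.start()] + str(target) + value[match.end():]
    return value


config = {
    "exp_dir": "exp/asr",  # hypothetical experiment directory
    "stats_dir": "${exp_dir}/stats",
    "shape_file": "${stats_dir}/train/feats_shape",
    "dataloader": {
        "train": {"batch_size": 8},
        "valid": {"batch_size": "${dataloader.train.batch_size}"},
    },
}
print(resolve(config, config["shape_file"]))
print(resolve(config, config["dataloader"]["valid"]["batch_size"]))
```

Because every path is derived from `stats_dir`, changing that one value moves both the files that are written and the files that are read.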
stats_dir: ${exp_dir}/stats
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  train: [...]
  valid: [...]
dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      batches:
        type: numel
        shape_files:
          - ${stats_dir}/train/feats_shape  # written by collect_stats
        batch_bins: 1200000
model:
  normalize_conf:
    stats_file: ${stats_dir}/train/feats_stats.npz  # written by collect_stats
With vs without iter_factory
iter_factory: null (standard PyTorch DataLoader)
Setting iter_factory to null falls back to a plain PyTorch DataLoader. It is simple to configure, but padding grows quickly when sample lengths vary. Use this path when you do not need the files produced by collect_stats.
dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory: null  # ESPnet iterator disabled
    batch_size: 8
    num_workers: 4
    shuffle: true
  valid:
    iter_factory: null
    batch_size: ${dataloader.train.batch_size}
    num_workers: ${dataloader.train.num_workers}
    shuffle: false
⚠ Fixed batch size of 8. When long and short samples are mixed, padding on short samples grows, hurting both GPU memory and throughput.
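The padding cost of a fixed batch size can be estimated with a short sketch. The lengths below are randomly generated stand-ins for frame counts, not measurements, and the helper names are hypothetical.

```python
import random


def padding_rate(batches):
    """Fraction of padded frames over all batches; each batch is a list of lengths."""
    real = sum(sum(batch) for batch in batches)
    padded = sum(max(batch) * len(batch) for batch in batches)
    return 1.0 - real / padded


def fixed_size_batches(lengths, batch_size):
    """Cut a list of lengths into consecutive batches of batch_size items."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


rng = random.Random(0)
lengths = [rng.randint(100, 1500) for _ in range(50)]  # synthetic frame counts

in_random_order = fixed_size_batches(lengths, 8)
in_sorted_order = fixed_size_batches(sorted(lengths), 8)
print(f"random order: {padding_rate(in_random_order):.1%} padding")
print(f"sorted order: {padding_rate(in_sorted_order):.1%} padding")
```

Every sample in a batch is padded to the longest one, so mixing short and long samples in the same batch is what drives the rate up.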
iter_factory enabled (ESPnet IteratorFactory)
SequenceIterFactory plus a batch sampler builds length-aware batches from feats_shape. The shuffling seed for each epoch is derived as seed + epoch, so the batch order is reproducible when training is restarted.
dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: true
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: numel  # ← choose strategy here
        batch_bins: 1200000
        shape_files:
          - ${stats_dir}/train/feats_shape
  valid:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: false
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: ${dataloader.train.iter_factory.batches.type}
        batch_bins: ${dataloader.train.iter_factory.batches.batch_bins}
        shape_files:
          - ${stats_dir}/valid/feats_shape
✓ Requires collect_stats. Once feats_shape is produced, batches are formed with variable sizes based on sequence length.
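How a bin budget such as batch_bins leads to variable batch sizes can be sketched with a greedy packer. This is an illustration of the idea only, not the sampler ESPnet ships; `pack_by_elements`, the feature dimension of 80, and the small budget of 100000 are all hypothetical choices made so the example stays readable.

```python
def pack_by_elements(lengths, feat_dim, batch_bins):
    """Greedy sketch: sort by length, grow a batch while its padded size fits the budget."""
    batches, current = [], []
    for utt_id, n_frames in sorted(lengths.items(), key=lambda kv: kv[1]):
        # Lengths arrive in ascending order, so the newest utterance is the longest
        # and determines the padded size of the whole batch.
        padded_elements = n_frames * feat_dim * (len(current) + 1)
        if current and padded_elements > batch_bins:
            batches.append(current)
            current = []
        current.append(utt_id)
    if current:
        batches.append(current)
    return batches


lengths = {"utt_001": 312, "utt_002": 489, "utt_003": 156, "utt_004": 701}
print(pack_by_elements(lengths, feat_dim=80, batch_bins=100_000))
# [['utt_003', 'utt_001'], ['utt_002'], ['utt_004']]
```

Short utterances share a batch while long ones end up alone, which is the behavior that keeps memory use roughly constant per step. In this sketch an utterance that exceeds the budget by itself still gets its own batch.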
build_iter() call flow
ESPnetLightningModule calls iter_factory.build_iter(epoch) at the start of each epoch. Because the seed is fixed as seed + epoch, the same batch order is reproduced when training is resumed.
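The reproducibility argument can be checked with a small stand-in that shuffles a batch list using a generator seeded with base_seed + epoch. It uses Python's `random` module for illustration; the function name is hypothetical and the real factory may use a different random number generator.

```python
import random


def epoch_batch_order(batches, base_seed, epoch):
    """Return the batches in the order used for one epoch, seeded by base_seed + epoch."""
    order = list(batches)
    random.Random(base_seed + epoch).shuffle(order)
    return order


batches = [f"batch_{i}" for i in range(10)]

first_run = epoch_batch_order(batches, base_seed=0, epoch=3)
after_resume = epoch_batch_order(batches, base_seed=0, epoch=3)
print(first_run == after_resume)  # True: same seed, same order
```

Since the order depends only on the base seed and the epoch index, a run resumed at epoch 3 sees the same sequence of batches as an uninterrupted run did at epoch 3.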
# How it's used internally (espnet3/components/modeling/lightning_module.py)
for epoch in range(max_epoch):
    iterator = iter_factory.build_iter(epoch)  # seed = base_seed + epoch
    for uids, batch in iterator:
        model(**batch)  # speech, speech_lengths, text, ...
Batch strategy simulator
50 audio samples (1–15 s) visualised in real time. Same colour = same batch.
The simulator reports the number of batches, the average batch size, the average padding rate, and the maximum length difference within a batch (in frames).
ChunkIterFactory — splitting long sequences into fixed-length chunks
Concept
Long audio is split into fixed-length windows (chunk_length) and batches are formed from those windows. Used in SpeechLM and long-form audio models. Because the model always sees fixed-length input, feats_shape is not required.
Utterance → chunk splitting
Each utterance is sliced at chunk_length. Same colour = same utterance.
Chunk pool → batch assembly
All chunks are pooled then grouped batch_size at a time. Every batch is the same length — padding = 0.
Demo settings: chunk_length = 300 frames, shift = 150 frames (×0.5), padding = 0, feats_shape not needed. The demo reports the total number of chunks and batches.
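The two steps above (splitting, then pooling and batching) can be sketched as follows, using the demo's chunk_length of 300 and shift of 150. Dropping the tail shorter than chunk_length, and dropping utterances shorter than one chunk, are assumptions of this sketch, and the function names are hypothetical.

```python
def split_into_chunks(n_frames, chunk_length, shift):
    """Return (start, end) windows of exactly chunk_length frames; a short tail is dropped."""
    return [(s, s + chunk_length) for s in range(0, n_frames - chunk_length + 1, shift)]


def chunk_batches(lengths, chunk_length, shift, batch_size):
    """Pool the chunks of all utterances, then group them batch_size at a time."""
    pool = [
        (utt_id, start, end)
        for utt_id, n_frames in lengths.items()
        for start, end in split_into_chunks(n_frames, chunk_length, shift)
    ]
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]


lengths = {"utt_001": 312, "utt_002": 489, "utt_003": 156, "utt_004": 701}
batches = chunk_batches(lengths, chunk_length=300, shift=150, batch_size=4)
for batch in batches:
    print(batch)
```

With these lengths, utt_001 yields one chunk, utt_002 two, utt_004 three, and utt_003 none because it is shorter than one chunk. Every chunk spans exactly 300 frames, so no padding is needed inside a batch.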
YAML config
CategoryChunkIterFactory (category-aware)
dataloader:
  train:
    iter_factory:
      _target_: espnet2.iterators.category_chunk_iter_factory.CategoryChunkIterFactory
      batch_size: 8
      chunk_length: 800
      batch_type: catbel
      sampler_args:
        category2utt_file: ${stats_dir}/train/utt2category
        batch_size: 8
Use when you need category-balanced batches on long sequences (e.g. speaker-balanced SpeechLM training).
Category iterators — balancing by speaker, language, dataset, etc.
catbel CategoryBalanced
Round-robin sampling so each batch contains an even mix of categories (speakers / languages / etc.). Best when you have a single dataset with class imbalance.
catpow CategoryPower
Power-law upsampling with upsampling_factor to boost under-resourced categories. Use within a single dataset when per-category data volumes differ (e.g. uneven speaker recording hours). Does not account for cross-dataset imbalance.
catpow_balance_dataset DatasetPower
Two-stage balancing for multi-dataset training: ① dataset_upsampling_factor corrects the volume gap between datasets, ② category_upsampling_factor further corrects within-category skew. Example: LibriSpeech (1800 h) + CommonVoice (30 h) in a multilingual ASR recipe. Requires utt2dataset / dataset2utt.
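To give a feel for power-law upsampling, the sketch below uses one common formulation in which a category with n utterances is sampled with probability proportional to n raised to an exponent. Treating the upsampling factor as that exponent is an assumption of this sketch; the exact formula used by the samplers is not stated on this page, and `power_law_probs` is a hypothetical name.

```python
def power_law_probs(counts, exponent):
    """Sampling probability per category, proportional to count ** exponent."""
    weights = {category: n ** exponent for category, n in counts.items()}
    total = sum(weights.values())
    return {category: w / total for category, w in weights.items()}


# Utterance counts per speaker (toy numbers).
counts = {"spk_A": 4, "spk_B": 3, "spk_C": 2, "spk_D": 5, "spk_E": 3}

for exponent in (1.0, 0.5, 0.0):
    probs = power_law_probs(counts, exponent)
    print(exponent, {c: round(p, 3) for c, p in probs.items()})
```

An exponent of 1 reproduces the natural proportions, 0 gives every category the same probability, and values in between raise the share of small categories such as spk_C without flattening the distribution completely. The two-stage variant applies the same idea twice, first across datasets and then across categories.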
YAML config example
Required file: utt2category
# stats_dir/train/utt2category (category → utt id list)
spk_A utt_001 utt_006 utt_011 utt_016
spk_B utt_002 utt_007 utt_012
spk_C utt_003 utt_008
spk_D utt_004 utt_009 utt_014 utt_019 utt_024
spk_E utt_005 utt_010 utt_015
# utt2dataset
utt_001 librispeech
utt_002 commonvoice
utt_003 librispeech
...
# dataset2utt
librispeech utt_001 utt_003 utt_005 ...
commonvoice utt_002 utt_004 ...
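A category → utterance list file in the layout shown above can be parsed and turned into round-robin batches with a few lines. This is a sketch of the round-robin idea only, not ESPnet's category sampler; both function names are hypothetical.

```python
def read_category2utt(text):
    """Parse 'category utt_id utt_id ...' lines, skipping blanks and comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        category, *utt_ids = line.split()
        mapping[category] = utt_ids
    return mapping


def round_robin_batches(category2utt, batch_size):
    """Take one utterance per category in turn until all are used, then cut into batches."""
    queues = {category: list(utts) for category, utts in category2utt.items()}
    order = []
    while any(queues.values()):
        for category in queues:
            if queues[category]:
                order.append(queues[category].pop(0))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


example = """\
spk_A utt_001 utt_006 utt_011 utt_016
spk_B utt_002 utt_007 utt_012
spk_C utt_003 utt_008
spk_D utt_004 utt_009 utt_014 utt_019 utt_024
spk_E utt_005 utt_010 utt_015
"""
batches = round_robin_batches(read_category2utt(example), batch_size=5)
print(batches[0])  # one utterance from each of the five speakers
```

Early batches contain every speaker once. Later batches lose that balance as small categories run out, which is the gap that upsampling strategies address.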
Required files — quick reference
Summary of files required by each batching strategy. ✓ = required, — = not needed.
| sampler / mode | iterator | feats_shape | utt2category | dataset2utt / utt2dataset | primary use case |
|---|---|---|---|---|---|
| null (PyTorch DL) | — | — | — | — | — |
| unsorted | SequenceIter | — | — | — | Fixed batch size, random order |
| sorted | SequenceIter | ✓ | — | — | Reduced padding, fixed batch size |
| folded | SequenceIter | ✓ | — | — | Shrinks batch size for longer sequences |
| length | SequenceIter | ✓ | — | — | Bin-packing by total frames (1-D) |
| numel | SequenceIter | ✓ | — | — | Bin-packing by total elements (frames×dim) |
| chunk | ChunkIter | — | — | — | Fixed-length chunks, long-sequence models |
| catbel | CategoryIter | — | ✓ | — | Balanced category sampling |
| catpow | CategoryIter | ✓ | ✓ | — | Power-law upsampling per category |
| catpow_balance_dataset | CategoryIter | ✓ | ✓ | ✓ | Category + cross-dataset balancing |
| chunk + catbel | CategoryChunkIter | — | ✓ | — | Long sequences + category balance |
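The table can also be encoded as a small lookup for a pre-flight check before training starts. The mapping below simply transcribes the rows above; `missing_files` is a hypothetical helper, not part of ESPnet.

```python
# Files required per batch type, transcribed from the table above.
REQUIRED_FILES = {
    "unsorted": set(),
    "sorted": {"feats_shape"},
    "folded": {"feats_shape"},
    "length": {"feats_shape"},
    "numel": {"feats_shape"},
    "chunk": set(),
    "catbel": {"utt2category"},
    "catpow": {"feats_shape", "utt2category"},
    "catpow_balance_dataset": {
        "feats_shape", "utt2category", "dataset2utt", "utt2dataset",
    },
}


def missing_files(batch_type, available):
    """Return the required files that are not in the list of available files."""
    return sorted(REQUIRED_FILES[batch_type] - set(available))


print(missing_files("numel", []))                # ['feats_shape']
print(missing_files("catpow", ["feats_shape"]))  # ['utt2category']
print(missing_files("chunk", []))                # []
```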
collate_fn is defined at the top level of the dataloader block and shared across all splits. It merges a list of samples into a padded batch tensor.
dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
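What "merging a list of samples into a padded batch" means can be shown on plain integer sequences, using the same pad value of -1 as in the config. This is a sketch of the padding step only; `pad_batch` is a hypothetical helper and does not reproduce what CommonCollateFn does with tensors.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the longest one and return (padded, original_lengths)."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


token_ids = [[5, 2, 9], [7], [3, 4]]
padded, lengths = pad_batch(token_ids, pad_value=-1)
print(padded)   # [[5, 2, 9], [7, -1, -1], [3, 4, -1]]
print(lengths)  # [3, 1, 2]
```

Keeping the original lengths alongside the padded batch lets the model ignore the padded positions, and a pad value of -1 cannot be confused with a valid non-negative token id.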
When using iter_factory, the collate function is referenced via interpolation inside the factory config:
collate_fn: ${dataloader.collate_fn}
When using the standard PyTorch DataLoader path (iter_factory: null), the top-level collate_fn is picked up automatically — you do not repeat it under train or valid.
If the collate logic is recipe-specific, define it under egs3/<recipe>/<task>/src/ and reference it with _target_.
For the standard DataLoader path, you can inject a custom sampler or batch_sampler at the top level of the dataloader block:
dataloader:
  sampler:
    _target_: torch.utils.data.RandomSampler
  train:
    iter_factory: null
    batch_size: 32
    num_workers: 4
sampler and batch_sampler are mutually exclusive — specifying both raises an error.
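A configuration check for this rule might look like the sketch below. The function name and the error message are hypothetical; the sketch only illustrates the constraint and is not the validation code used by the pipeline.

```python
def check_sampler_options(dataloader_conf):
    """Raise if both sampler and batch_sampler are set in a dataloader config dict."""
    has_sampler = dataloader_conf.get("sampler") is not None
    has_batch_sampler = dataloader_conf.get("batch_sampler") is not None
    if has_sampler and has_batch_sampler:
        raise ValueError("sampler and batch_sampler are mutually exclusive")


# Valid: only a sampler is configured.
check_sampler_options({"sampler": {"_target_": "torch.utils.data.RandomSampler"}})

# Invalid: both keys are present.
try:
    check_sampler_options({"sampler": {}, "batch_sampler": {}})
except ValueError as error:
    print(error)
```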