Large-scale data
Large-scale data
This guide covers the dataloader settings that matter most when training on large datasets — sharding, dynamic batching, and the relationship between data pipeline config and multi-GPU runs.
Two paths through the dataloader
ESPnet3 supports two iteration modes. Both are configured under dataloader.train and dataloader.valid.
| Mode | When to use |
|---|---|
iter_factory (ESPnet-style) | Dynamic batching by frame/token count, built-in sharding, shuffle with reproducibility |
Standard DataLoader (iter_factory: null) | Fixed batch size, custom sampler, PyTorch-native iteration |
For large speech datasets, iter_factory with batch_bins is almost always the right choice.
Dynamic batching with batch_bins
Fixed batch sizes waste GPU memory because sequences vary in length. batch_bins packs samples until the total frame count hits a target, giving roughly equal compute per batch regardless of sequence length.
dataloader:
train:
iter_factory:
_target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
shuffle: true
collate_fn: ${dataloader.collate_fn}
batches:
type: unsorted
shape_files:
- ${stats_dir}/train/feats_shape
batch_bins: 4000000batch_bins is measured in frames (or tokens, depending on the feature type). shape_files must point to the output of collect_stats, which writes per-utterance frame counts to ${stats_dir}/train/feats_shape.
Choosing batch_bins:
A rough starting point is to divide your target GPU memory budget by the memory cost of one frame. For 40 GB GPUs training on 80-dimensional filterbanks:
batch_bins: 2000000→ conservative, leaves room for long sequencesbatch_bins: 4000000→ typicalbatch_bins: 8000000→ aggressive; monitor for OOM on long-tail sequences
Increase batch_bins as long as GPU utilization rises without OOM errors. Long-tail sequences set the effective ceiling — if one utterance is unusually long, the batch it lands in may have only one sample.
Sharding for multi-GPU and multi-node
Each GPU needs a non-overlapping slice of the training data. Set total_shards and dist_world_size to activate this:
dataloader:
train:
total_shards: 16
dist_world_size: 16dist_world_size must equal num_nodes × num_device. total_shards must be divisible by dist_world_size.
For single-GPU runs, keep both at 1 (the default).
See Multi-node training for the full sharding rules and shard rotation formula.
Validation dataloader
The validation dataloader does not need shuffle, but it should mirror the training sharding settings to ensure each GPU only validates its own slice:
dataloader:
valid:
total_shards: ${dataloader.train.total_shards}
dist_world_size: ${dataloader.train.dist_world_size}
iter_factory:
_target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
shuffle: false
collate_fn: ${dataloader.collate_fn}
batches:
type: ${dataloader.train.iter_factory.batches.type}
shape_files:
- ${stats_dir}/valid/feats_shape
batch_bins: ${dataloader.train.iter_factory.batches.batch_bins}Using interpolation keeps the validation config in sync with training automatically.
Standard DataLoader path
For recipes that do not need dynamic batching or ESPnet-style sharding, set iter_factory: null and use PyTorch DataLoader fields directly:
dataloader:
train:
iter_factory: null
batch_size: 32
num_workers: 4
shuffle: trueCustom samplers are also supported at the top level of the dataloader block:
dataloader:
sampler:
_target_: torch.utils.data.WeightedRandomSampler
num_samples: 10000
replacement: true
train:
iter_factory: null
batch_size: 32sampler and batch_sampler are mutually exclusive.
Collecting stats for large datasets
collect_stats writes the feats_shape files that batch_bins depends on. It runs through the Dask-backed parallel layer and can be parallelized independently from training:
parallel:
env: local
n_workers: 8
options:
threads_per_worker: 1For very large datasets on a cluster, use a JobQueue backend instead:
parallel:
env: slurm
n_workers: 16
options:
queue: cpu
cores: 4
memory: 16GB
walltime: "02:00:00"n_workers here controls how many Dask workers run stats collection concurrently — it is independent from trainer.devices.
Common mistakes
Leaving dist_world_size: 1 for a multi-GPU run. The dataloader will raise a RuntimeError at the start of training because the configured dist_world_size (1) does not match the actual distributed world size (number of GPUs).
Setting total_shards to a value not divisible by dist_world_size. For example, total_shards: 12 with dist_world_size: 8 will fail. Round total_shards up to the nearest multiple.
Using iter_factory without running collect_stats first.batch_bins batching reads shape_files written by collect_stats. If those files do not exist, the dataloader build will fail.
Setting batch_bins too high for your dataset's longest sequences. A single very long utterance forces a batch with only that sample, which reduces GPU utilization for that step. Filter extreme outliers from training data or cap sequence length if this becomes a bottleneck.
