Large-scale data

Masao SomekiAbout 3 min

Large-scale data

This guide covers the dataloader settings that matter most when training on large datasets — sharding, dynamic batching, and the relationship between data pipeline config and multi-GPU runs.

Two paths through the dataloader

ESPnet3 supports two iteration modes. Both are configured under dataloader.train and dataloader.valid.

Mode	When to use
`iter_factory` (ESPnet-style)	Dynamic batching by frame/token count, built-in sharding, shuffle with reproducibility
Standard DataLoader (`iter_factory: null`)	Fixed batch size, custom sampler, PyTorch-native iteration

For large speech datasets, iter_factory with batch_bins is almost always the right choice.

Dynamic batching with batch_bins

Fixed batch sizes waste GPU memory because sequences vary in length. batch_bins packs samples until the total frame count hits a target, giving roughly equal compute per batch regardless of sequence length.

dataloader:
  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: true
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: unsorted
        shape_files:
          - ${stats_dir}/train/feats_shape
        batch_bins: 4000000

batch_bins is measured in frames (or tokens, depending on the feature type). shape_files must point to the output of collect_stats, which writes per-utterance frame counts to ${stats_dir}/train/feats_shape.

Choosing batch_bins:

A rough starting point is to divide your target GPU memory budget by the memory cost of one frame. For 40 GB GPUs training on 80-dimensional filterbanks:

batch_bins: 2000000 → conservative, leaves room for long sequences
batch_bins: 4000000 → typical
batch_bins: 8000000 → aggressive; monitor for OOM on long-tail sequences

Increase batch_bins as long as GPU utilization rises without OOM errors. Long-tail sequences set the effective ceiling — if one utterance is unusually long, the batch it lands in may have only one sample.

Sharding for multi-GPU and multi-node

Each GPU needs a non-overlapping slice of the training data. Set total_shards and dist_world_size to activate this:

dataloader:
  train:
    total_shards: 16
    dist_world_size: 16

dist_world_size must equal num_nodes × num_device. total_shards must be divisible by dist_world_size.

For single-GPU runs, keep both at 1 (the default).

See Multi-node training for the full sharding rules and shard rotation formula.

Validation dataloader

The validation dataloader does not need shuffle, but it should mirror the training sharding settings to ensure each GPU only validates its own slice:

dataloader:
  valid:
    total_shards: ${dataloader.train.total_shards}
    dist_world_size: ${dataloader.train.dist_world_size}
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: false
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: ${dataloader.train.iter_factory.batches.type}
        shape_files:
          - ${stats_dir}/valid/feats_shape
        batch_bins: ${dataloader.train.iter_factory.batches.batch_bins}

Using interpolation keeps the validation config in sync with training automatically.

Standard DataLoader path

For recipes that do not need dynamic batching or ESPnet-style sharding, set iter_factory: null and use PyTorch DataLoader fields directly:

dataloader:
  train:
    iter_factory: null
    batch_size: 32
    num_workers: 4
    shuffle: true

Custom samplers are also supported at the top level of the dataloader block:

dataloader:
  sampler:
    _target_: torch.utils.data.WeightedRandomSampler
    num_samples: 10000
    replacement: true
  train:
    iter_factory: null
    batch_size: 32

sampler and batch_sampler are mutually exclusive.

Collecting stats for large datasets

collect_stats writes the feats_shape files that batch_bins depends on. It runs through the Dask-backed parallel layer and can be parallelized independently from training:

parallel:
  env: local
  n_workers: 8
  options:
    threads_per_worker: 1

For very large datasets on a cluster, use a JobQueue backend instead:

parallel:
  env: slurm
  n_workers: 16
  options:
    queue: cpu
    cores: 4
    memory: 16GB
    walltime: "02:00:00"

n_workers here controls how many Dask workers run stats collection concurrently — it is independent from trainer.devices.

Common mistakes

Leaving dist_world_size: 1 for a multi-GPU run. The dataloader will raise a RuntimeError at the start of training because the configured dist_world_size (1) does not match the actual distributed world size (number of GPUs).

Setting total_shards to a value not divisible by dist_world_size. For example, total_shards: 12 with dist_world_size: 8 will fail. Round total_shards up to the nearest multiple.

Using iter_factory without running collect_stats first.batch_bins batching reads shape_files written by collect_stats. If those files do not exist, the dataloader build will fail.

Setting batch_bins too high for your dataset's longest sequences. A single very long utterance forces a batch with only that sample, which reduces GPU utilization for that step. Filter extreme outliers from training data or cap sequence length if this becomes a bottleneck.

Multi-node training

Set trainer.num_nodes and dataloader sharding together.

Dataloader Config

Full reference for iter_factory, collate_fn, and batch strategy.

Stats collection

See how collect_stats writes the shape files used by batch_bins.