ESPnet3: Multi-GPU and Multi-Node Execution
ESPnet3 relies on PyTorch Lightning for distributed training and on the Provider/Runner abstraction for scalable inference or data processing. The same configuration runs locally, with multiple GPUs in a single job, or across SLURM clusters without modifying the Python code.
Which part do you change for multi-GPU?
| Concern | You configure | ESPnet3 / Lightning handles |
|---|---|---|
| Devices / nodes | trainer.devices, trainer.num_nodes | Spawning processes and setting up DDP |
| Strategy | trainer.strategy (ddp, ddp_spawn, etc.) | Communication, gradient sync, checkpointing |
| Cluster backend | parallel section (env: slurm, job options) | Dask client and job submission |
| Runners | Provider/Runner definitions for inference | Scheduling work across workers / GPUs |
- Training with PyTorch Lightning
Distributed training is configured directly in the experiment YAML file. The example below launches a data-parallel job on two nodes with four GPUs per node.
```yaml
num_device: 4
num_nodes: 2
trainer:
  accelerator: gpu
  strategy: ddp
  precision: 16-mixed
  gradient_clip_val: 1.0
  log_every_n_steps: 200
```
Lightning handles process spawning, communication, gradient accumulation, and checkpointing. No wrapper scripts are required because espnet3.components.trainer reads this configuration and forwards it to Lightning.
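For intuition, the trainer section maps onto the arguments of a Lightning Trainer. The sketch below shows a roughly equivalent manual instantiation; it only illustrates the mapping (espnet3.components.trainer performs this step for you), and lit_module / datamodule are placeholder names, not part of the recipe.
```python
# Rough manual equivalent of the YAML above, for illustration only;
# espnet3.components.trainer builds the Trainer from the config for you.
from lightning import Trainer

trainer = Trainer(
    accelerator="gpu",
    devices=4,              # num_device in the ESPnet3 config
    num_nodes=2,            # num_nodes in the ESPnet3 config
    strategy="ddp",
    precision="16-mixed",
    gradient_clip_val=1.0,
    log_every_n_steps=200,
)
# trainer.fit(lit_module, datamodule=datamodule)  # placeholders for your recipe's module and data
```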
When running under a scheduler (e.g., SLURM), make sure the submission command requests resources that match the configuration, for example:
```bash
sbatch --nodes=2 --gres=gpu:4 --cpus-per-task=8 train.sh
```
The training script itself is unchanged between local and cluster runs.
- Inference or evaluation with runners
For multi-GPU inference, decoding, or scoring jobs, ESPnet3 provides the BaseRunner class. A runner processes indices in parallel while an EnvironmentProvider constructs datasets and models on each worker.
```python
from espnet3.parallel.base_runner import BaseRunner
from espnet3.parallel.parallel import set_parallel
from espnet3.systems.base.inference_provider import InferenceProvider


class DecodeProvider(InferenceProvider):
    # Built once per worker; each worker gets its own dataset and model.
    @staticmethod
    def build_dataset(cfg):
        return load_eval_split(cfg.dataset)

    @staticmethod
    def build_model(cfg):
        model = build_model(cfg.model)
        return model.to(cfg.model.device)


class DecodeRunner(BaseRunner):
    # Called once per index; keyword arguments come from the provider's params.
    @staticmethod
    def forward(idx: int, *, dataset, model, beam_size=5):
        sample = dataset[idx]
        return {
            "utt_id": sample["utt_id"],
            "hyp": model.decode(sample, beam_size=beam_size),
        }


provider = DecodeProvider(cfg, params={"beam_size": 8})
runner = DecodeRunner(provider)

set_parallel(cfg.parallel_gpu)  # same config works locally or on SLURM
results = runner(range(len(eval_set)))
```
Workers automatically receive their own dataset/model instances. To bind one GPU per worker, specify env: slurm (or any Dask JobQueue backend) and pass job_extra_directives such as --gres=gpu:1 in the parallel configuration.
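As a rough illustration, a one-GPU-per-worker parallel configuration might look like the sketch below. Only env and job_extra_directives come from the description above; the remaining keys (n_workers, options, queue, cores, memory) are assumptions modeled on Dask JobQueue options, so check your recipe's parallel section for the exact schema.
```python
# Hypothetical sketch of a one-GPU-per-worker parallel config. Only "env"
# and "job_extra_directives" are named in the text above; the other field
# names are assumptions modeled on Dask JobQueue (SLURMCluster) options.
from omegaconf import OmegaConf

parallel_gpu = OmegaConf.create({
    "env": "slurm",        # or another Dask JobQueue backend
    "n_workers": 4,        # number of worker jobs to submit (assumed key)
    "options": {           # forwarded to the Dask cluster (assumed key)
        "queue": "gpu",
        "cores": 8,
        "memory": "32GB",
        "job_extra_directives": ["--gres=gpu:1"],  # one GPU per worker job
    },
})

set_parallel(parallel_gpu)  # then invoke the runner exactly as above
```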
- Local vs. cluster performance
The refactored runner supports three modes: local execution, synchronous cluster jobs, and asynchronous cluster submissions. The same decoding runner was benchmarked in #6178 on A40 GPUs with the OWSM-V4 medium (1B) model over 1,000 LibriSpeech test-clean utterances:
| Environment | #GPUs | Wall time (s) |
|---|---|---|
| local | 1 | 1220 |
| local | 2 | 695 |
| slurm / sync | 1 | 1336 |
| slurm / sync | 2 | 669 |
| slurm / sync | 4 | 369 |
In synchronous mode the driver waits for all workers to finish, whereas in asynchronous mode the submission script becomes a lightweight dispatcher and the worker jobs continue even if the driver exits early.
By combining Lightning for training and the Provider/Runner API for inference, ESPnet3 offers a uniform interface for single-GPU experiments, multi-GPU servers, and large-scale clusters.
