Multi-GPU

About 3 min

Multi-GPU

If you come from plain PyTorch, multi-GPU code in ESPnet3 usually feels easier because you do not need to write the distributed orchestration yourself.

The main rule is:

Important

ESPnet3 has two different parallel layers.

model training parallelism -> trainer
runner/provider parallelism -> parallel

Do not mix them up.

1. Multi-GPU training

If your goal is:

train one model on multiple GPUs
use DDP / FSDP / multi-node Lightning strategies

then the main config is trainer.

Typical example:

num_device: 4
num_nodes: 1

trainer:
  accelerator: gpu
  devices: ${num_device}
  num_nodes: ${num_nodes}
  strategy: ddp

That is already the most common multi-GPU training path.

If you know raw PyTorch, this replaces a lot of manual work such as:

process launching
DDP setup
rank/world-size handling
trainer loop wiring

Training Config

See trainer.devices, num_nodes, strategy, precision, and accelerator settings.

Trainer

See how ESPnet3 delegates training orchestration to Lightning.

Training Loop

See what replaces a hand-written PyTorch training loop.

2. Multi-GPU or multi-worker helper execution

If your goal is instead:

parallel inference
runner-based preprocessing
provider/runner workloads
Dask-backed helper execution

then the main config is parallel.

Typical local multi-GPU example:

parallel:
  env: local_gpu
  n_workers: 4

This means:

use one Dask worker per visible GPU
keep the Python compute path the same
let ESPnet3 handle the worker setup

Parallel Config

See local_gpu, n_workers, and backend options.

Provider and Runner

See how worker envs and repeated compute functions are defined.

Inference Provider

See the provider pattern used by parallel inference.

Why this is simpler than hand-written multiprocessing

In raw PyTorch code, people often write:

torch.multiprocessing
manual worker setup
custom queue logic
device assignment code
ad-hoc cluster wrappers

In ESPnet3, you usually do not need to write that by hand.

For training:

Lightning handles process orchestration

For runner/provider workloads:

ESPnet3 handles env construction and worker execution

So the main job becomes config, not orchestration code.

A small training example

For multi-GPU training on one node:

trainer:
  accelerator: gpu
  devices: 4
  strategy: ddp

For multi-node training:

trainer:
  accelerator: gpu
  devices: 8
  num_nodes: 2
  strategy: ddp

This is the part that replaces a lot of manual DDP launcher logic.

A small provider/runner example

Suppose you have a runner-based workload and want one worker per local GPU.

parallel:
  env: local_gpu
  n_workers: 2

Then your code can stay conceptually simple:

provider = MyProvider(cfg)
runner = MyRunner(provider)
runner(range(len(dataset)))

ESPnet3 handles:

worker startup
env setup on each worker
dispatch of forward(idx, **env)

So you do not need to write your own multi-process wrapper first.

Local CPU vs local GPU vs cluster

The three common patterns are:

CPU-only local helper execution

parallel:
  env: local
  n_workers: 8
  options:
    threads_per_worker: 1

Multi-GPU local helper execution

parallel:
  env: local_gpu
  n_workers: 4

HPC-backed helper execution

parallel:
  env: slurm
  n_workers: 16
  options:
    queue: gpu
    cores: 4
    memory: 32GB
    walltime: "04:00:00"

The Python code can stay the same across all three. That is one of the main benefits.

Cluster Migration

See what replaces nj, run.pl, queue.pl, and shell scheduler options.

Parallel Runtime

See how local and Dask execution share the same runner path.

Parallel Data Prep

See when helper work should use provider/runner execution.

A common confusion

This is the confusion most PyTorch users hit first:

trainer.devices

Use this for model training.

parallel.n_workers

Use this for runner/provider workloads.

These are related to parallelism, but they do not control the same mechanism.

Training Config

See where multi-GPU training is configured through Lightning.

Parallel Config

See how `local`, `local_gpu`, and HPC backends are configured.

Provider / Runner

Read the execution contract behind runner-based parallel workloads.

Inference Provider

See the stage-facing example of dataset/model env construction.

Model and system

See how custom models and custom stage flows fit into recipes.