Cluster and parallel
If you are coming from ESPnet2, this is one of the biggest mental-model changes.
In ESPnet2, parallel execution is mostly orchestrated by shell scripts such as:
- egs2/<recipe>/asr1/asr.sh
- egs2/<recipe>/asr1/local/data.sh
In ESPnet3, the shell layer is much thinner. Parallel work moves into Python through:
- espnet3/parallel/parallel.py
- espnet3/parallel/env_provider.py
- espnet3/parallel/base_runner.py
- espnet3/systems/base/inference_provider.py
- espnet3/systems/base/inference_runner.py
ESPnet2: shell controls the parallelism
In ESPnet2, recipes usually expose shell variables such as:
- nj
- inference_nj
- gpu_inference
- train_cmd
- cuda_cmd
Then the recipe script splits inputs and fans out jobs itself.
Typical patterns in asr.sh are:
- utils/split_scp.pl ...
- JOB=1:${_nj} ...
- run.pl, queue.pl, slurm.pl
Conceptually, the script does this:
- count input lines
- choose _nj
- split one SCP or text file into _nj shards
- submit one shell job per shard
- merge the outputs afterward
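The five steps above can be restated in Python to make the control flow explicit. This is only an illustration, not ESPnet2 code: `split_lines`, `run_shard`, and `fan_out` are hypothetical names, and a thread pool stands in for the job submission that run.pl or queue.pl would perform.

```python
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, nj):
    """Split lines into nj contiguous shards whose sizes differ by at most one."""
    base, extra = divmod(len(lines), nj)
    shards, start = [], 0
    for j in range(nj):
        size = base + (1 if j < extra else 0)
        shards.append(lines[start:start + size])
        start += size
    return shards


def run_shard(shard):
    """Stand-in for the per-shard job; here it just upper-cases each line."""
    return [line.upper() for line in shard]


def fan_out(lines, requested_nj):
    # Steps 1 and 2: count input lines and choose _nj (never more jobs than lines).
    nj = max(1, min(requested_nj, len(lines)))
    # Step 3: split the input into _nj shards.
    shards = split_lines(lines, nj)
    # Step 4: run one job per shard (threads stand in for cluster jobs).
    with ThreadPoolExecutor(max_workers=nj) as pool:
        results = list(pool.map(run_shard, shards))
    # Step 5: merge the outputs in shard order.
    return [item for part in results for item in part]
```

In the shell version every one of these steps is spelled out per stage, which is where the stage-local loops come from.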
That is why ESPnet2 recipes often feel like:
- one big shell controller
- many stage-local shell loops
- one-off splitting and merging logic per stage
Example: ESPnet2 decoding pattern
The common ESPnet2 decoding flow looks like this:
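    # Never use more jobs than there are input lines.
    _nj=$(min "${inference_nj}" "$(<${key_file} wc -l)")

    # Build the list of shard files, then split the key file across them.
    split_scps=""
    for n in $(seq "${_nj}"); do
        split_scps+=" ${_logdir}/keys.${n}.scp"
    done
    utils/split_scp.pl "${key_file}" ${split_scps}

    # Launch one job per shard; JOB is replaced by the shard number.
    ${_cmd} JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
        python ... --key_file "${_logdir}"/keys.JOB.scp

The important point is not the exact command. The important point is where the responsibility lives: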
- the shell script decides the shard count
- the shell script writes shard files
- the shell script launches one job per shard
Example: ESPnet2 data preparation pattern
local/data.sh usually follows the same style.
One shell stage does:
- download data
- extract data
- run Python helpers
- sort files
- split train/dev/test
For example, egs2/an4/asr1/local/data.sh:
- downloads AN4
- runs local/data_prep.py
- sorts text, wav.scp, utt2spk
- creates train_dev and train_nodev
So in ESPnet2, "parallel or cluster behavior" is usually recipe-shell behavior.
ESPnet3: Python owns the execution pattern
ESPnet3 moves that responsibility into Python objects.
The core split is:
- EnvironmentProvider: build dataset/model/tokenizer/runtime env
- BaseRunner: apply one static compute function over indices
This means the execution pattern is no longer:
- "split files in shell, then call Python"
It becomes:
- "describe the runtime env in Python, then let the runner execute locally or on a cluster"
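A toy version of this split may help fix the idea. The sketch below is not the ESPnet3 API: `ToyEnvProvider`, `ToyRunner`, `build_env`, and `forward` are hypothetical names chosen for illustration, and a thread pool stands in for whatever backend the real runner uses.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyEnvProvider:
    """Hypothetical provider: builds everything a worker needs, as a dict."""

    def __init__(self, texts):
        self.texts = texts

    def build_env(self):
        # A real provider would build the dataset, model, tokenizer, and so on.
        return {"dataset": self.texts, "model": str.upper}


class ToyRunner:
    """Hypothetical runner: applies one static function to each index."""

    def __init__(self, provider, n_workers=1):
        self.provider = provider
        self.n_workers = n_workers

    @staticmethod
    def forward(idx, *, dataset, model):
        # Per-item computation; it sees only the index and the env.
        return idx, model(dataset[idx])

    def __call__(self, indices):
        env = self.provider.build_env()
        if self.n_workers <= 1:
            # Local, sequential path.
            return [self.forward(i, **env) for i in indices]
        # Parallel path: same function, same env, different executor.
        with ThreadPoolExecutor(max_workers=self.n_workers) as pool:
            return list(pool.map(lambda i: self.forward(i, **env), indices))
```

Because `forward` is static and receives the env explicitly, the same per-item code runs unchanged whether the indices are processed in a loop or handed to a pool.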
Side-by-side comparison
| Topic | ESPnet2 | ESPnet3 |
|---|---|---|
| Parallel control surface | shell vars like nj, inference_nj, train_cmd | YAML parallel block plus provider/runner code |
| Work splitting | shell-side SCP splitting | Python runner over indices |
| Worker environment | implicit in shell command line and filesystem | explicit env dict from provider |
| Cluster backend | run.pl, queue.pl, slurm.pl wrappers | Dask client built from parallel.env and parallel.options |
| Merge behavior | shell concatenation and stage-local scripts | runner hooks such as merge() |
| Reuse across local/HPC | often stage-specific shell logic | same Python code can run local or Dask |
What replaces nj and inference_nj
In ESPnet3, the closest replacement is usually:
- parallel.n_workers
- optional runner batch_size
But the meaning is slightly different.
nj in ESPnet2 usually means:
- "how many shell jobs should I split this file into?"
n_workers in ESPnet3 usually means:
- "how many worker processes should the runtime create?"
So the old and new knobs are related, but not identical.
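The difference can be made concrete with a small sketch. Both functions are hypothetical illustrations rather than ESPnet code, and threads stand in for worker processes: in the first, the job count decides how the input is cut up; in the second, the worker count only sizes the pool.

```python
from concurrent.futures import ThreadPoolExecutor


def nj_style(items, nj):
    """nj decides the shards: one job per shard, fixed before anything runs."""
    nj = max(1, min(nj, len(items)))
    return [items[j::nj] for j in range(nj)]


def n_workers_style(items, n_workers, fn):
    """n_workers only sizes the pool; every item remains its own task."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fn, items))
```

Changing `nj` changes which shard each item lands in, while changing `n_workers` leaves the result of `n_workers_style` untouched and only affects how many items run at once.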
- Parallel Config: See how n_workers and backend settings are expressed in YAML.
- Parallel Runtime: See how ESPnet3 maps work locally or through Dask.
- Config Diff: See where old nj-style recipe settings usually move.
What replaces run.pl / queue.pl
In ESPnet3, the backend choice is part of config.
Typical examples:
Local workers:

    parallel:
      env: local
      n_workers: 8

Local GPU workers:

    parallel:
      env: local_gpu
      n_workers: 4

Slurm cluster:

    parallel:
      env: slurm
      n_workers: 16
      options:
        queue: batch
        cores: 4
        memory: 32GB

So instead of changing shell wrappers, you change the parallel config and keep the Python execution path the same.
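To show how a config block like this could select a Dask backend, here is a minimal sketch. It is an assumption about the general shape, not the ESPnet3 implementation: `plan_backend` and `make_client` are hypothetical helpers, the mapping of `n_workers` onto `scale()` is a guess, and `local_gpu` is left out because this page does not say how GPU workers are created. The Dask calls themselves (`LocalCluster`, `Client`, and `SLURMCluster` from dask_jobqueue) are real library classes.

```python
def plan_backend(parallel_cfg):
    """Translate a `parallel` config dict into (cluster class name, kwargs)."""
    env = parallel_cfg["env"]
    options = dict(parallel_cfg.get("options") or {})
    if env == "local":
        return "LocalCluster", {"n_workers": parallel_cfg["n_workers"], **options}
    if env == "slurm":
        return "SLURMCluster", options
    raise NotImplementedError(f"this sketch only covers local and slurm, got {env!r}")


def make_client(parallel_cfg):
    """Build a Dask client; imports are lazy so planning works without Dask."""
    from dask.distributed import Client, LocalCluster

    name, kwargs = plan_backend(parallel_cfg)
    if name == "LocalCluster":
        cluster = LocalCluster(**kwargs)
    else:
        from dask_jobqueue import SLURMCluster

        cluster = SLURMCluster(**kwargs)
        # Assumption: n_workers is interpreted as the number of Dask workers.
        cluster.scale(parallel_cfg["n_workers"])
    return Client(cluster)
```

Whatever the real mapping looks like, the design point is the same: code that submits work to the client does not need to know which branch was taken.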
- Parallel Config: See local, local GPU, and cluster backend examples.
- Provider and Runner: See how backend-independent work is written once in Python.
- System and Stages: See how stage code receives config and launches stage behavior.
What replaces shell-side shard logic
In ESPnet3, shard logic lives in the runner layer.
That can mean:
- iterating plain indices locally
- mapping tasks through Dask
- using reducer hooks to write shard outputs
- merging shard outputs in merge()
The closest current examples are:
- espnet3/systems/base/inference_runner.py
- espnet3/components/data/collect_stats.py
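As a rough picture of shard outputs and a merge hook living in one Python object, consider the sketch below. `ToyShardRunner` and its `reduce` hook are hypothetical, and the file names are made up for illustration; only the idea of a `merge()` hook comes from this page.

```python
from pathlib import Path


class ToyShardRunner:
    """Hypothetical runner that writes one output file per shard, then merges."""

    def __init__(self, out_dir, shard_size):
        self.out_dir = Path(out_dir)  # must already exist
        self.shard_size = shard_size

    @staticmethod
    def forward(idx):
        # Per-item computation (stand-in for decoding one utterance).
        return f"utt{idx:03d} result-{idx}"

    def reduce(self, shard_id, indices):
        # Reducer hook: persist the results of one shard.
        path = self.out_dir / f"hyp.{shard_id}.txt"
        path.write_text("\n".join(self.forward(i) for i in indices) + "\n")
        return path

    def merge(self, shard_paths):
        # Merge hook: concatenate shard files in shard order.
        merged = self.out_dir / "hyp.txt"
        merged.write_text("".join(p.read_text() for p in shard_paths))
        return merged

    def __call__(self, num_items):
        starts = range(0, num_items, self.shard_size)
        paths = [
            self.reduce(k, range(s, min(s + self.shard_size, num_items)))
            for k, s in enumerate(starts)
        ]
        return self.merge(paths)
```

Calling `ToyShardRunner(out_dir, shard_size=2)(5)` writes three shard files and one merged `hyp.txt`, with no splitting or concatenation left in shell.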
- Provider and Runner: See BaseRunner hooks, worker envs, and merge behavior.
- Inference Provider: See the provider contract used by parallel inference.
- Stats Collection: See how collect_stats uses dataloader and runner-style execution.
Data preparation: what changes the most
This is the place where ESPnet2 users often expect more shell.
In ESPnet2:
- local/data.sh is often the center of gravity
- stage logic is mostly shell + small Python helpers
In ESPnet3:
- recipe-local dataset/builder.py owns source preparation and build checks
- heavier inner loops can move into provider/runner code
So the mapping is roughly:
| ESPnet2 | ESPnet3 |
|---|---|
| local/data.sh | dataset/builder.py |
| shell stage loop | build() plus optional provider/runner helper |
| split files in shell | iterate indices in a runner |
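To give the first two rows of this mapping a concrete shape, here is a minimal builder sketch. It is not the real DatasetBuilder interface: `ToyDatasetBuilder`, `is_built`, and the `manifest.tsv` file name are assumptions made for illustration; only the existence of a `build()` step and of build checks is taken from this page.

```python
from pathlib import Path


class ToyDatasetBuilder:
    """Hypothetical builder: a build check plus a build step, both in Python."""

    def __init__(self, source_lines, out_dir):
        self.source_lines = source_lines  # stand-in for downloaded raw data
        self.out_dir = Path(out_dir)

    @property
    def manifest(self):
        return self.out_dir / "manifest.tsv"

    def is_built(self):
        # Build check: skip the work when the output already exists.
        return self.manifest.exists()

    def build(self):
        # Build step: sort the entries and write a manifest file.
        if self.is_built():
            return self.manifest
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.manifest.write_text("\n".join(sorted(self.source_lines)) + "\n")
        return self.manifest
```

The sorting and file writing that a data.sh stage would do with shell utilities become ordinary, testable Python methods.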
- Data Pipeline: See how local/data.sh maps to DatasetBuilder and dataset modules.
- Parallel Data Prep: See when data preparation should use provider/runner execution.
- Dataset Config: See how prepared data becomes train, valid, and test config.
A good migration rule
When reading an ESPnet2 recipe, ask:
- Which part is only stage ordering?
- Which part is only config?
- Which part is the real per-item computation?
Then convert them like this:
- stage ordering -> run.py stage list
- config -> training.yaml, inference.yaml, metrics.yaml, ...
- per-item computation -> dataset builder, provider, or runner
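The first of these conversions, stage ordering as a plain list, might look roughly like the sketch below. This is a guess at the general pattern, not the contents of any real run.py: `run_stages` and the stage names used in the usage example are hypothetical.

```python
def run_stages(stages, selected, config):
    """Run the selected stages in their declared order, sharing one config."""
    available = dict(stages)
    unknown = [name for name in selected if name not in available]
    if unknown:
        raise ValueError(f"unknown stages: {unknown}")
    log = []
    for name, fn in stages:
        # Declared order wins, regardless of the order stages were requested in.
        if name in selected:
            log.append((name, fn(config)))
    return log


# Usage with made-up stage names and trivial stage functions.
stages = [
    ("create_dataset", lambda cfg: "built"),
    ("train", lambda cfg: f"trained for {cfg['epochs']} epochs"),
    ("infer", lambda cfg: "decoded"),
]
print(run_stages(stages, ["infer", "create_dataset"], {"epochs": 3}))
```

The ordering lives in one list, the settings live in the config object, and each stage function is free to delegate its per-item work to a builder, provider, or runner.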
When you still do not need provider/runner
Do not over-apply the abstraction.
If a step is only:
- one archive download
- one extraction
- one quick manifest rewrite
then plain builder.py code is often enough.
Use provider/runner when the work is actually parallel-shaped:
- many files
- many utterances
- many download targets
- one repeated compute kernel over indices
Related pages
- Data pipeline: See how dataset builders and recipe-local dataset modules replace old shell prep flows.
- Parallel overview: Read the developer-facing provider and runner architecture.
- Provider / Runner: See the core contract for worker env construction and execution.
- Parallel config: See how local, local GPU, and HPC backends are configured in YAML.
- Task to system: See how ESPnet2 task-level logic maps to ESPnet3 systems and stages.
