ESPnet3 Training Configuration
This page describes the current training.yaml used to configure the following stages:
Overview
| Section | Required | Description |
|---|---|---|
| recipe_dir, data_dir, exp_dir, ... | ✅ | path scaffold for outputs and cached assets |
| num_device, num_nodes | | resource counts for training |
| task | ✅ (or model) | ESPnet task entrypoint used to build an ESPnet2-style model |
| model | ✅ (or task) | Custom model definition |
| create_dataset | | dataset builder kwargs used by create_dataset |
| dataset | ✅ | train and valid dataset definitions resolved by DataOrganizer |
| tokenizer | | tokenizer or text-builder settings |
| dataloader | ✅ | collate, iterator, sampler, and sharding settings |
| optimizer/optimizers, scheduler/schedulers | ✅ | optimization setup |
| trainer | ✅ | Lightning trainer arguments |
| fit | | Lightning fit-time options |
| parallel | | parallel processing settings |
| best_model_criterion | | criteria used to compare model performance |
Path Scaffold
This section defines the paths used during training.
Default values
| Key | Description | Default value |
|---|---|---|
| num_device | Number of devices used in training | 1 |
| num_nodes | Number of nodes used in training | 1 |
| recipe_dir | Path to the recipe directory | . |
| data_dir | Path to the raw data directory | ${recipe_dir}/data |
| exp_tag | Identifier used to name the experiment | ${self_name:} |
| exp_dir | Path to the experiment directory | ${recipe_dir}/exp/${exp_tag} |
| stats_dir | Path to where the outputs of collect_stats are written | ${recipe_dir}/exp/stats |
exp_tag is important because it participates directly in experiment directory naming.
By default, TEMPLATE training.yaml uses:
exp_tag: ${self_name:}
That means exp_tag defaults to the config filename. For example, training_e_branchformer.yaml resolves to:
exp_tag: training_e_branchformer
See Resolvers for self_name.
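As a rough illustration of the naming rule described above, the following sketch derives an experiment tag from a config filename and builds the default experiment directory. The helper names `exp_tag_from_config` and `default_exp_dir` are hypothetical, and the "filename without extension" rule is an assumption taken from the example on this page, not the actual `self_name` resolver.

```python
from pathlib import PurePosixPath


def exp_tag_from_config(config_path: str) -> str:
    """Return the config filename without its extension (assumed self_name behavior)."""
    return PurePosixPath(config_path).stem


def default_exp_dir(recipe_dir: str, exp_tag: str) -> str:
    """Mirror the default `${recipe_dir}/exp/${exp_tag}` layout as a plain string."""
    return f"{recipe_dir}/exp/{exp_tag}"


tag = exp_tag_from_config("conf/training_e_branchformer.yaml")
print(tag)
print(default_exp_dir(".", tag))
```

Renaming the config file therefore changes the experiment directory unless exp_tag is set explicitly.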
Example
num_device: 1
num_nodes: 1
recipe_dir: .
data_dir: ${recipe_dir}/data
exp_tag: ${self_name:}
exp_dir: ${recipe_dir}/exp/${exp_tag}
stats_dir: ${recipe_dir}/exp/stats
dataset_dir: /path/to/your/dataset
Core config layout
This section should be read as a user-authored override config, not as the full TEMPLATE default.
Most recipes keep the default path scaffold from egs3/TEMPLATE/asr/conf/training.yaml and only override the task-specific parts they need.
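The idea of overriding only part of the defaults can be sketched as a recursive dictionary merge. This is an illustration only: `merge_overrides` is a hypothetical helper, the sample values are taken from the tables on this page, and the real merge is performed by the config loader, which may treat lists and interpolations differently.

```python
from copy import deepcopy


def merge_overrides(defaults: dict, overrides: dict) -> dict:
    """Recursively apply overrides on top of defaults without mutating either input."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key.
            merged[key] = merge_overrides(merged[key], value)
        else:
            # Scalars, lists, and new keys replace whatever was there.
            merged[key] = deepcopy(value)
    return merged


template_defaults = {
    "recipe_dir": ".",
    "optimizer": {"_target_": "torch.optim.Adam", "lr": 0.002},
}
recipe_overrides = {"optimizer": {"lr": 0.001}, "trainer": {"max_epochs": 10}}

config = merge_overrides(template_defaults, recipe_overrides)
print(config["optimizer"])
```

Keys the recipe does not mention, such as recipe_dir and optimizer._target_ here, keep their default values.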
Example:
task: espnet2.tasks.asr.ASRTask
create_dataset:
  recipe_dir: ${recipe_dir}
dataset:
  train:
    - data_src_args:
        split: train
  valid:
    - data_src_args:
        split: valid
tokenizer:
  vocab_size: 5000
dataloader:
  train:
    iter_factory:
      batches:
        type: sorted
        batch_size: 16
optimizer:
  lr: 0.002
scheduler:
  warmup_steps: 15000
trainer:
  log_every_n_steps: 100
  max_epochs: 10
model
If task is set, ESPnet3 uses the ESPnet2 task-side model definition. This is the normal way to reuse ESPnet2-style model config blocks.
If you want a custom model, leave task unset and instantiate the model directly via Hydra in model.
Example with task:
task: espnet2.tasks.asr.ASRTask
model:
  frontend: default
  encoder: e_branchformer
  decoder: transformer
  normalize: global_mvn
  normalize_conf:
    stats_file: ${stats_dir}/train/feats_stats.npz
In this case, model is interpreted as the task-side model config. This is usually the copy-and-adapt path from an ESPnet2 recipe config.
Example without task:
task:
model:
  _target_: my_project.models.MyASRModel
  vocab_size: 5000
  hidden_size: 256
In this case, model._target_ is required because ESPnet3 instantiates the model directly through Hydra.
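To make the role of `_target_` concrete, here is a minimal standard-library sketch of target-based instantiation. It is a simplified stand-in for what Hydra does, not Hydra's implementation (which also handles nested targets, partials, and positional arguments). The `instantiate` helper below is hypothetical, and a standard-library class stands in for a model class.

```python
from importlib import import_module


def instantiate(config: dict):
    """Build an object from a dict holding a dotted `_target_` plus keyword arguments."""
    if "_target_" not in config:
        raise KeyError("model._target_ is required when task is unset")
    module_path, _, attr = config["_target_"].rpartition(".")
    target = getattr(import_module(module_path), attr)
    # Every key other than _target_ is passed to the constructor as a keyword argument.
    kwargs = {k: v for k, v in config.items() if k != "_target_"}
    return target(**kwargs)


# fractions.Fraction is used only because it is importable everywhere.
obj = instantiate({"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2})
print(obj)
```

In the config above, vocab_size and hidden_size play the role of the keyword arguments.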
create_dataset
create_dataset is the config block for the create_dataset stage.
The values in this block are forwarded to DatasetBuilder methods.
See these pages for details:
Settings
| Key | Description |
|---|---|
| recipe_dir | Path to the recipe directory |
| source_dir | Path to the directory containing the code for the create_dataset stage |
Example
create_dataset:
  recipe_dir: ${recipe_dir}
  source_dir: ${dataset_dir}
dataset
Dataset entries use DataOrganizer and dataset references.
Each dataset entry may resolve by:
- dataset tag
- explicit module path
- omitted data_src -> ${recipe_dir}/dataset/__init__.py
- local recipes often omit data_src and use ${recipe_dir}/dataset/__init__.py
Only data_src_args is passed to Dataset(...).
See Dataset references and builders for data_src details.
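The two rules above (fall back to the recipe-local module when data_src is omitted, and forward only data_src_args to the constructor) can be sketched as follows. `ToyDataset` and `build_entry` are hypothetical names for illustration. The returned source string is only a placeholder for the module lookup, which is not reproduced here.

```python
class ToyDataset:
    """Hypothetical stand-in for a recipe's Dataset class."""

    def __init__(self, split: str):
        self.split = split


def build_entry(entry: dict, recipe_dir: str = "."):
    """Return (source, dataset) for one entry under dataset.train or dataset.valid."""
    # An omitted data_src falls back to the recipe-local dataset module.
    source = entry.get("data_src", f"{recipe_dir}/dataset/__init__.py")
    # Only data_src_args is forwarded to the constructor; other keys are not.
    dataset = ToyDataset(**entry.get("data_src_args", {}))
    return source, dataset


source, dataset = build_entry({"data_src_args": {"split": "train"}})
print(source, dataset.split)
```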
Example
dataset:
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train
    - data_src: egs3.librispeech_100.asr.dataset
      data_src_args:
        split: train-clean-100
  valid:
    - data_src_args:
        split: valid
dataloader
Two common modes:
- ESPnet iterator mode through iter_factory
- plain PyTorch DataLoader mode with iter_factory: null
See Dataloader and Collate for iter_factory details, supported iterator factories, and full config examples.
Examples
Sequence Iterator:
dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: true
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: numel
        shape_files:
          - ${stats_dir}/train/feats_shape
        batch_size: 4
        batch_bins: 4000000
multiple_iterator in ESPnet2 is not supported in current ESPnet3.
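For intuition about the numel batch type with batch_bins used in the Sequence Iterator example, the sketch below groups utterances so that the padded size of each batch (number of utterances times the longest length) stays within a budget. This is a simplified illustration under that assumption. The function `numel_batches` is hypothetical, and the actual ESPnet sampler has more options and may differ in detail.

```python
def numel_batches(lengths: dict[str, int], batch_bins: int) -> list[list[str]]:
    """Greedily group ids so that len(batch) * max_length stays within batch_bins."""
    batches: list[list[str]] = []
    current: list[str] = []
    # Sorting by length means the newest id is always the longest in the batch.
    for key in sorted(lengths, key=lengths.get):
        padded_size = (len(current) + 1) * lengths[key]
        if current and padded_size > batch_bins:
            batches.append(current)
            current = [key]
        else:
            current.append(key)
    if current:
        batches.append(current)
    return batches


toy_lengths = {"utt_a": 10, "utt_b": 20, "utt_c": 30, "utt_d": 40}
print(numel_batches(toy_lengths, batch_bins=60))
```

Short utterances end up in larger batches and long utterances in smaller ones, which keeps the memory footprint per batch roughly constant.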
Standard DataLoader:
dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory: null
    batch_size: 4
    num_workers: 2
    shuffle: true
  valid:
    iter_factory: null
    batch_size: 4
    num_workers: 2
    shuffle: false
optimizer / scheduler
scheduler_interval and scheduler_monitor work as follows:
| Tag | Description |
|---|---|
| scheduler_interval: step | step the scheduler after optimizer updates |
| scheduler_interval: epoch | step the scheduler at epoch boundaries |
| scheduler_monitor | metric name used by monitored schedulers such as ReduceLROnPlateau |
Notes:
- step is the common choice for schedulers such as WarmupLR
- epoch is used when the scheduler should react once per epoch
- scheduler_monitor is only needed for schedulers that require a monitored value
- use the same metric key that appears in logs, for example valid/loss
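Lightning accepts a scheduler configuration dictionary with `scheduler`, `interval`, and `monitor` keys from `configure_optimizers`. The sketch below assumes that scheduler_interval and scheduler_monitor map onto that dictionary. The helper `lr_scheduler_config` is hypothetical, and ESPnet3's own wiring may differ.

```python
from typing import Any, Optional


def lr_scheduler_config(
    scheduler: Any, interval: str = "step", monitor: Optional[str] = None
) -> dict:
    """Build a Lightning-style lr_scheduler dict from the two settings."""
    if interval not in ("step", "epoch"):
        raise ValueError(f"scheduler_interval must be 'step' or 'epoch', got {interval!r}")
    config = {"scheduler": scheduler, "interval": interval}
    # monitor is only attached for schedulers that need a metric, e.g. ReduceLROnPlateau.
    if monitor is not None:
        config["monitor"] = monitor
    return config


# A string placeholder stands in for a real scheduler object.
print(lr_scheduler_config("plateau-scheduler", interval="epoch", monitor="valid/loss"))
```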
Named multi-optimizer path:
See Multiple optimizers and schedulers for the full behavior.
Default Values
| Key | Default value |
|---|---|
| optimizer._target_ | torch.optim.Adam |
| optimizer.lr | 0.002 |
| scheduler._target_ | espnet2.schedulers.warmup_lr.WarmupLR |
| scheduler.warmup_steps | 15000 |
| scheduler_interval | step |
| parallel.env | local |
| parallel.n_workers | 1 |
Examples
Single Optimizer:
optimizer:
  _target_: torch.optim.Adam
  lr: 0.001
scheduler:
  _target_: espnet2.schedulers.warmup_lr.WarmupLR
  warmup_steps: 1000
scheduler_interval: step
scheduler_monitor:
Multiple Optimizers:
optimizers:
  generator:
    optimizer:
      _target_: torch.optim.Adam
      lr: 0.0002
    params: generator
    gradient_clip_val: 1.0
    gradient_clip_algorithm: norm
  discriminator:
    optimizer:
      _target_: torch.optim.Adam
      lr: 0.0002
    params: discriminator
schedulers:
  generator:
    scheduler:
      _target_: torch.optim.lr_scheduler.LinearLR
      total_iters: 1000
    interval: step
  discriminator:
    scheduler:
      _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
      patience: 2
    interval: epoch
    monitor: valid/discriminator/loss
trainer
trainer maps to Lightning trainer construction through ESPnet3LightningTrainer.
Example:
trainer:
  accelerator: auto
  devices: ${num_device}
  num_nodes: ${num_nodes}
  max_epochs: 10
  log_every_n_steps: 100
In multi-optimizer mode, trainer-level gradient clipping should not be used. See Multiple optimizers and schedulers for details.
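The clipping rule above can be checked before launching a run. The sketch below is a hypothetical pre-flight check, not part of ESPnet3. It assumes the config has been loaded into a plain dictionary and that trainer-level clipping is expressed through Lightning's `gradient_clip_val` trainer argument.

```python
def check_gradient_clipping(config: dict) -> None:
    """Raise if trainer-level clipping is combined with named optimizers."""
    trainer = config.get("trainer") or {}
    if "optimizers" in config and trainer.get("gradient_clip_val") is not None:
        raise ValueError(
            "In multi-optimizer mode, set gradient_clip_val per optimizer "
            "under `optimizers` instead of under `trainer`."
        )


# Per-optimizer clipping passes the check.
check_gradient_clipping(
    {
        "optimizers": {"generator": {"gradient_clip_val": 1.0}},
        "trainer": {"max_epochs": 10},
    }
)
print("config ok")
```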
parallel
This section configures parallel execution. Details are documented here:
Default Values
| Key | Default value |
|---|---|
| parallel.env | local |
| parallel.n_workers | 1 |
Examples
Minimal local example:
parallel:
  env: local
  n_workers: 1
Minimal SLURM example:
parallel:
  env: slurm
  n_workers: 8
  options:
    queue: gpu
    cores: 8
    processes: 1
    memory: 16GB
    # Quoted on purpose: YAML 1.1 loaders read a bare 30:00 as the base-60 integer 1800.
    walltime: "30:00"
    job_extra_directives:
      - "--gres=gpu:1"
fit
training_config.fit is forwarded to trainer.fit(...).
Fit-time overrides, such as the checkpoint to resume from, belong here.
Example
Resume from checkpoint:
fit:
  ckpt_path: ${exp_dir}/last.ckpt
Other Training Settings
Settings
| Key | Description |
|---|---|
| init | Weight initialization strategy |
| best_model_criterion | Criteria used to compare model performance |
Example
init: xavier_uniform
best_model_criterion:
  - - valid/loss
    - 10
    - min
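One plausible reading of each best_model_criterion entry is a triple of metric name, number of checkpoints to keep, and comparison mode. That reading is an assumption based on the example above, since this page does not spell out the semantics. Under it, selection could be sketched as follows, with `select_best` and the checkpoint names being hypothetical.

```python
def select_best(scores: dict[str, float], criterion: list) -> list[str]:
    """Return the checkpoint names to keep for one [metric, count, mode] entry."""
    _metric, count, mode = criterion
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    # Lower is better for "min", higher is better for "max".
    ranked = sorted(scores, key=scores.get, reverse=(mode == "max"))
    return ranked[:count]


# Made-up validation losses for three checkpoints.
valid_loss = {"epoch1.ckpt": 0.92, "epoch2.ckpt": 0.71, "epoch3.ckpt": 0.78}
print(select_best(valid_loss, ["valid/loss", 2, "min"]))
```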