ESPnet3 Training Configuration

Masao Someki, Elias Naske

This page describes the current training.yaml, which configures the training-side stages such as create_dataset and collect_stats, both referenced below.

Overview

Section	Required	Description
`recipe_dir`, `data_dir`, `exp_dir`, ...	✅	path scaffold for outputs and cached assets
`num_device`, `num_nodes`		resource counts for training
`task`	✅ (or `model`)	ESPnet task entrypoint used to build an ESPnet2-style model
`model`	✅ (or `task`)	Custom model definition
`create_dataset`		dataset builder kwargs used by `create_dataset`
`dataset`	✅	train and valid dataset definitions resolved by `DataOrganizer`
`tokenizer`		tokenizer or text-builder settings
`dataloader`	✅	collate, iterator, sampler, and sharding settings
`optimizer`/`optimizers`, `scheduler`/`schedulers`	✅	optimization setup
`trainer`	✅	Lightning trainer arguments
`fit`		Lightning fit-time options
`parallel`		parallel processing settings
`best_model_criterion`		criteria used to compare model performance

Path Scaffold

This section defines the paths and resource counts used during training.

Default values

Key	Description	Default value
`num_device`	Number of devices used in training	`1`
`num_nodes`	Number of nodes used in training	`1`
`recipe_dir`	Path to the recipe directory	`.`
`data_dir`	Path to the raw data directory	`${recipe_dir}/data`
`exp_tag`	Identifier used to name the experiment	`${self_name:}`
`exp_dir`	Path to the experiment directory	`${recipe_dir}/exp/${exp_tag}`
`stats_dir`	Path to where the outputs of `collect_stats` are written	`${recipe_dir}/exp/stats`

exp_tag matters because it is the last path component of the default exp_dir, so it directly names the experiment directory.

By default, TEMPLATE training.yaml uses:

exp_tag: ${self_name:}

That means exp_tag defaults to the config filename without its extension. For example, training_e_branchformer.yaml resolves to:

exp_tag: training_e_branchformer

See Resolvers for self_name.
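
As a small sketch of overriding the default, a config can set exp_tag explicitly when the filename alone is not descriptive enough. The tag value below is hypothetical; the resolved path follows from the exp_dir default in the table above.

```yaml
# training_e_branchformer.yaml
# Hypothetical tag; any string that is valid in a directory name should work.
exp_tag: e_branchformer_lr2e-3

# With the defaults exp_dir: ${recipe_dir}/exp/${exp_tag} and recipe_dir: .
# the experiment directory resolves to ./exp/e_branchformer_lr2e-3
```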

Example

num_device: 1
num_nodes: 1

recipe_dir: .
data_dir: ${recipe_dir}/data
exp_tag: ${self_name:}
exp_dir: ${recipe_dir}/exp/${exp_tag}
stats_dir: ${recipe_dir}/exp/stats
dataset_dir: /path/to/your/dataset

Core config layout

This section should be read as a user-authored override config, not as the full TEMPLATE default.

Most recipes keep the default path scaffold from egs3/TEMPLATE/asr/conf/training.yaml and only override the task-specific parts they need.

Example:

task: espnet2.tasks.asr.ASRTask

create_dataset:
  recipe_dir: ${recipe_dir}

dataset:
  train:
    - data_src_args:
        split: train
  valid:
    - data_src_args:
        split: valid

tokenizer:
  vocab_size: 5000

dataloader:
  train:
    iter_factory:
      batches:
        type: sorted
        batch_size: 16

optimizer:
  lr: 0.002

scheduler:
  warmup_steps: 15000

trainer:
  log_every_n_steps: 100
  max_epochs: 10

`model`

If task is set, ESPnet3 uses the ESPnet2 task-side model definition. This is the normal way to reuse ESPnet2-style model config blocks.

If you want a custom model, leave task unset and instantiate the model directly via Hydra in model.

Example with task:

task: espnet2.tasks.asr.ASRTask

model:
  frontend: default
  encoder: e_branchformer
  decoder: transformer
  normalize: global_mvn
  normalize_conf:
    stats_file: ${stats_dir}/train/feats_stats.npz

In this case, model is interpreted as the task-side model config. This is usually the copy-and-adapt path from an ESPnet2 recipe config.

Example without task:

task:

model:
  _target_: my_project.models.MyASRModel
  vocab_size: 5000
  hidden_size: 256

In this case, model._target_ is required because ESPnet3 instantiates the model directly through Hydra.
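
If the custom model takes submodules as constructor arguments, Hydra's instantiate builds nested _target_ entries recursively by default, so they can be declared inline. This is a sketch that assumes ESPnet3 relies on that standard Hydra behavior; my_project.models.MyEncoder and the encoder argument are hypothetical names.

```yaml
task:

model:
  _target_: my_project.models.MyASRModel
  vocab_size: 5000
  # Hypothetical submodule: built first, then passed to MyASRModel as `encoder`.
  encoder:
    _target_: my_project.models.MyEncoder
    hidden_size: 256
```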

`create_dataset`

create_dataset is the config block for the create_dataset stage.

The values in this block are forwarded to DatasetBuilder methods.

Settings

Key	Description
`recipe_dir`	Path to the recipe directory
`source_dir`	Path to the directory containing the code for the `create_dataset` stage

Example

create_dataset:
  recipe_dir: ${recipe_dir}
  source_dir: ${dataset_dir}

`dataset`

Dataset entries use DataOrganizer and dataset references.

Each dataset entry resolves its data_src in one of three ways:

dataset tag
explicit module path
omitted data_src, which falls back to ${recipe_dir}/dataset/__init__.py (the usual choice for local recipes)

Only data_src_args is passed to Dataset(...).

See Dataset references and builders for data_src details.

Example

dataset:
  recipe_dir: ${recipe_dir}
  train:
    - data_src_args:
        split: train
    - data_src: egs3.librispeech_100.asr.dataset
      data_src_args:
        split: train-clean-100
  valid:
    - data_src_args:
        split: valid

`dataloader`

Two common modes:

ESPnet iterator mode through iter_factory
plain PyTorch DataLoader mode with iter_factory: null

See Dataloader and Collate for iter_factory details, supported iterator factories, and full config examples.

Examples

Sequence Iterator:

dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory:
      _target_: espnet2.iterators.sequence_iter_factory.SequenceIterFactory
      shuffle: true
      collate_fn: ${dataloader.collate_fn}
      batches:
        type: numel
        shape_files:
          - ${stats_dir}/train/feats_shape
        batch_size: 4
        batch_bins: 4000000

The multiple_iterator mode from ESPnet2 is not supported in the current ESPnet3.

Standard DataLoader:

dataloader:
  collate_fn:
    _target_: espnet2.train.collate_fn.CommonCollateFn
    int_pad_value: -1
  train:
    iter_factory: null
    batch_size: 4
    num_workers: 2
    shuffle: true
  valid:
    iter_factory: null
    batch_size: 4
    num_workers: 2
    shuffle: false

`optimizer` / `scheduler`

scheduler_interval and scheduler_monitor work as follows:

Tag	Description
`scheduler_interval: step`	step the scheduler after optimizer updates
`scheduler_interval: epoch`	step the scheduler at epoch boundaries
`scheduler_monitor`	metric name used by monitored schedulers such as `ReduceLROnPlateau`

Notes:

step is the common choice for schedulers such as WarmupLR
epoch is used when the scheduler should react once per epoch
scheduler_monitor is only needed for schedulers that require a monitored value
use the same metric key that appears in logs, for example valid/loss
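
The notes above can be combined into a monitored, epoch-level setup. The following is a sketch rather than a TEMPLATE default: it swaps WarmupLR for PyTorch's ReduceLROnPlateau, and the factor and patience values are illustrative.

```yaml
optimizer:
  _target_: torch.optim.Adam
  lr: 0.001

scheduler:
  _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
  mode: min        # lower valid/loss is better
  factor: 0.5      # illustrative: halve the learning rate on a plateau
  patience: 2      # illustrative: wait two epochs without improvement

scheduler_interval: epoch        # react once per epoch
scheduler_monitor: valid/loss    # must match the metric key that appears in logs
```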

For the named multi-optimizer path, see Multiple optimizers and schedulers for the full behavior.

Default Values

Key	Default value
`optimizer._target_`	`torch.optim.Adam`
`optimizer.lr`	`0.002`
`scheduler._target_`	`espnet2.schedulers.warmup_lr.WarmupLR`
`scheduler.warmup_steps`	`15000`
`scheduler_interval`	`step`

Examples

Single Optimizer:

optimizer:
  _target_: torch.optim.Adam
  lr: 0.001

scheduler:
  _target_: espnet2.schedulers.warmup_lr.WarmupLR
  warmup_steps: 1000

scheduler_interval: step
scheduler_monitor:

Multiple Optimizers:

optimizers:
  generator:
    optimizer:
      _target_: torch.optim.Adam
      lr: 0.0002
    params: generator
    gradient_clip_val: 1.0
    gradient_clip_algorithm: norm

  discriminator:
    optimizer:
      _target_: torch.optim.Adam
      lr: 0.0002
    params: discriminator

schedulers:
  generator:
    scheduler:
      _target_: torch.optim.lr_scheduler.LinearLR
      total_iters: 1000
    interval: step

  discriminator:
    scheduler:
      _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
      patience: 2
    interval: epoch
    monitor: valid/discriminator/loss

`trainer`

trainer maps to Lightning trainer construction through ESPnet3LightningTrainer.

Example:

trainer:
  accelerator: auto
  devices: ${num_device}
  num_nodes: ${num_nodes}
  max_epochs: 10
  log_every_n_steps: 100

In multi-optimizer mode, trainer-level gradient clipping should not be used. See Multiple optimizers and schedulers for details.
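
For the single-optimizer case, gradient clipping can be set on the trainer itself. This sketch assumes the trainer block is passed through to Lightning's Trainer unchanged, where gradient_clip_val and gradient_clip_algorithm are standard arguments; the values are illustrative.

```yaml
trainer:
  accelerator: auto
  devices: ${num_device}
  num_nodes: ${num_nodes}
  max_epochs: 10
  # Single-optimizer mode only. In multi-optimizer mode, set these per
  # optimizer under `optimizers` instead, as in the generator example.
  gradient_clip_val: 1.0
  gradient_clip_algorithm: norm
```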

`parallel`

This section configures parallel execution.

Default Values

Key	Default value
`parallel.env`	`local`
`parallel.n_workers`	`1`

Examples

Minimal local example:

parallel:
  env: local
  n_workers: 1

Minimal SLURM example:

parallel:
  env: slurm
  n_workers: 8
  options:
    queue: gpu
    cores: 8
    processes: 1
    memory: 16GB
    walltime: "30:00"  # quoted so YAML 1.1 loaders do not read it as a base-60 integer
    job_extra_directives:
      - "--gres=gpu:1"

`fit`

training_config.fit is forwarded to trainer.fit(...).

This is where fit-time overrides belong.

Example

Resume from checkpoint:

fit:
  ckpt_path: ${exp_dir}/last.ckpt
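
Lightning's Trainer.fit also accepts the keyword last for ckpt_path, which asks Lightning to locate the most recent checkpoint itself. The sketch below assumes fit is forwarded without modification, as described above, and has not been checked against ESPnet3's checkpoint layout.

```yaml
fit:
  # Lightning keyword: resume from the most recent checkpoint it can find.
  ckpt_path: last
```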

Other Training Settings

Settings

Key	Description
`init`	Weight initialization strategy
`best_model_criterion`	Criteria used to compare model performance

Example

init: xavier_uniform

best_model_criterion:
  - - valid/loss
    - 10
    - min
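
Each entry in best_model_criterion is a three-item list, and several entries can be listed to track more than one metric. The sketch below assumes the items are the metric key, the number of checkpoints to keep, and the comparison mode; this page does not spell that out, and valid/acc is a hypothetical metric key.

```yaml
best_model_criterion:
  # [metric key, checkpoints to keep (assumed meaning), mode]
  - [valid/loss, 10, min]
  # Hypothetical second criterion: keep the 5 checkpoints with the highest accuracy.
  - [valid/acc, 5, max]
```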