Logging and debug
Logging and debug
If you come from plain PyTorch, ESPnet3 already logs more than a small custom training script usually does.
The main idea is:
- stage logs are written automatically
- configs are dumped into logs
- trainer/logger/profiler are configured from YAML
- Lightning features such as W&B and
fast_dev_runare available
First place to look
When something is wrong, start with the stage log under the relevant output directory.
Typical places are:
exp/.../train.logexp/.../inference/.../infer.logexp/.../inference/.../measure.log
ESPnet3 switches the stage log handler automatically when stages run.
System and Stages
See how stage execution selects log locations.
Train Stage
See where training logs and config inputs come from.
Inference Stage
See how inference writes per-test-set outputs and logs.
Logging utilities
ESPnet3 logging is built around a small utility layer in espnet3.utils.logging_utils.
The usual entrypoints are:
| Utility | Role |
|---|---|
configure_logging() | configure console logging and optional run log file |
set_stage_log_handler() | switch the active stage log file |
log_stage() | add the current stage label to log records |
log_stage_metadata() | log command, config paths, environment, and resolved config |
log_run_metadata() | log command, Python, git, and optional requirements metadata |
log_env_metadata() | log CUDA, distributed, runtime, and scheduler environment |
For most recipes, you do not call these manually. run_stages() calls them before each stage runs.
Minimal manual setup looks like this:
import logging
from pathlib import Path
from espnet3.utils.logging_utils import configure_logging, log_run_metadata
logger = configure_logging(log_dir=Path("exp/my_run"), level=logging.INFO)
log_run_metadata(logger, write_requirements=True)
logger.info("custom setup done")That creates console output and exp/my_run/run.log.
configure_logging
See the root logger, console handler, run.log, and warning capture setup.
log_run_metadata
See command, Python, git, config path, and requirements logging.
log_env_metadata
See CUDA, distributed, runtime, and scheduler environment logging.
Automatic log rotation
ESPnet3 rotates logs when a target file already exists. This is startup rotation, not size-based rotation.
For example, if run.log already exists:
exp/my_run/
run.logthen the next configure_logging(log_dir=Path("exp/my_run")) call moves the old file and creates a fresh current log:
exp/my_run/
run.log # current run
run1.log # previous runIf run1.log already exists, the old file becomes run2.log, and so on. The same rule is used for stage logs such as train.log and infer.log.
Stage logging uses one active stage file handler at a time. When the stage changes, set_stage_log_handler() removes the previous stage handler, rotates the next target if needed, and attaches the new handler.
exp/asr_train/
train.log
train1.log
exp/asr_train/inference/test/
infer.log
infer1.logFor distributed training, the train stage defaults to rank0 file logging. Non-zero ranks skip the stage file handler to avoid multiple processes rotating the same file.
Set stage_log_mode: per_rank in the training config when you need one file per rank:
stage_log_mode: per_rankThat writes files such as:
exp/asr_train/
train_rank0.log
train_rank1.logset_stage_log_handler
See how stage log handlers are replaced and rotated.
log_stage
See how the stage label is attached to each log record.
log_stage_metadata
See what metadata is written at the start of each stage.
What ESPnet3 logs automatically
Run metadata
ESPnet3 logs:
- start timestamp
- full command line
- Python executable
- Python version
- working directory
- config file paths
- git commit / branch / dirty state
If requested, it also writes:
requirements.txt
next to the log file.
2. Environment metadata
ESPnet3 also logs environment information such as:
CUDA_VISIBLE_DEVICESRANK,LOCAL_RANK,WORLD_SIZEMASTER_ADDR,MASTER_PORTPATH,PYTHONPATH,LD_LIBRARY_PATH- cluster scheduler variables such as
SLURM_*,PBS_*,LSF_* - runtime variables such as
NCCL_*,CUDA_*,OMP_*,MKL_*
This is very useful when debugging multi-node or cluster issues.
3. Resolved config contents
Per-stage metadata logging also dumps resolved config YAML into the log.
That includes, when present:
training_configinference_configmetrics_config
So if you want to know what config actually ran, the log already contains it.
4. Component structure
ESPnet3 logs several instantiated components explicitly.
Examples include:
- dataloader details
- callback objects
- provider env objects
- metric objects
- model-side components in some paths
This is useful when checking whether the config instantiated the object you expected.
5. Training metrics
The default training callbacks log:
- batch-level summaries every
log_every_n_steps - end-of-epoch train summaries
- end-of-epoch validation summaries
- learning rates
The summary lines include timings such as:
iter_timeforward_timebackward_timeoptim_step_timetrain_timevalid_time
So you already get some lightweight performance visibility without adding your own callback.
Default training callbacks
ESPnet3 adds these by default:
- last checkpoint saving
- best-k checkpoint saving
- averaged best-k checkpoint saving
- learning-rate monitor
- metrics logger
- TQDM progress bar
So even a small recipe already has a more opinionated logging setup than a minimal raw PyTorch loop.
Logger configuration
Training logger configuration lives under trainer.logger.
For example, TensorBoard:
trainer:
logger:
- _target_: lightning.pytorch.loggers.TensorBoardLogger
save_dir: ${exp_dir}/tensorboard
name: tb_loggerBecause ESPnet3 uses Lightning here, you can also use other Lightning loggers.
For example, W&B:
trainer:
logger:
- _target_: lightning.pytorch.loggers.WandbLogger
project: my-espnet3-project
name: ${exp_tag}
save_dir: ${exp_dir}Other common options include:
CSVLoggerTensorBoardLoggerWandbLogger
So if you already know Lightning logger config, it transfers directly.
Training Config
See trainer.logger, profiler, and fast debug settings in YAML.
Trainer
See the trainer wrapper and its Lightning integration points.
Callbacks
See default checkpoint, LR, progress, and metrics callbacks.
Profiler configuration
Profiler settings live under trainer.profiler.
Because this is passed to Lightning, you can use Lightning profiler configs directly.
Example:
trainer:
profiler:
_target_: lightning.pytorch.profilers.SimpleProfiler
dirpath: ${exp_dir}
filename: profiler.txtOr:
trainer:
profiler:
_target_: lightning.pytorch.profilers.AdvancedProfiler
dirpath: ${exp_dir}
filename: profiler.txtUse this when:
- training is unexpectedly slow
- validation is much slower than expected
- you want a first pass before deeper PyTorch profiling
Fast debug knobs
If you just want to see whether the whole training path runs once, Lightning already gives you useful debug flags through trainer.
fast_dev_run
This is the fastest sanity-check setting.
trainer:
fast_dev_run: trueThat runs a tiny end-to-end debug pass so you can confirm:
- dataset works
- collate works
- model forward works
- validation step works
- logging setup works
This is one of the best first checks when wiring a new recipe.
Other useful trainer-side debug settings
Examples:
trainer:
limit_train_batches: 2
limit_val_batches: 2
max_epochs: 1These are often easier than editing the dataset itself for quick checks.
Good debugging order
When something breaks, use this order:
- run with
fast_dev_run: true - inspect the stage log
- confirm the resolved config dump
- confirm the instantiated dataloader / callbacks / logger
- only then add profiler settings if the issue is performance
What to check for performance issues
For performance debugging, the built-in summary logs already tell you whether time is going into:
- iteration
- forward
- backward
- optimizer step
- validation
If that is not enough, add a profiler under trainer.profiler.
For cluster or multi-GPU issues, also inspect the logged environment variables.
Multi-GPU
See how trainer parallelism differs from provider/runner parallelism.
Parallel Config
See cluster and worker settings that show up in logs.
Dataloader
See sampler and iterator behavior when performance issues start in loading.
Why this feels different from plain PyTorch
In plain PyTorch, you often write:
- your own logger setup
- your own metric summaries
- your own profiler plumbing
- your own W&B integration
In ESPnet3, much of that is already structured through:
- stage logging
- Lightning trainer config
- default callbacks
- config-driven logger/profiler objects
So the main job is usually to configure the right knobs, not to build logging from scratch.
Related pages
Training loop
See how to customize LightningModule, trainer, or multi-stage training.
Training Config
See where `trainer`, `logger`, and `profiler` are configured.
Trainer
Read the detailed ESPnet3 Lightning trainer wrapper behavior.
Parallel Config
See how environment and worker settings are logged for cluster runs.
Provider / Runner
Read the execution layer used by runner-based inference and helper workloads.
