ESPnet3: a modern major release

Pythonic, end-to-end speech workflows—from dataset creation to training, inference, evaluation, packaging, and demo generation.

Start with ESPnet3

Getting Started

Quick start for recipes and basic workflows.

Installation

Set up ESPnet3 and dependencies.

Config overview

How stage YAML configs are organized and used.

Stages

create_dataset

Download/build datasets for your recipe.

collect_stats

Compute feature shapes and global stats.
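As a rough illustration of what "feature shapes and global stats" can mean, the sketch below accumulates a frame count, per-dimension sums, and sums of squares over variable-length utterances, then derives a global mean and standard deviation. The function name `global_stats` and the nested-list input format are assumptions made for this example. This is not the ESPnet3 implementation.

```python
import math


def global_stats(utterances):
    """Return (shapes, mean, std) over all frames of all utterances.

    Hypothetical helper: each utterance is a list of frames, and each
    frame is a list of floats with the same feature dimension.
    """
    count = 0
    total = None      # per-dimension running sum
    total_sq = None   # per-dimension running sum of squares
    shapes = []       # (num_frames, feature_dim) for each utterance

    for frames in utterances:
        shapes.append((len(frames), len(frames[0]) if frames else 0))
        for frame in frames:
            if total is None:
                total = [0.0] * len(frame)
                total_sq = [0.0] * len(frame)
            for d, x in enumerate(frame):
                total[d] += x
                total_sq[d] += x * x
            count += 1

    if count == 0:
        raise ValueError("no frames to collect statistics from")

    mean = [s / count for s in total]
    # Clamp at zero to guard against tiny negative values from rounding.
    var = [max(sq / count - m * m, 0.0) for sq, m in zip(total_sq, mean)]
    std = [math.sqrt(v) for v in var]
    return shapes, mean, std
```

Keeping only running sums means the statistics can be computed in one pass without holding every feature matrix in memory.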

train

Run Lightning training with `train.yaml`.

infer

Write `.scp` outputs under `inference_dir`.
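To make the `.scp` outputs concrete, here is a minimal reader, assuming the Kaldi-style convention of one `key value` pair per line, with the key separated from the value by whitespace. The helper name `read_scp` and the file name in the usage note are hypothetical. The exact files written under `inference_dir` depend on the recipe.

```python
def read_scp(path):
    """Parse a Kaldi-style .scp file into a {key: value} dict.

    Assumes each non-empty line is "<key> <value>", where the value may
    itself contain spaces (for example, a decoded transcript).
    """
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            parts = line.split(maxsplit=1)
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
            entries[key] = value
    return entries
```

With a file such as `hyp.scp` (an illustrative name), `read_scp("hyp.scp")` returns a dictionary keyed by utterance ID, which is a convenient form for comparing hypotheses against references.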

metric

Compute metrics from inference outputs.

Publish-related

Pack and upload model artifacts (`pack_model` / `upload_model`).

Demo stages

Generate and upload a demo UI.

System-specific stages

Add your own stages in the System class.

Developer resources

Systems

`espnet3/systems`: stage orchestration and task entry points.

Components

`espnet3/components`: reusable data/training/model/metric blocks.

Parallel

`espnet3/parallel`: Provider/Runner execution stack.

Demo

`espnet3/demo`: packing, runtime, and UI wiring.

How to cite ESPnet

```bibtex
@inproceedings{watanabe18_interspeech,
  title     = {ESPnet: End-to-End Speech Processing Toolkit},
  author    = {Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  year      = {2018},
  booktitle = {Proc. Interspeech},
  pages     = {2207--2211},
  doi       = {10.21437/Interspeech.2018-1456},
  issn      = {2958-1796},
}
```

To cite individual modules, models, or recipes, please refer to Additional Citations.