ESPnet3 Installation

Masao Someki

ESPnet3 uses the same Python package name as ESPnet2: espnet.

Quick install (pip)

pip install espnet
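
To confirm that the install landed in the environment you expect, one option is to query the installed distribution metadata from Python. The following is a minimal sketch, not part of ESPnet itself: the helper name `installed_version` is hypothetical, and it relies only on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("espnet")
    if version is None:
        print("espnet is not installed in this environment")
    else:
        print(f"espnet {version}")
```

Because ESPnet2 and ESPnet3 share the package name `espnet`, the reported version is the simplest way to tell which release is active.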

Extras for individual systems (ASR, TTS, ST, ENH, and so on) are installed with the bracket syntax:

pip install "espnet[asr]"
pip install "espnet[tts]"
pip install "espnet[st]"
pip install "espnet[enh]"

Supported extras (from pyproject.toml):

Extra	Packages	Description
`asr`	`ctc-segmentation`, `editdistance`, `opt_einsum`, `jiwer`	ASR alignment and scoring (e.g., WER).
`tts`	`pyworld`, `pypinyin`, `espnet_tts_frontend`, `g2p_en`, `jamo`, `jaconv`	TTS frontends, G2P, and language processing.
`enh`	`ci_sdr`, `fast-bss-eval`	Speech enhancement metrics.
`asr2`	`editdistance`	ESPnet2-style ASR extras.
`s2st`	`editdistance`, `s3prl`	Speech-to-speech translation + SSL features.
`st`	`editdistance`	Speech translation scoring.
`s2t`	`editdistance`	Speech-to-text (multitask ASR/ST) scoring.
`spk`	`asteroid_filterbanks`	Speaker tasks.
`dev`	`black`, `flake8`, `pytest`, `pytest-cov`, `isort`	Developer tooling (format/lint/test).
`test`	`pytest`, `pytest-timeouts`, `pytest-pythonpath`, `pytest-cov`, `hacking`, `mock`, `pycodestyle`, `jsondiff`, `flake8`, `flake8-docstrings`, `black`, `isort`, `h5py`	Test stack used by CI.
`doc`	`sphinx`, `sphinx-rtd-theme`, `myst-parser`, `sphinx-argparse`, `sphinx-markdown-builder`, `sphinx-markdown-tables`	Documentation build tools.
`all`	`espnet[asr]`, `espnet[tts]`, `espnet[enh]`, `espnet[spk]`, `fairscale`, `transformers`, `evaluate`	Convenience meta extra.
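
After installing an extra, it can be useful to check that its packages are actually present. The sketch below copies three rows from the table above into a dictionary (an assumption: the table may drift from pyproject.toml, which remains the source of truth) and reports which distributions are missing. The names `EXTRAS`, `is_installed`, and `missing_packages` are hypothetical helpers, not ESPnet APIs.

```python
from importlib import metadata

# Copied from the extras table above; pyproject.toml is authoritative.
EXTRAS = {
    "asr": ["ctc-segmentation", "editdistance", "opt_einsum", "jiwer"],
    "enh": ["ci_sdr", "fast-bss-eval"],
    "spk": ["asteroid_filterbanks"],
}


def is_installed(dist_name):
    """Return True if the distribution has metadata in this environment."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


def missing_packages(extra, check=is_installed):
    """List the packages of `extra` for which `check` returns False."""
    return [name for name in EXTRAS[extra] if not check(name)]


if __name__ == "__main__":
    for extra in EXTRAS:
        missing = missing_packages(extra)
        print(f"{extra}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

The `check` parameter exists only so the lookup can be swapped out, for example in a test.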

Using uv

uv is fast and works well for reproducible Python environments. It makes it easy to pin Python versions and manage isolated virtualenvs without system-wide installs.

uv venv .venv
. .venv/bin/activate
uv pip install espnet

For extras:

uv pip install "espnet[asr]"
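
If you are unsure whether the `.venv` environment is active, a quick check from Python is to compare `sys.prefix` with `sys.base_prefix`, which differ inside a virtual environment. This is a generic standard-library sketch rather than anything specific to uv; `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```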

Using pixi

Pixi can manage Python and system dependencies together. This lets you install packages that previously required a conda setup (pixi resolves them from conda-forge) entirely in user space, without system-level package managers.

pixi init
pixi add python=3.10 pip
pixi run pip install espnet

Install from source (recommended for development)

git clone https://github.com/espnet/espnet.git
cd espnet/tools
. setup_uv.sh

Then install the editable package with extras as needed:

cd ..
uv pip install -e ".[asr]"
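
With an editable install, the `espnet` package should usually resolve to a file inside your clone rather than to site-packages. One way to check, sketched here with the standard library only, is to look up the module spec without importing the package. The helper name `package_location` is hypothetical.

```python
import importlib.util


def package_location(name):
    """Return the file a top-level package would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    location = package_location("espnet")
    print(location if location else "espnet is not importable in this environment")
```

If the printed path does not point into your checkout, another copy of `espnet` is probably shadowing the editable one.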

Recipe tool installers (optional)

Some recipes rely on external tools (e.g., sph2pipe). If you need them, use the installer scripts under tools/installers/. After creating your environment with uv or pixi, run the relevant script from tools/:

cd tools
./installers/<installer>.sh
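
To see which installer scripts your checkout actually ships, you can list the directory. The sketch below assumes only that the scripts live in tools/installers/ and end in `.sh`, as the command above suggests; `list_installers` is a hypothetical helper, and it returns an empty list when the directory is absent.

```python
from pathlib import Path


def list_installers(tools_dir):
    """Return the sorted names of the *.sh files under <tools_dir>/installers."""
    installers_dir = Path(tools_dir) / "installers"
    if not installers_dir.is_dir():
        return []
    return sorted(path.name for path in installers_dir.glob("*.sh"))


if __name__ == "__main__":
    # Run from the repository root so that "tools" resolves to espnet/tools.
    for name in list_installers("tools"):
        print(name)
```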

Available installer scripts:

Installer	Description
BeamformIt	Beamforming tool for multi-channel speech enhancement.
cauchy_mult (state-spaces)	Cauchy multiplication kernels for state-space models (S4).
datasets	Hugging Face Datasets library.
DeepXi	Speech enhancement toolkit.
DiscreteSpeechMetrics	Metrics for discrete/unit-based speech models.
fairscale	Sharded training utilities for large models.
fairseq	Sequence modeling toolkit used by some recipes.
ffmpeg	Audio/video IO and conversion.
flash-attn	Fast attention CUDA kernels.
gss	Guided source separation.
gtn	Graph Transformer Networks (GTN) library for automatic differentiation with WFSTs.
ice-g2p	Grapheme-to-phoneme for Icelandic.
k2	FSA toolkit for ASR/decoding.
KenLM	N-gram language modeling toolkit.
PyTorch Lightning	Training framework used by ESPnet3.
Longformer	Long-context transformer model.
loralib	LoRA adapters for fine-tuning.
Montreal Forced Aligner	Forced alignment for speech/text.
ParallelWaveGAN, pytsmod, miditoolkit, music21	Music/singing TTS toolchain dependencies.
mwerSegmenter	Segmenter for mWER scoring.
nkf	Japanese text encoding conversion.
OpenFace	Face analysis and landmarks.
ParallelWaveGAN	Neural vocoder.
PESQ	Speech quality metric (PESQ).
speech_tools, festival, espeak-ng, MBROLA	Phonemization backends.
py3mmseg	Chinese word segmentation (MMSEG).
pyopenjtalk	Japanese G2P / text frontend.
RawNet	Speaker verification model.
ReazonSpeech	ReazonSpeech dataset/tools.
s3prl	Self-supervised speech representations.
SCTK	Scoring toolkit (WER).
SimulEval	Simultaneous translation evaluation.
SpeechBrain	Speech toolkit (models/recipes).
sph2pipe	SPH to WAV conversion.
tdmelodic_openjtalk, pyopenjtalk	Japanese singing TTS frontends.
PyTorch	Core deep learning framework.
torch-optimizer	Extra optimizers for PyTorch.
torcheval	Metrics library for PyTorch.
transformers, soxr	Transformers models and audio resampling.
versa	VERSA (Versatile Evaluation of Speech and Audio) toolkit; see the repo for details.
vidaug	Video data augmentation.
visual deps	Visual/AV stack (OpenCV, ONNX, etc.).
warp-transducer	RNN-T CUDA extension.
espnet/whisper	ESPnet fork of Whisper for ASR.
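
Before running a recipe, you may want to check whether the command-line tools it needs are reachable. The sketch below uses `shutil.which` for two tools named in the table. It makes an assumption worth stating: an installer may place binaries under tools/ without adding them to `PATH`, so a missing entry here does not necessarily mean the installer failed. `find_tools` is a hypothetical helper name.

```python
import shutil


def find_tools(names):
    """Map each tool name to its resolved path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools(["sph2pipe", "ffmpeg"]).items():
        print(f"{tool}: {path or 'not found on PATH'}")
```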

Legacy conda setup

If you still rely on conda, the legacy setup script is available:

cd espnet/tools
. setup_anaconda.sh

Older guides may refer to this as setup-conda.sh. The workflow is the same, but setup_uv.sh is recommended for faster, modern installs.

CI-validated environments

The following environments are exercised in CI (.github/workflows/) and form the current "known to work" matrix. This is a practical baseline, not an exhaustive compatibility guarantee.

OS / runner	Python	PyTorch	Notes
Ubuntu (ubuntu-latest)	3.10, 3.12	2.5.1, 2.7.1, 2.8.0, 2.9.1	ESPnet2/3 unit + integration tests
Debian 12 (container)	3.10	2.7.1	ESPnet2/3 tests in a Debian container
macOS (macos-latest)	3.10	2.7.1	Install check with and without conda
Windows (windows-latest)	3.10	2.7.1	Install check
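
To compare your own environment against the Ubuntu row of this table, a small check like the following can help. It is a sketch under assumptions: the version sets are copied from the table, and they are treated as independent, which may be looser than the real CI matrix if not every Python and PyTorch pairing is tested. `is_ci_tested` is a hypothetical helper name.

```python
import sys

# Copied from the Ubuntu row above; the workflows under .github/workflows/ are authoritative.
CI_PYTHON = {(3, 10), (3, 12)}
CI_TORCH = {"2.5.1", "2.7.1", "2.8.0", "2.9.1"}


def is_ci_tested(python_version, torch_version):
    """Check a (major, minor, ...) Python version and a PyTorch version string."""
    # Drop local build suffixes such as "+cu126" or "+cpu" before comparing.
    base_torch = str(torch_version).split("+", 1)[0]
    return tuple(python_version[:2]) in CI_PYTHON and base_torch in CI_TORCH


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("matches CI matrix:", is_ci_tested(sys.version_info, torch.__version__))
```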