ESPnet3 Installation
ESPnet3 uses the same Python package name as ESPnet2: espnet.
Quick install (pip)
pip install espnet
System extras (ASR/TTS/ST/ENH, etc.):
pip install "espnet[asr]"
pip install "espnet[tts]"
pip install "espnet[st]"
pip install "espnet[enh]"
Supported extras (from pyproject.toml):
| Extra | Packages | Description |
|---|---|---|
| asr | ctc-segmentation, editdistance, opt_einsum, jiwer | ASR alignment and scoring (e.g., WER). |
| tts | pyworld, pypinyin, espnet_tts_frontend, g2p_en, jamo, jaconv | TTS frontends, G2P, and language processing. |
| enh | ci_sdr, fast-bss-eval | Speech enhancement metrics. |
| asr2 | editdistance | ESPnet2-style ASR extras. |
| s2st | editdistance, s3prl | Speech-to-speech translation + SSL features. |
| st | editdistance | Speech translation scoring. |
| s2t | editdistance | Speech-to-text translation scoring. |
| spk | asteroid_filterbanks | Speaker tasks. |
| dev | black, flake8, pytest, pytest-cov, isort | Developer tooling (format/lint/test). |
| test | pytest, pytest-timeouts, pytest-pythonpath, pytest-cov, hacking, mock, pycodestyle, jsondiff, flake8, flake8-docstrings, black, isort, h5py | Test stack used by CI. |
| doc | sphinx, sphinx-rtd-theme, myst-parser, sphinx-argparse, sphinx-markdown-builder, sphinx-markdown-tables | Documentation build tools. |
| all | espnet[asr], espnet[tts], espnet[enh], espnet[spk], fairscale, transformers, evaluate | Convenience meta extra. |
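The table above is a snapshot of pyproject.toml and may drift between releases. As a rough sketch, the extras declared by the installed distribution can be read back from its package metadata with the standard library. The helper names `group_by_extra` and `installed_extras` are hypothetical and not part of ESPnet. The parsing assumes the usual `; extra == "name"` marker form in requirement strings.

```python
import re
from collections import defaultdict
from importlib import metadata

# Matches the extra name inside a marker such as: extra == "asr"
_EXTRA_RE = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def group_by_extra(requirements):
    """Group requirement strings by the extra named in their marker.

    Requirements without an `extra == ...` marker (core dependencies)
    are skipped.
    """
    grouped = defaultdict(list)
    for req in requirements:
        spec, _, marker = req.partition(";")
        match = _EXTRA_RE.search(marker)
        if match:
            grouped[match.group(1)].append(spec.strip())
    return dict(grouped)


def installed_extras(dist_name="espnet"):
    """Return {extra: [requirements]} for an installed distribution.

    Returns an empty dict if the distribution is not installed.
    """
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return {}
    return group_by_extra(reqs)


if __name__ == "__main__":
    for extra, pkgs in sorted(installed_extras().items()):
        print(f"{extra}: {', '.join(pkgs)}")
```

If espnet is not installed in the active environment, the script prints nothing.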
Using uv
uv is fast and works well for reproducible Python environments. It makes it easy to pin Python versions and manage isolated virtualenvs without system-wide installs.
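Once an environment has been created and activated with the commands in this section, a quick check from Python can confirm that the interpreter really is isolated from the system install. This is a minimal sketch that relies on the standard behavior that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment. The function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True if the interpreter appears to run inside a virtualenv.

    The arguments default to the running interpreter's values; they are
    exposed only so the comparison can be exercised with sample paths.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtualenv:", in_virtualenv())
```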
uv venv .venv
. .venv/bin/activate
uv pip install espnet
For extras:
uv pip install "espnet[asr]"
Using pixi
Pixi can manage Python and system dependencies together. This lets you install packages that previously required conda-forge entirely in user space, without system-level package managers.
pixi init
pixi add python=3.10 pip
pixi run pip install espnet
Install from source (recommended for development)
git clone https://github.com/espnet/espnet.git
cd espnet/tools
. setup_uv.sh
Then install the editable package with extras as needed:
cd ..
uv pip install -e ".[asr]"
Recipe tool installers (optional)
Some recipes rely on external tools (e.g., sph2pipe). If you need them, refer to the installer scripts under tools/installers/. After creating your env with uv or pixi, you can run the installer scripts from tools/:
cd tools
./installers/<installer>.sh
Available installer scripts:
| Installer | Description |
|---|---|
| BeamformIt | Beamforming tool for multi-channel speech enhancement. |
| cauchy_mult (state-spaces) | Cauchy multiplication kernels for state-space models (S4). |
| datasets | Hugging Face Datasets library. |
| DeepXi | Speech enhancement toolkit. |
| DiscreteSpeechMetrics | Metrics for discrete/unit-based speech models. |
| fairscale | Sharded training utilities for large models. |
| fairseq | Sequence modeling toolkit used by some recipes. |
| ffmpeg | Audio/video IO and conversion. |
| flash-attn | Fast attention CUDA kernels. |
| gss | Guided source separation. |
| gtn | Graph-based transducer networks library. |
| ice-g2p | Grapheme-to-phoneme for Icelandic. |
| k2 | FSA toolkit for ASR/decoding. |
| KenLM | N-gram language modeling toolkit. |
| PyTorch Lightning | Training framework used by ESPnet3. |
| Longformer | Long-context transformer model. |
| loralib | LoRA adapters for fine-tuning. |
| Montreal Forced Aligner | Forced alignment for speech/text. |
| ParallelWaveGAN, pytsmod, miditoolkit, music21 | Music/singing TTS toolchain dependencies. |
| mwerSegmenter | Segmenter for mWER scoring. |
| nkf | Japanese text encoding conversion. |
| OpenFace | Face analysis and landmarks. |
| ParallelWaveGAN | Neural vocoder. |
| PESQ | Speech quality metric (PESQ). |
| speech_tools, festival, espeak-ng, MBROLA | Phonemization backends. |
| py3mmseg | Japanese text segmentation. |
| pyopenjtalk | Japanese G2P / text frontend. |
| RawNet | Speaker verification model. |
| ReazonSpeech | ReazonSpeech dataset/tools. |
| s3prl | Self-supervised speech representations. |
| SCTK | Scoring toolkit (WER). |
| SimulEval | Simultaneous translation evaluation. |
| SpeechBrain | Speech toolkit (models/recipes). |
| sph2pipe | SPH to WAV conversion. |
| tdmelodic_openjtalk, pyopenjtalk | Japanese singing TTS frontends. |
| PyTorch | Core deep learning framework. |
| torch-optimizer | Extra optimizers for PyTorch. |
| torcheval | Metrics library for PyTorch. |
| transformers, soxr | Transformers models and audio resampling. |
| versa | VERSA toolkit (see repo for details). |
| vidaug | Video data augmentation. |
| visual deps | Visual/AV stack (OpenCV, ONNX, etc.). |
| warp-transducer | RNN-T CUDA extension. |
| espnet/whisper | ESPnet fork of Whisper for ASR. |
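Because the set of installers changes over time, it can help to list the scripts actually present in your checkout rather than relying on the table. The sketch below assumes only the `tools/installers/` layout and the `.sh` suffix described above. The function name `list_installers` is hypothetical.

```python
from pathlib import Path


def list_installers(tools_dir="tools"):
    """Return sorted file names of the *.sh scripts under <tools_dir>/installers."""
    installers_dir = Path(tools_dir) / "installers"
    if not installers_dir.is_dir():
        return []
    return sorted(p.name for p in installers_dir.glob("*.sh"))


if __name__ == "__main__":
    # Run from the repository root so that "tools" resolves correctly.
    for name in list_installers("tools"):
        print(name)
```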
Legacy conda setup
If you still rely on conda, the legacy setup script is available:
cd espnet/tools
. setup_anaconda.sh
Older guides may refer to this as setup-conda.sh. The workflow is the same, but setup_uv.sh is recommended for faster, modern installs.
CI-validated environments
The following environments are exercised in CI (.github/workflows/), which is the current "known to work" matrix. This is not an exhaustive compatibility guarantee, but a practical baseline.
| OS / runner | Python | PyTorch | Notes |
|---|---|---|---|
| Ubuntu (ubuntu-latest) | 3.10, 3.12 | 2.5.1, 2.7.1, 2.8.0, 2.9.1 | ESPnet2/3 unit + integration tests |
| Debian 12 (container) | 3.10 | 2.7.1 | ESPnet2/3 tests in a Debian container |
| macOS (macOS-latest) | 3.10 | 2.7.1 | Install check with and without conda |
| Windows (Windows-latest) | 3.10 | 2.7.1 | Install check |
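To see how a local environment compares with this baseline, the Python and PyTorch versions can be checked against the values in the Ubuntu row. The sketch below hard-codes those values from the table, so it will go stale when the CI matrix changes. It also treats the row as a set of versions, and CI may not test every Python/PyTorch pairing. The name `in_ci_matrix` is hypothetical.

```python
import platform
from importlib import metadata

# Versions copied from the Ubuntu row of the table above (assumed current).
CI_PYTHON = {"3.10", "3.12"}
CI_TORCH = {"2.5.1", "2.7.1", "2.8.0", "2.9.1"}


def in_ci_matrix(python_version, torch_version):
    """Return True if both versions appear in the Ubuntu CI row.

    The Python version is reduced to major.minor, and any local build
    suffix on the PyTorch version (e.g. "+cu126") is ignored.
    """
    py_minor = ".".join(python_version.split(".")[:2])
    torch_base = torch_version.split("+")[0]
    return py_minor in CI_PYTHON and torch_base in CI_TORCH


if __name__ == "__main__":
    py = platform.python_version()
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None

    if torch_version is None:
        print(f"Python {py}; PyTorch is not installed")
    else:
        status = "listed in" if in_ci_matrix(py, torch_version) else "outside"
        print(f"Python {py}, PyTorch {torch_version}: {status} the CI baseline")
```

An environment outside the baseline may still work. It has just not been exercised by the workflows described here.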
