# ESPnet3 Model Configuration (Training)
This page explains how `model` and `task` in `train.yaml` map to model construction for the `train` / `collect_stats` stages.

## Two modes: `task` (ESPnet2) vs `model._target_` (custom)

### Use ESPnet2-style models (`task`)

If you want to reuse an ESPnet2-derived model stack, set `task` and use an ESPnet2-style `model:` block.
```yaml
task: espnet3.systems.asr.task.ASRTask
model:
  encoder: transformer
  decoder: transformer
  # ...ESPnet2-style config...
```

Tip: you can start from existing ESPnet2 configs under `egs2/*/*/conf/*.yaml`. See the ESPnet2 task reference for task names and links to the corresponding recipe docs.
Typical ASR model keys in ESPnet2 configs:
| Key | Purpose |
|---|---|
| `encoder` / `encoder_conf` | Encoder type and settings. |
| `decoder` / `decoder_conf` | Decoder type and settings. |
| `model` / `model_conf` | ASR model head and loss settings (CTC/attention, etc.). |
| `frontend` / `frontend_conf` | Feature extraction (e.g., STFT/FBANK). |
| `specaug` / `specaug_conf` | SpecAugment settings. |
| `normalize` / `normalize_conf` | Feature normalization (e.g., global MVN). |
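For orientation, here is a hedged sketch of how these keys typically fit together in the `model:` block; the specific types and `*_conf` values below are illustrative, so check an existing recipe config (e.g. under `egs2/*/*/conf/`) for the options your task actually supports.

```yaml
model:
  frontend: default          # STFT + FBANK feature extraction
  frontend_conf:
    n_fft: 512
    hop_length: 160
  specaug: specaug           # SpecAugment during training
  specaug_conf:
    apply_time_warp: true
  normalize: global_mvn      # global mean/variance normalization
  encoder: transformer
  encoder_conf:
    output_size: 256
    num_blocks: 12
  decoder: transformer
  decoder_conf:
    num_blocks: 6
  model_conf:
    ctc_weight: 0.3          # CTC/attention loss interpolation
```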
### ESPnet2 task reference
Below is a quick reference to ESPnet2 task names and their recipe docs.
| Task | Description |
|---|---|
| asr1 | Automatic Speech Recognition (Multi-tasking) |
| asr2 | Automatic Speech Recognition with Discrete Units |
| asvspoof1 | Speaker Verification Spoofing and Countermeasures |
| cls1 | Classification |
| codec1 | Speech Codec |
| diar1 | Speaker Diarisation |
| enh1 | Speech Enhancement |
| enh_asr1 | Speech Recognition with Speech Enhancement |
| enh_diar1 | Speaker Diarisation with Speech Enhancement |
| enh_st1 | Speech-to-Text Translation with Speech Enhancement |
| hubert1 | Self-supervised Learning |
| lid1 | Language Identification |
| lm1 | Language Modeling |
| mt1 | Machine Translation |
| s2st1 | Speech-to-Speech Translation |
| s2t1 | Weakly-supervised Learning (Speech-to-Text) |
| sds1 | ESPnet-SDS |
| slu1 | Spoken Language Understanding |
| speechlm1 | Speech Language Model |
| spk1 | Speaker Representation |
| ssl1 | Self-supervised Learning |
| st1 | Speech-to-Text Translation |
| svs1 | Singing Voice Synthesis |
| svs2 | ESPnet2 SVS2 Recipe TEMPLATE |
| tts1 | Text-to-Speech |
| tts2 | Text-to-Speech with Discrete Units |
| uasr1 | Unsupervised Automatic Speech Recognition |
### Use custom/ESPnet3-only models (`model._target_`)

If you want an ESPnet3-specific or fully custom model, implement it under your recipe's `src/` directory and point `model._target_` to it:
```yaml
model:
  _target_: src.my_model.MyModel
  # custom args here
```
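For orientation, keys under `model:` other than `_target_` are typically passed to the model's constructor. The file path, class name, and argument names below are hypothetical, assuming Hydra-style `_target_` instantiation; treat this as a sketch rather than the exact ESPnet3 mechanism.

```python
# src/my_model.py -- hypothetical module; the dotted path in model._target_
# must resolve to this class.
import torch


class MyModel(torch.nn.Module):
    # With Hydra-style instantiation, a key such as `hidden_size: 256` under
    # model: in train.yaml would arrive here as hidden_size=256
    # (the argument names are placeholders).
    def __init__(self, hidden_size: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, vocab_size)
```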
#### Training-time forward contract (common pattern)

For ASR-style training, the training wrapper typically expects your model to accept batch fields such as:
`speech`, `speech_lengths`, `text`, `text_lengths`
and return a tuple of:

- `loss`: scalar tensor
- `stats`: dict of scalars (logging only)
- `weight`: scalar tensor used as the batch size for logging
Example:

```python
import torch


class MyCustomModel(torch.nn.Module):
    def forward(self, speech, speech_lengths, text, text_lengths, **kwargs):
        loss = ...  # compute the training loss (scalar tensor)
        stats = {"loss": loss.detach()}
        weight = speech.new_tensor(speech.shape[0])
        return loss, stats, weight
```

#### Collect-stats support (`collect_feats`)
If you want to use `collect_stats`, your model should implement `collect_feats()` (a minimal sketch follows the links below). See:

- Stage doc: `doc/vuepress/src/espnet3/stages/collect-stats.md`
- Config doc: `doc/vuepress/src/espnet3/config/train_config.md`
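To make that concrete, here is a minimal sketch, assuming the ESPnet2-style convention in which `collect_feats()` returns a dict of tensors (e.g. `feats` / `feats_lengths`) whose statistics the `collect_stats` stage aggregates; verify the exact expected keys against the stage doc above.

```python
from typing import Dict

import torch


class MyCustomModel(torch.nn.Module):
    ...

    def collect_feats(
        self, speech, speech_lengths, text, text_lengths, **kwargs
    ) -> Dict[str, torch.Tensor]:
        # Assumption: the raw waveform is treated as the "feature" here.
        # If your model has a frontend, return its output and lengths instead.
        return {"feats": speech, "feats_lengths": speech_lengths}
```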
