ESPnet3 Model Configuration (Training)

Masao Someki

This page explains how the `model` and `task` keys in `train.yaml` determine model construction for the `train` and `collect_stats` stages.

Two modes: `task` (ESPnet2) vs `model._target_` (custom)

Use ESPnet2-style models (`task`)

To reuse an ESPnet2-derived model stack, set `task` and write an ESPnet2-style `model:` block.

task: espnet3.systems.asr.task.ASRTask
model:
  encoder: transformer
  decoder: transformer
  # ...ESPnet2-style config...

Tip: You can start from existing ESPnet2 configs under `egs2/*/*/conf/*.yaml`. See the ESPnet2 task reference for task names and links to the corresponding recipe docs.

Typical ASR `model:` keys in ESPnet2 configs:

Key	Purpose
`encoder` / `encoder_conf`	Encoder type and settings.
`decoder` / `decoder_conf`	Decoder type and settings.
`model` / `model_conf`	ASR model head and loss settings (CTC/attention, etc.).
`frontend` / `frontend_conf`	Feature extraction (e.g., STFT/FBANK).
`specaug` / `specaug_conf`	SpecAugment settings.
`normalize` / `normalize_conf`	Feature normalization (e.g., global MVN).
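
The pairing in the table (a component name plus a matching `<key>_conf` dict) can be sketched in Python. This is only an illustration of the naming pattern: `component_choices` is a hypothetical helper, not an ESPnet function, and the values in `config` are placeholders rather than recommended settings.

```python
# Illustrative ESPnet2-style ASR settings expressed as a Python dict.
# The concrete values are placeholders, not recommended settings.
config = {
    "encoder": "transformer",
    "encoder_conf": {"output_size": 256, "num_blocks": 12},
    "decoder": "transformer",
    "decoder_conf": {"num_blocks": 6},
    "specaug": "specaug",
    # "specaug_conf" is omitted on purpose: the helper falls back to {}.
}

COMPONENTS = ("frontend", "specaug", "normalize", "encoder", "decoder", "model")


def component_choices(cfg):
    """Pair each selected component name with its `<key>_conf` dict (hypothetical helper)."""
    choices = {}
    for key in COMPONENTS:
        if key in cfg:
            choices[key] = (cfg[key], cfg.get(f"{key}_conf", {}))
    return choices


choices = component_choices(config)
for key, (name, conf) in choices.items():
    print(f"{key}: {name} {conf}")
```

Components that are absent from the config (here `frontend`, `normalize`, and `model`) are skipped in this sketch; how ESPnet itself fills in defaults is not covered here.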

ESPnet2 task reference

Below is a quick reference to ESPnet2 task names and their recipe docs.

Task	Description
`asr1`	Automatic Speech Recognition (Multi-tasking)
`asr2`	Automatic Speech Recognition with Discrete Units
`asvspoof1`	Speaker Verification Spoofing and Countermeasures
`cls1`	Classification
`codec1`	Speech Codec
`diar1`	Speaker Diarisation
`enh1`	Speech Enhancement
`enh_asr1`	Speech Recognition with Speech Enhancement
`enh_diar1`	Speaker Diarisation with Speech Enhancement
`enh_st1`	Speech-to-Text Translation with Speech Enhancement
`hubert1`	Self-supervised Learning
`lid1`	Language Identification
`lm1`	Language Modeling
`mt1`	Machine Translation
`s2st1`	Speech-to-Speech Translation
`s2t1`	Weakly-supervised Learning (Speech-to-Text)
`sds1`	ESPnet-SDS
`slu1`	Spoken Language Understanding
`speechlm1`	Speech Language Model
`spk1`	Speaker Representation
`ssl1`	Self-supervised Learning
`st1`	Speech-to-Text Translation
`svs1`	Singing Voice Synthesis
`svs2`	ESPnet2 SVS2 Recipe TEMPLATE
`tts1`	Text-to-Speech
`tts2`	Text-to-Speech with Discrete Units
`uasr1`	Unsupervised Automatic Speech Recognition

Use custom/ESPnet3-only models (`model._target_`)

For an ESPnet3-specific or fully custom model, implement it under your recipe's `src/` directory and point `model._target_` to its import path. The remaining keys under `model:` are the custom arguments for that class:

model:
  _target_: src.my_model.MyModel
  # custom args here
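
The `_target_` key resembles the Hydra-style convention in which a dotted import path names the class to build and the other keys become constructor arguments. The sketch below imitates that idea with the standard library only. `instantiate` here is a hypothetical stand-in written for this example, and ESPnet3's actual builder may behave differently.

```python
import importlib


def instantiate(cfg):
    """Build an object from a dict with a `_target_` dotted path (hypothetical helper)."""
    kwargs = dict(cfg)  # copy so the caller's config is not mutated
    module_path, _, class_name = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# Stand-in target from the standard library; in a recipe this would be
# something like "src.my_model.MyModel" plus that class's own arguments.
model_cfg = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate(model_cfg)
print(obj)
```

Under this reading, `src.my_model.MyModel` must be importable from the directory where training is launched, and every non-`_target_` key must match a parameter of the class's constructor.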

Training-time forward contract (common pattern)

For ASR-style training, the training wrapper typically expects your model to accept batch fields such as:

speech, speech_lengths, text, text_lengths

and return a tuple:

loss: scalar tensor
stats: dict of scalars (logging only)
weight: scalar tensor used as batch size for logging

Example:

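import torch


class MyCustomModel(torch.nn.Module):
    def forward(self, speech, speech_lengths, text, text_lengths, **kwargs):
        loss = ...  # placeholder: compute a scalar loss tensor here
        stats = {"loss": loss.detach()}  # detached values, used for logging only
        weight = speech.new_tensor(speech.shape[0])  # batch size as a tensor
        return loss, stats, weight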
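
One plausible reason for returning `weight` alongside `stats` is that logged values can then be averaged across batches of different sizes. The sketch below shows that aggregation with plain floats. `weighted_average` is a hypothetical helper for illustration and is not taken from the ESPnet3 training wrapper.

```python
def weighted_average(batch_outputs):
    """Average each stat over batches, weighting by the reported batch size."""
    totals, total_weight = {}, 0.0
    for stats, weight in batch_outputs:
        total_weight += weight
        for name, value in stats.items():
            totals[name] = totals.get(name, 0.0) + value * weight
    return {name: total / total_weight for name, total in totals.items()}


# Two hypothetical batches: (stats, weight) with plain floats for simplicity.
outputs = [({"loss": 2.0}, 8), ({"loss": 1.0}, 24)]
epoch_stats = weighted_average(outputs)
print(epoch_stats)
```

With these numbers the larger batch dominates, so the aggregate loss sits closer to 1.0 than a plain mean of the two batch losses would.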

Collect-stats support (`collect_feats`)

To use the `collect_stats` stage, your model must implement `collect_feats()`. See:

Stage doc: doc/vuepress/src/espnet3/stages/collect-stats.md
Config doc: doc/vuepress/src/espnet3/config/train_config.md
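
As a rough illustration, ESPnet2 ASR models implement `collect_feats()` by returning a dict with `feats` and `feats_lengths`. The sketch below assumes ESPnet3 follows the same convention, which should be checked against the stage doc above. Passing the raw waveform through unchanged is a simplification made for this example only.

```python
import torch


class MyCustomModel(torch.nn.Module):
    def collect_feats(self, speech, speech_lengths, text, text_lengths, **kwargs):
        # Assumption: raw waveforms are used directly as features. A real model
        # would run its frontend (e.g., FBANK extraction) here instead.
        feats, feats_lengths = speech, speech_lengths
        return {"feats": feats, "feats_lengths": feats_lengths}


model = MyCustomModel()
speech = torch.randn(2, 16000)  # batch of 2 dummy waveforms
speech_lengths = torch.tensor([16000, 12000])
text = torch.zeros(2, 5, dtype=torch.long)  # dummy token IDs
text_lengths = torch.tensor([5, 3])
out = model.collect_feats(speech, speech_lengths, text, text_lengths)
print({k: tuple(v.shape) for k, v in out.items()})
```

The method takes the same batch fields as `forward()` in this sketch, so the same data loader can feed both stages.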
