ESPnet3 Collect Stats Stage
collect_stats computes shape files and feature statistics used by later training steps.
The stage uses training.yaml, not a separate config.
Quick usage
Run
python run.py --stages collect_stats --training_config conf/training.yaml

This runs collect_stats over the train and valid splits and writes outputs under stats_dir/train and stats_dir/valid.
Configure (in training.yaml)
collect_stats reads the same training.yaml used for training. At minimum:
- stats_dir must be set
- dataset and dataloader define the splits and batching
- model.normalize_conf.stats_file often points to the produced stats file
Example:
stats_dir: ${exp_dir}/stats
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  recipe_dir: ${recipe_dir}
  train:
    - name: train
      data_src: mini_an4/asr
      data_src_args:
        split: train
        data_path: ${dataset_dir}
  valid:
    - name: valid
      data_src: mini_an4/asr
      data_src_args:
        split: valid
        data_path: ${dataset_dir}
dataloader:
  train:
    iter_factory:
      batches:
        shape_files:
          - ${stats_dir}/train/feats_shape
model:
  normalize: global_mvn
  normalize_conf:
    stats_file: ${stats_dir}/train/feats_stats.npz

What it reads
The stage consumes:
- dataset
- dataloader
- model
- stats_dir
- parallel, when configured
Only train and valid splits are used.
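The keys listed above can be checked before launching the stage. The following is a minimal sketch that works on a plain dict (for example, a loaded training.yaml); missing_collect_stats_keys and splits_used are hypothetical helpers written for illustration and are not part of ESPnet3.

```python
from typing import Any, Mapping

# Top-level keys the stage is described as consuming; "parallel" is optional.
REQUIRED_KEYS = ("stats_dir", "dataset", "dataloader", "model")


def missing_collect_stats_keys(cfg: Mapping[str, Any]) -> list[str]:
    """Return required top-level keys that are absent or empty in cfg.

    Hypothetical helper for illustration; it is not an ESPnet3 API.
    """
    return [key for key in REQUIRED_KEYS if not cfg.get(key)]


def splits_used(cfg: Mapping[str, Any]) -> list[str]:
    """List the dataset splits the stage would read (train and valid only)."""
    dataset = cfg.get("dataset") or {}
    return [split for split in ("train", "valid") if split in dataset]


# Toy config: "model" is left out on purpose and "test" is present but unused.
cfg = {
    "stats_dir": "exp/stats",
    "dataset": {"train": [{"name": "train"}], "valid": [{"name": "valid"}], "test": []},
    "dataloader": {"train": {}},
}
print(missing_collect_stats_keys(cfg))  # ['model']
print(splits_used(cfg))                 # ['train', 'valid']
```

A check like this only confirms that the keys exist; it says nothing about whether their contents are valid for the recipe.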
Outputs
Typical outputs:
${stats_dir}/
  train/
    feats_shape
    feats_stats.npz
    stats_keys
  valid/
    feats_shape
    feats_stats.npz
    stats_keys

Notes:
- collect_stats only processes train and valid; test is ignored
- during collect_stats, model.normalize_conf.stats_file is not read as an input source of truth; stats are written under stats_dir
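After the stage finishes, the layout above can be verified with a few lines of Python. This is a sketch under assumptions: missing_stats_outputs is a hypothetical helper, the file names are the typical ones listed above (the feats prefix may differ if the model collects other feature keys), and exp/stats is only a placeholder path.

```python
from pathlib import Path

# Typical per-split outputs as listed in this section.
EXPECTED_FILES = ("feats_shape", "feats_stats.npz", "stats_keys")


def missing_stats_outputs(stats_dir, splits=("train", "valid")):
    """Return the expected output paths that do not exist under stats_dir.

    Hypothetical helper for illustration; it is not an ESPnet3 API.
    """
    root = Path(stats_dir)
    return [
        root / split / name
        for split in splits
        for name in EXPECTED_FILES
        if not (root / split / name).is_file()
    ]


if __name__ == "__main__":
    # Placeholder path; substitute the resolved value of stats_dir.
    for path in missing_stats_outputs("exp/stats"):
        print(f"missing: {path}")
```

An empty result means every expected file is present for both splits; test is deliberately not checked because the stage ignores it.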
Model requirement
The model must support collect_feats(...).
ESPnet task-backed models already do this. Custom models should provide a compatible collect_feats() implementation returning feature tensors and, when needed, matching *_lengths.
Developer notes
What runs under the hood
collect_stats builds the model and trainer, then calls the trainer-side stats-collection path.
The important model contract is collect_feats(...). Task-backed models already provide this. Custom models should return a dict of tensors keyed by feature name, plus any *_lengths entries needed by the batching logic.
Minimal conceptual example:
class MyCustomModel:
    def collect_feats(self, speech, speech_lengths, **kwargs):
        # Pass the raw input through as the feature to collect statistics on.
        feats = speech
        feats_lengths = speech_lengths
        # Keys name the features; *_lengths entries carry the unpadded lengths.
        return {"feats": feats, "feats_lengths": feats_lengths}

This is an ASR-style example, but the same rule applies to any task: return the features whose statistics should be accumulated, plus lengths when batching depends on them.
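To show why the lengths matter, here is a rough sketch of how statistics could be accumulated over collect_feats() outputs. It is a pure-Python illustration, not the ESPnet3 trainer code: accumulate_stats and mean_and_std are hypothetical helpers, nested lists stand in for tensors, and the count / sum / sum-of-squares layout is an assumption about how the statistics might be stored.

```python
import math


def accumulate_stats(batches):
    """Accumulate per-dimension count, sum and sum of squares over valid frames.

    Each batch is a dict shaped like a collect_feats() result: "feats" is a
    nested list indexed [utterance][time][dim], and "feats_lengths" gives the
    number of unpadded frames per utterance.
    """
    count, total, total_sq = 0, None, None
    for batch in batches:
        for utt, length in zip(batch["feats"], batch["feats_lengths"]):
            for frame in utt[:length]:  # frames beyond `length` are padding
                if total is None:
                    total = [0.0] * len(frame)
                    total_sq = [0.0] * len(frame)
                for d, x in enumerate(frame):
                    total[d] += x
                    total_sq[d] += x * x
                count += 1
    return count, total, total_sq


def mean_and_std(count, total, total_sq):
    """Derive per-dimension mean and standard deviation from the accumulators."""
    mean = [s / count for s in total]
    std = [math.sqrt(max(sq / count - m * m, 0.0)) for sq, m in zip(total_sq, mean)]
    return mean, std


# Two utterances with 2-dim features; the second one has one padded frame.
batch = {
    "feats": [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [0.0, 0.0]]],
    "feats_lengths": [2, 1],
}
count, total, total_sq = accumulate_stats([batch])
mean, std = mean_and_std(count, total, total_sq)
print(count, mean)  # 3 [3.0, 4.0]
```

Without feats_lengths the padded frame would be counted as real data and pull the mean toward zero, which is why a custom collect_feats() should return lengths whenever inputs are padded.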
For more background on why these files exist and how they are reused later, see Collect stats overview.
