espnet3.components.data.collect_stats.collect_stats
espnet3.components.data.collect_stats.collect_stats(model_config, dataset_config, dataloader_config, mode: str, output_dir: Path, task: str | None = None, parallel_config: DictConfig | None = None, write_collected_feats: bool = False, batch_size: int = 4)
Collect dataset statistics used by feature normalization stages.
This is the public entry point for espnet3 collect-stats execution. It builds batches from the selected dataset split, runs model.collect_feats(...) over the full split, and writes aggregated *_stats.npz files under output_dir / mode. When requested, it also writes SCP-backed collected feature dumps under collect_feats/.
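The aggregation step described above can be pictured with a small NumPy sketch. This is an illustration only, not the espnet3 implementation: the helper name `accumulate_stats`, the `(batch, time, dim)` feature layout, and the per-utterance `lengths` list are all assumptions made for the example.

```python
import numpy as np


def accumulate_stats(batches):
    """Accumulate count, sum, and sum of squares over all valid frames.

    `batches` is assumed to yield (feats, lengths) pairs, where feats has
    shape (batch, time, dim) and lengths gives the unpadded frame counts.
    """
    count = 0
    total = None
    total_sq = None
    for feats, lengths in batches:
        for seq, n in zip(feats, lengths):
            valid = seq[:n]  # drop padded frames
            if total is None:
                # Allocate accumulators once the feature dimension is known.
                total = np.zeros(valid.shape[1], dtype=np.float64)
                total_sq = np.zeros(valid.shape[1], dtype=np.float64)
            count += n
            total += valid.sum(axis=0)
            total_sq += (valid ** 2).sum(axis=0)
    return count, total, total_sq
```

Keeping only these three running quantities means the full split never has to be held in memory, and partial results from parallel workers can be combined by simple addition.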
- Parameters:
  - model_config – Configuration object used to instantiate the model that extracts features from the input examples.
  - dataset_config – Configuration of the dataset organizer providing the split specified by mode.
  - dataloader_config – Dataloader configuration. If <mode> contains multiple_iterator, this function raises because espnet3 does not support that mode here.
  - mode – Name of the dataset split to process (train or valid).
  - output_dir – Directory where aggregated statistics and optionally collected features are written.
  - task – Name of the ESPnet task. If None, model_config should be directly instantiable.
  - parallel_config – Configuration for parallel execution.
  - write_collected_feats – Whether to persist the raw collected features.
  - batch_size – Number of dataset items processed per batch.
- Returns: This function writes outputs to disk and does not return the aggregated arrays.
- Return type: None
- Raises: RuntimeError – If the selected dataloader mode uses multiple_iterator.
Notes
Output files are written under output_dir / mode. For each feature key, the function writes {key}_stats.npz with count, sum, and sum_square arrays. It also writes a stats_keys file listing the aggregated feature keys.
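As a rough sketch of how a downstream normalization stage might consume one of these files, the snippet below derives a per-dimension mean and standard deviation from count, sum, and sum_square. The helper name `load_mean_std`, the file name `feats_stats.npz`, and the assumption that sum and sum_square are per-dimension vectors are hypothetical; the demo writes its own small file rather than relying on real collect_stats output.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_mean_std(stats_path, eps=1e-20):
    """Derive mean and std from count, sum, and sum_square in an .npz file."""
    with np.load(stats_path) as stats:
        count = stats["count"]
        mean = stats["sum"] / count
        # Var[x] = E[x^2] - E[x]^2, clamped to avoid negative round-off.
        var = stats["sum_square"] / count - mean ** 2
    std = np.sqrt(np.maximum(var, eps))
    return mean, std


# Demo with a synthetic stats file (two frames, two feature dimensions).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feats_stats.npz"  # hypothetical key name "feats"
    x = np.array([[1.0, 2.0], [3.0, 6.0]])
    np.savez(path, count=len(x), sum=x.sum(axis=0), sum_square=(x ** 2).sum(axis=0))
    mean, std = load_mean_std(path)
```

The standard deviation here is the population form (dividing by count), which is the natural choice when only the three aggregated quantities are stored.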
Examples
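    from pathlib import Path

    from espnet3.components.data.collect_stats import collect_stats

    # model_config, dataset_config, and dataloader_config are assumed
    # to have been loaded beforehand.
    collect_stats(
        model_config=model_config,
        dataset_config=dataset_config,
        dataloader_config=dataloader_config,
        mode="train",
        output_dir=Path("exp/asr_stats"),
        task="asr",
        batch_size=8,
    )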
