ESPnet3 Train Dataset
ESPnet3 expects a DataOrganizer-based dataset config for training. The typical flow is:
- Write a dataset class (any backend is fine).
- Configure it under `dataset` in `train.yaml`.
- Instantiate via Hydra and iterate train/valid/test.
1) Write your dataset class
You can build datasets using whatever backend you prefer, such as Hugging Face Datasets, Lhotse, or Arkive.
Here is a minimal ASR-style example. The dataset receives manifest_path from the config, loads entries once, and indexes into them in __getitem__.
```python
from pathlib import Path

import numpy as np


class MiniAN4Dataset:
    def __init__(self, manifest_path):
        self.manifest_path = Path(manifest_path)
        # Parse the manifest once at construction time.
        self._entries = _read_manifest(self.manifest_path)

    def __getitem__(self, idx):
        entry = self._entries[int(idx)]
        return {
            "speech": np.asarray(entry["array"], dtype=np.float32),
            "text": entry["text"],
        }

    def __len__(self):
        # Example: return 100 if the manifest has 100 entries.
        return len(self._entries)
```

2) Configure it in train.yaml
Each list item maps to a DatasetConfig entry. DataOrganizer will:
- Combine the `train` and `valid` lists into per-split datasets.
- Keep `test` as named datasets for inference/evaluation.
```yaml
dataset:
  _target_: espnet3.components.data.data_organizer.DataOrganizer
  train:
    - name: train_nodev
      dataset:
        _target_: src.dataset.MiniAN4Dataset
        manifest_path: ${dataset_dir}/manifest/train_nodev.tsv
    - name: train_2
      dataset:
        _target_: src.dataset.MiniAN4Dataset
        manifest_path: ${dataset_dir}/manifest/train_2.tsv
  valid:
    - name: train_dev
      dataset:
        _target_: src.dataset.MiniAN4Dataset
        manifest_path: ${dataset_dir}/manifest/train_dev.tsv
    - name: dev_2
      dataset:
        _target_: src.dataset.MiniAN4Dataset
        manifest_path: ${dataset_dir}/manifest/dev_2.tsv
  test:
    - name: test
      dataset:
        _target_: src.dataset.MiniAN4Dataset
        manifest_path: ${dataset_dir}/manifest/test.tsv
    - name: test_2
      dataset:
        _target_: src.dataset.MiniAN4Dataset
        manifest_path: ${dataset_dir}/manifest/test_2.tsv
  preprocessor:
    _target_: espnet2.train.preprocessor.CommonPreprocessor
    token_type: bpe
    token_list: ${tokenizer.save_path}/tokens.txt
    bpemodel: ${tokenizer.save_path}/bpe.model
```

Notes:
- `train` and `valid` must be both present or both omitted. `test` is optional and is typically used by `infer`/`measure`, not by `train`.
- `preprocessor` is used when you want to reuse ESPnet2 preprocessors. Choose from the implementations in espnet2/train/preprocessor.py or implement your own with the same input/output contract.
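If you implement your own preprocessor, it should match the ESPnet2 calling contract: a callable that takes an utterance id and a dict of features and returns a dict of the same shape. A minimal hypothetical sketch (the class name and the lowercasing behavior are invented for illustration, not part of ESPnet):

```python
class LowercaseTextPreprocessor:
    """Hypothetical preprocessor following the ESPnet2 calling contract:
    __call__(uid, data) -> data, where data maps feature names to values.
    """

    def __call__(self, uid, data):
        out = dict(data)  # copy so the caller's dict is left untouched
        if "text" in out:
            out["text"] = out["text"].lower()
        return out
```

Point `preprocessor._target_` at a class like this in `train.yaml` and DataOrganizer will apply it to each sample.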
3) Instantiate and use in Python
This is the basic way to use DataOrganizer in Python: instantiate the config, iterate each split, and access named test sets when needed.
```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/train.yaml")
organizer = instantiate(cfg.dataset)
```

Train split
Iterate the training data. If you listed multiple train datasets in the config, DataOrganizer joins them into one long list, so you can treat it as a single dataset. In this example, each train dataset has 100 items, so len(organizer.train) becomes 200.
```python
for sample in organizer.train:
    print(sample)
    break
# Example:
# {"speech": np.ndarray(...), "text": "SOME TEXT"}

assert len(organizer.train) == 200  # two 100-item train sets combined
```

Valid split
Same idea as train, but for validation data. With two 100-item valid datasets, len(organizer.valid) becomes 200.
```python
for sample in organizer.valid:
    pass

assert len(organizer.valid) == 200  # two 100-item valid sets combined
```

Test sets
Test sets are kept separately by name, so you can pick a specific test set or loop over all of them.
```python
# Loop over every named test set.
for name, test_set in organizer.test.items():
    for sample in test_set:
        pass

# Or pick a specific test set by name.
for sample in organizer.test["test"]:
    pass
for sample in organizer.test["test_2"]:
    pass

assert len(organizer.test["test"]) == 100
assert len(organizer.test["test_2"]) == 100
```

UID + sample mode
If you use ESPnet's collate function, ESPnet3 automatically switches to (uid, sample) pairs for train/valid.
```python
organizer.train.use_espnet_collator = True
organizer.valid.use_espnet_collator = True

for uid, sample in organizer.train:
    pass
```