espnet3.components.data.dataset.CombinedDataset
class espnet3.components.data.dataset.CombinedDataset(datasets: List[Any], transforms: List[Tuple[Callable, Callable]], use_espnet_preprocessor: bool = False)
Bases: object
Combines multiple datasets into a single unified dataset-like interface.
This class supports seamless access to multiple datasets as if they were one. Each dataset can be paired with a transform and a global preprocessor, which are applied sequentially to each sample. It also supports optional UID handling for ESPnet-style preprocessing.
Indexing modes:
* Numeric mode (default): every underlying dataset accepts integer indices and the combined dataset behaves like a contiguous sequence.
* String mode: if any dataset requires string-based utterance IDs, the organizer builds a lookup table mapping every UID to its source dataset while preserving DataLoader-friendly integer access.
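The two indexing modes can be pictured with a small sketch. This is not the ESPnet implementation: the helper names `locate` and `build_uid_table` are hypothetical, and the sketch assumes that numeric mode resolves a global index through cumulative dataset lengths and that string mode keeps a plain dictionary from UID to dataset position.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, lengths):
    """Map a global integer index to (dataset_idx, local_idx).

    Hypothetical helper: negative indices are rejected here for simplicity.
    """
    offsets = list(accumulate(lengths))  # running totals, e.g. [3, 5]
    total = offsets[-1] if offsets else 0
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for {total} samples")
    # First dataset whose running total exceeds the index owns the sample.
    dataset_idx = bisect_right(offsets, index)
    start = offsets[dataset_idx - 1] if dataset_idx else 0
    return dataset_idx, index - start


def build_uid_table(uid_lists):
    """Map every utterance ID to the position of the dataset that owns it."""
    table = {}
    for dataset_idx, uids in enumerate(uid_lists):
        for uid in uids:
            table[uid] = dataset_idx
    return table


lengths = [3, 2]  # two datasets with 3 and 2 samples
print(locate(2, lengths))  # (0, 2): last sample of the first dataset
print(locate(3, lengths))  # (1, 0): first sample of the second dataset
print(build_uid_table([["utt1", "utt2"], ["utt3"]])["utt3"])  # 1
```

Precomputing the running totals once would avoid rebuilding them on every lookup; the sketch recomputes them only to stay short.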
Parameters:
- datasets (List[Any]) – A list of dataset instances. Each must implement __getitem__ and __len__.
- transforms (List[Tuple[Callable, Callable]]) – A list of (transform, preprocessor) tuples. Each pair corresponds to the matching dataset in datasets.
  - transform(sample) is applied first.
  - Then preprocessor(uid, sample) or preprocessor(sample) is applied, depending on use_espnet_preprocessor.
- use_espnet_preprocessor (bool) – If True, applies the preprocessor as preprocessor(uid, sample). This is used for ESPnet AbsPreprocessor-compatible pipelines.
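The order in which the two callables run can be sketched as follows. The function `apply_pipeline` and the toy callables are hypothetical names introduced for illustration; only the call order and the two preprocessor signatures come from the parameter descriptions above.

```python
def apply_pipeline(uid, sample, transform, preprocessor,
                   use_espnet_preprocessor=False):
    """Apply transform first, then the preprocessor in the selected style."""
    sample = transform(sample)
    if use_espnet_preprocessor:
        # ESPnet-style preprocessors also receive the utterance ID.
        return preprocessor(uid, sample)
    return preprocessor(sample)


def lowercase_text(sample):
    """Toy transform: lowercase the transcript."""
    return {**sample, "text": sample["text"].lower()}


def add_length(sample):
    """Toy preprocessor taking only the sample."""
    return {**sample, "length": len(sample["text"])}


def add_length_with_uid(uid, sample):
    """Toy preprocessor taking (uid, sample)."""
    return {**sample, "length": len(sample["text"]), "uid": uid}


raw = {"text": "HELLO"}
print(apply_pipeline("utt1", raw, lowercase_text, add_length))
print(apply_pipeline("utt1", raw, lowercase_text, add_length_with_uid,
                     use_espnet_preprocessor=True))
```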
NOTE
At initialization, the first sample from each dataset is passed through its associated transform to check that all datasets produce dictionaries with the same set of keys. This ensures consistency across the combined dataset. An AssertionError is raised if the keys differ.
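A minimal sketch of such a key check, assuming each dataset can be indexed with 0 and that only the transform (not the preprocessor) is applied before comparing keys, as the note describes. The function name `check_consistent_keys` is hypothetical, and the real check may differ in detail.

```python
def check_consistent_keys(datasets, transforms):
    """Assert that every dataset's first transformed sample has the same keys."""
    reference = None
    for dataset, (transform, _preprocessor) in zip(datasets, transforms):
        keys = set(transform(dataset[0]).keys())
        if reference is None:
            reference = keys  # keys of the first dataset become the reference
        assert keys == reference, (
            f"inconsistent keys: {sorted(keys)} vs {sorted(reference)}"
        )
    return reference


def identity(sample):
    return sample


ds1 = [{"speech": [0.1, 0.2], "text": "hello"}]
ds2 = [{"speech": [0.3], "text": "world"}]
pairs = [(identity, identity), (identity, identity)]
print(sorted(check_consistent_keys([ds1, ds2], pairs)))  # ['speech', 'text']
```

Checking only the first sample keeps initialization cheap, at the cost of not catching inconsistencies that appear later in a dataset.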
- Raises:
- IndexError – If a requested index is outside the range of the combined dataset.
- ValueError – If index is a non-integer string that none of the underlying datasets accept as an utterance ID.
- RuntimeError – If shard() is called but not supported.
- AssertionError – If output keys from different datasets are inconsistent.
Example
>>> dataset = CombinedDataset(
... datasets=[ds1, ds2],
... transforms=[
... (transform1, preprocessor),
... (transform2, preprocessor),
... ],
... use_espnet_preprocessor=True
... )
>>> sample = dataset[5]
>>> print(sample["text"])
Initialize CombinedDataset object.
shard(shard_idx: int)
Return a sharded version of the combined dataset.
This is used when handling large datasets that are split into shards for efficiency and distributed processing (ESPnet multiple-iterator mode). All datasets must be subclasses of espnet3.data.dataset.ShardedDataset, and implement a shard() method.
- Parameters: shard_idx (int) – Index of the shard to retrieve.
- Returns: A new CombinedDataset containing the sharded datasets.
- Return type: CombinedDataset
- Raises: RuntimeError – If any dataset does not support sharding.
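The sharding contract can be illustrated with a toy sketch. It does not import ESPnet: `ToyShardedDataset` and `shard_all` are hypothetical stand-ins, and the sketch checks for a callable shard() method by duck typing, whereas the documentation above requires ShardedDataset subclasses.

```python
class ToyShardedDataset:
    """Hypothetical dataset that stores its samples pre-split into shards."""

    def __init__(self, shards):
        self.shards = shards

    def shard(self, shard_idx):
        # Return a new dataset holding only the requested shard.
        return ToyShardedDataset([self.shards[shard_idx]])

    def __len__(self):
        return sum(len(s) for s in self.shards)

    def __getitem__(self, index):
        flat = [item for s in self.shards for item in s]
        return flat[index]


def shard_all(datasets, shard_idx):
    """Shard every dataset, refusing if any of them cannot be sharded."""
    for dataset in datasets:
        if not callable(getattr(dataset, "shard", None)):
            raise RuntimeError(
                f"{type(dataset).__name__} does not support sharding"
            )
    return [dataset.shard(shard_idx) for dataset in datasets]


a = ToyShardedDataset([["a0", "a1"], ["a2"]])
b = ToyShardedDataset([["b0"], ["b1", "b2"]])
sharded = shard_all([a, b], shard_idx=1)
print([len(d) for d in sharded])  # [1, 2]
```

Validating all datasets before sharding any of them means a failure leaves no partially sharded state behind.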
property use_espnet_collator
Get the flag indicating whether to use ESPnet collator.
