espnet3.components.data.dataset.CombinedDataset
class espnet3.components.data.dataset.CombinedDataset(datasets: List[Any], transforms: List[Tuple[Callable, Callable]], use_espnet_preprocessor: bool = False)
Bases: object
Combines multiple datasets into a single unified dataset-like interface.
This class supports seamless access to multiple datasets as if they were one. Each dataset can be paired with a transform and a global preprocessor, which are applied sequentially to each sample. It also supports optional UID handling for ESPnet-style preprocessing.
Indexing modes:
- Numeric mode (default): every underlying dataset accepts integer indices and the combined dataset behaves like a contiguous sequence.
- String mode: if any dataset requires string-based utterance IDs, the organizer builds a lookup table mapping every UID to its source dataset while preserving DataLoader-friendly integer access.
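The two indexing modes can be illustrated with a small self-contained sketch (toy code under assumed behavior, not the actual espnet3 implementation; `ToyDataset` and `ToyCombined` are hypothetical names):

```python
class ToyDataset:
    """A toy dataset addressable by integer index or by utterance ID."""

    def __init__(self, samples):
        # samples: dict mapping uid -> sample dict
        self.uids = list(samples)
        self.samples = samples

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.samples[key]          # string access by UID
        return self.samples[self.uids[key]]   # integer access


class ToyCombined:
    """Contiguous numeric view plus a UID lookup table over datasets."""

    def __init__(self, datasets):
        self.datasets = datasets
        self.lengths = [len(d) for d in datasets]
        # String mode: map every UID to its source dataset.
        self.uid_table = {u: d for d in datasets for u in d.uids}

    def __len__(self):
        return sum(self.lengths)

    def __getitem__(self, index):
        if isinstance(index, str):
            return self.uid_table[index][index]
        # Numeric mode: walk the datasets to find the owner of `index`.
        for d, n in zip(self.datasets, self.lengths):
            if index < n:
                return d[index]
            index -= n
        raise IndexError("index out of range for combined dataset")
```

Either way, `ToyCombined([ds1, ds2])[2]` and `ToyCombined([ds1, ds2])["utt3"]` resolve to the same underlying sample, which is what keeps the combined view DataLoader-friendly.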
Parameters:
datasets (List[Any]) – A list of dataset instances. Each must implement __getitem__ and __len__.
transforms (List[Tuple[Callable, Callable]]) – A list of (transform, preprocessor) tuples. Each pair corresponds to the matching dataset in datasets.
- transform(sample) is applied first.
- Then preprocessor(uid, sample) or preprocessor(sample) is applied, depending on use_espnet_preprocessor.
use_espnet_preprocessor (bool) – If True, applies the preprocessor as preprocessor(uid, sample). This is used for ESPnet AbsPreprocessor-compatible pipelines.
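The two calling conventions can be sketched as follows (`apply_pair` is a hypothetical helper for illustration, not part of the espnet3 API):

```python
def apply_pair(uid, raw_sample, transform, preprocessor, use_espnet_preprocessor):
    """Apply one (transform, preprocessor) pair to a single sample."""
    sample = transform(raw_sample)          # transform runs first
    if use_espnet_preprocessor:
        # ESPnet AbsPreprocessor-style signature: (uid, sample)
        return preprocessor(uid, sample)
    return preprocessor(sample)             # plain single-argument style
```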
NOTE
At initialization, the first sample from each dataset is passed through its associated transform to check that all datasets produce dictionaries with the same set of keys. This ensures consistency across the combined dataset. An AssertionError is raised if the keys differ.
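The check described in the note can be sketched like this (illustrative only; `check_output_keys` is a made-up name, not the actual espnet3 internal):

```python
def check_output_keys(datasets, transforms):
    """Transform the first sample of each dataset and assert that all
    datasets produce dictionaries with the same set of keys."""
    key_sets = [set(t(ds[0]).keys()) for ds, (t, _) in zip(datasets, transforms)]
    reference = key_sets[0]
    for keys in key_sets[1:]:
        assert keys == reference, f"Inconsistent output keys: {keys} != {reference}"
```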
- Raises:
- IndexError – If a requested index is outside the range of the combined dataset.
- ValueError – If index is a non-integer string that none of the underlying datasets accept as an utterance ID.
- RuntimeError – If get_text() or shard() is called but not supported.
- AssertionError – If output keys from different datasets are inconsistent.
Example
>>> dataset = CombinedDataset(
... datasets=[ds1, ds2],
... transforms=[
... (transform1, preprocessor),
... (transform2, preprocessor),
... ],
... use_espnet_preprocessor=True
... )
>>> sample = dataset[5]
>>> print(sample["text"])
Initialize CombinedDataset object.
get_text(idx)
Retrieve the target text string for a given index.
This method delegates to the underlying dataset’s get_text(idx) method. It is typically used for extracting text sequences for purposes such as training tokenizers or language models.
- Raises: RuntimeError – If not all datasets implement get_text(idx).
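The delegation can be sketched as below (toy code; `TextDataset` and `combined_get_text` are hypothetical names standing in for the real method on CombinedDataset):

```python
class TextDataset:
    """A toy dataset exposing get_text(idx)."""

    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def get_text(self, idx):
        return self.texts[idx]


def combined_get_text(datasets, idx):
    """Forward get_text(idx) to the dataset that owns the global index."""
    if not all(hasattr(ds, "get_text") for ds in datasets):
        raise RuntimeError("all datasets must implement get_text(idx)")
    for ds in datasets:
        if idx < len(ds):
            return ds.get_text(idx)
        idx -= len(ds)
    raise IndexError(idx)
```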
shard(shard_idx: int)
Return a sharded version of the combined dataset.
This is used when handling large datasets that are split into shards for efficiency and distributed processing (ESPnet multiple-iterator mode). All datasets must be subclasses of espnet3.data.dataset.ShardedDataset, and implement a shard() method.
- Parameters: shard_idx (int) – Index of the shard to retrieve.
- Returns: A new CombinedDataset containing the sharded datasets.
- Return type: CombinedDataset
- Raises: RuntimeError – If any dataset does not support sharding.
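The sharding contract can be illustrated with a toy round-robin scheme (a sketch under assumed behavior; `ShardableToy` and `shard_all` are made-up names, and the real requirement is subclassing espnet3's ShardedDataset):

```python
class ShardableToy:
    """A toy dataset that can split itself into round-robin shards."""

    def __init__(self, items, num_shards=2):
        self.items = items
        self.num_shards = num_shards

    def __len__(self):
        return len(self.items)

    def shard(self, shard_idx):
        # One common scheme: shard i keeps items i, i+n, i+2n, ...
        return ShardableToy(self.items[shard_idx::self.num_shards], self.num_shards)


def shard_all(datasets, shard_idx):
    """Rebuild a combined view from the shard_idx-th shard of each dataset."""
    for ds in datasets:
        if not hasattr(ds, "shard"):
            raise RuntimeError("all datasets must support shard()")
    return [ds.shard(shard_idx) for ds in datasets]
```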
property use_espnet_collator
Get the flag indicating whether to use ESPnet collator.
