espnet3.components.data.dataset.ShardedDataset
class espnet3.components.data.dataset.ShardedDataset
Bases: ABC, Dataset
Abstract base class for datasets that support sharding.
This interface is used when datasets are split into shards for parallel or distributed data loading. Any dataset subclassing ShardedDataset must implement the shard() method.
num_shards
Total number of shards in the dataset.
- Type: int
world_shard_size
Expected distributed world size when sharding.
- Type: int
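To make the relationship between the two attributes concrete, the sketch below spreads num_shards shards over world_shard_size ranks in round-robin order. The helper shards_for_rank is hypothetical and not part of ESPnet. The round-robin assignment is an assumption made for illustration, and the assignment ESPnet actually uses may differ.

```python
def shards_for_rank(rank: int, num_shards: int, world_shard_size: int) -> list[int]:
    """Hypothetical helper: shard indices handled by one rank (round-robin)."""
    if not 0 <= rank < world_shard_size:
        raise ValueError(f"rank {rank} is outside [0, {world_shard_size})")
    # Rank r takes shards r, r + world_shard_size, r + 2 * world_shard_size, ...
    return list(range(rank, num_shards, world_shard_size))


# With 8 shards and a world size of 4, each rank handles 2 shards.
assignment = {rank: shards_for_rank(rank, 8, 4) for rank in range(4)}
```

Choosing num_shards as a multiple of world_shard_size keeps the per-rank load even under this scheme.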
NOTE
- This class is intended to be used with CombinedDataset in ESPnet.
- All datasets combined must subclass ShardedDataset if sharding is used.
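The second note can be enforced with an up-front check before datasets are combined. The sketch below is an assumption about how such a check might look, not CombinedDataset's actual code. ShardedDatasetStandIn and check_all_sharded are hypothetical names, and the stand-in class replaces the real ShardedDataset only so that the snippet runs without ESPnet installed.

```python
from abc import ABC, abstractmethod


class ShardedDatasetStandIn(ABC):
    """Hypothetical stand-in for ShardedDataset (no ESPnet or PyTorch needed)."""

    @abstractmethod
    def shard(self, idx: int):
        ...


def check_all_sharded(datasets) -> None:
    """Raise TypeError if any dataset does not subclass the sharded interface."""
    bad = [type(d).__name__ for d in datasets
           if not isinstance(d, ShardedDatasetStandIn)]
    if bad:
        raise TypeError(f"Datasets without sharding support: {bad}")


class Sharded(ShardedDatasetStandIn):
    """Minimal dataset that satisfies the interface."""

    def shard(self, idx: int):
        return self


class Plain:
    """Dataset-like class that does not support sharding."""


check_all_sharded([Sharded(), Sharded()])  # passes silently
```

Passing a list that contains a Plain instance raises TypeError and names the offending class, which surfaces the problem before any data loading starts.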
Example
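>>> from torch.utils.data import Subset
>>> class MyDataset(ShardedDataset):
...     def __init__(self):
...         self.num_shards = 8
...         self.world_shard_size = 4
...     def shard(self, idx):
...         # shard_indices (assumed defined elsewhere): shard index -> sample indices
...         return Subset(self, shard_indices[idx])
shard(idx: int)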
Return a new dataset shard corresponding to the given index.
This method must be implemented by subclasses to return a subset of the data for sharded training or evaluation.
- Parameters: idx (int) – The index of the shard to return.
- Returns: A dataset instance representing the shard.
- Return type: Dataset
- Raises: NotImplementedError – Always raised in the base class; subclasses must override this method.
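As a fuller illustration of the contract described above, the sketch below splits an in-memory dataset into contiguous shards. It assumes nothing about ESPnet internals: ShardedBase is a hypothetical stand-in whose shard() raises NotImplementedError, as the base class is documented to do, and ListDataset with its contiguous split is only one possible sharding strategy.

```python
class ShardedBase:
    """Hypothetical stand-in for ShardedDataset, so the sketch runs standalone."""

    def shard(self, idx: int):
        raise NotImplementedError("Subclasses must implement shard().")


class ListDataset(ShardedBase):
    """Illustrative dataset that splits a list into contiguous shards."""

    def __init__(self, items, num_shards: int):
        self.items = list(items)
        self.num_shards = num_shards

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.items[i]

    def shard(self, idx: int):
        if not 0 <= idx < self.num_shards:
            raise IndexError(f"shard index {idx} out of range")
        # Ceiling division so every item lands in exactly one shard.
        size = -(-len(self.items) // self.num_shards)
        start = idx * size
        # Each shard is itself a single-shard dataset of the same type.
        return ListDataset(self.items[start:start + size], num_shards=1)


dataset = ListDataset(range(10), num_shards=4)
shards = [dataset.shard(i) for i in range(dataset.num_shards)]
```

Because the shard size is rounded up, the last shard can be shorter than the others (or empty when there are fewer items than shards); a real implementation may prefer to balance shard sizes instead.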
