espnet2.train.dataset.ESPnetMultiTaskDataset
class espnet2.train.dataset.ESPnetMultiTaskDataset(path_name_type_list: Collection[Tuple[str, str, str]], key_file: str | None = None, **kwargs)
Bases: AbsDataset
ESPnetMultiTaskDataset is the top-level Dataset object that manages multiple EspnetSpeechLMDataset instances, each serving a specific task and dataset. This class queries all the EspnetSpeechLMDataset instances and combines examples from different tasks for multi-task training. It is typically used in ESPnet SpeechLM models.
For detailed usage, refer to: <espnet>/egs2/TEMPLATE/speechlm1#data-loading-and-preprocessing
key_dict
A dictionary mapping example IDs to None, used to filter examples based on the key file.
- Type: Optional[Dict[str, None]]
iterator_map
A mapping from example IDs (with task prefixes) to the dataset instances that hold them.
- Type: Dict[str, EspnetSpeechLMDataset]
datasets
A list of dataset instances.
- Type: List[EspnetSpeechLMDataset]
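The relationship between datasets and iterator_map can be illustrated with a minimal sketch. This is not the ESPnet implementation: ToySubDataset and ToyMultiTaskDataset are hypothetical stand-ins, and the "<task>_<example id>" key format is an assumption made for illustration only.

```python
from typing import Dict, List, Tuple


class ToySubDataset:
    """Hypothetical stand-in for EspnetSpeechLMDataset; not the real class."""

    def __init__(self, task: str, examples: Dict[str, dict]):
        self.task = task
        self.examples = examples  # example ID (no task prefix) -> data

    def __getitem__(self, uid: str) -> Tuple[str, dict]:
        return uid, self.examples[uid]


class ToyMultiTaskDataset:
    """Routes a task-prefixed example ID to the sub-dataset that owns it."""

    def __init__(self, datasets: List[ToySubDataset]):
        self.datasets = datasets
        self.iterator_map: Dict[str, ToySubDataset] = {}
        for ds in datasets:
            for uid in ds.examples:
                # Assumed ID convention: "<task>_<example id>".
                self.iterator_map[f"{ds.task}_{uid}"] = ds

    def __getitem__(self, uid: str) -> Tuple[str, dict]:
        ds = self.iterator_map[uid]
        # Strip the "<task>_" prefix before querying the sub-dataset.
        _, data = ds[uid[len(ds.task) + 1:]]
        return uid, data


asr = ToySubDataset("asr", {"utt1": {"text": "hello"}})
tts = ToySubDataset("tts", {"utt1": {"text": "world"}})
multi = ToyMultiTaskDataset([asr, tts])
uid, data = multi["tts_utt1"]
print(uid, data)
```

Keeping one flat mapping means a lookup does not need to scan the sub-datasets, and two tasks can reuse the same underlying example ID without colliding.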
Parameters:
- path_name_type_list (Collection[Tuple[str, str, str]]) – A collection of tuples, each containing the path to the dataset, a name, and the type of dataset.
- key_file (str, optional) – A path to a file containing keys to filter examples. Defaults to None.
- **kwargs – Additional keyword arguments to pass to the dataset constructors.
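How a key file could populate key_dict and filter examples is sketched below. The helpers load_key_dict and filter_examples are hypothetical, and the file layout (one entry per line, with the example ID as the first whitespace-separated token) is an assumption rather than something this page specifies.

```python
import os
import tempfile
from typing import Dict, List, Optional


def load_key_dict(key_file: Optional[str]) -> Dict[str, None]:
    """Read a key file into a dict mapping example IDs to None.

    Assumption: one entry per line, ID is the first whitespace-separated token.
    """
    key_dict: Dict[str, None] = {}
    if key_file is None:
        return key_dict
    with open(key_file, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:  # skip blank lines
                key_dict[parts[0]] = None
    return key_dict


def filter_examples(example_ids: List[str], key_dict: Dict[str, None]) -> List[str]:
    """Keep only listed IDs; an empty key_dict means no filtering."""
    if not key_dict:
        return list(example_ids)
    return [uid for uid in example_ids if uid in key_dict]


# Write a small key file to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "keys.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("asr_utt1 extra\n\nasr_utt3\n")
    keys = load_key_dict(path)

kept = filter_examples(["asr_utt1", "asr_utt2", "asr_utt3"], keys)
print(kept)
```

A dict with None values behaves like an ordered set here: membership tests are constant time and the order of the key file is preserved.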
######### Examples
>>> from espnet2.train.dataset import ESPnetMultiTaskDataset
>>> dataset = ESPnetMultiTaskDataset(
...     path_name_type_list=[
...         ('dataset1.json', 'dataset1', 'dataset_json'),
...         ('dataset2.json', 'dataset2', 'dataset_json'),
...     ],
...     key_file='keys.txt',
... )
>>> uid, data = dataset['task_example_id']
- Raises: AssertionError – If a non-JSON triplet is encountered in path_name_type_list.
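The check behind this AssertionError can be sketched as follows. The function check_triplets is hypothetical and only mirrors the documented behaviour (every triplet must have type "dataset_json"); the real constructor may differ in its error message and in what else it validates.

```python
from typing import Collection, Tuple


def check_triplets(path_name_type_list: Collection[Tuple[str, str, str]]) -> None:
    """Assert that every (path, name, type) triplet has type "dataset_json"."""
    for triplet in path_name_type_list:
        _path, _name, _type = triplet
        assert _type == "dataset_json", f"Non-JSON triplet: {triplet}"


# A valid list passes silently.
check_triplets([("dataset1.json", "dataset1", "dataset_json")])

# A triplet of another type triggers the assertion.
try:
    check_triplets([("wav.scp", "speech", "sound")])
    rejected = False
except AssertionError as err:
    rejected = True
    print(err)
```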
NOTE
The example_list is intended for the sub-datasets and holds example IDs without a task prefix.
has_name(name) → bool
Checks if the given name is present in the dataset.
- Parameters: name (str) – The name to check for existence in the dataset.
- Returns: True if the name exists in the dataset, False otherwise.
- Return type: bool
######### Examples
>>> dataset = ESPnetMultiTaskDataset(...)
>>> dataset.has_name('example_name')
True
>>> dataset.has_name('nonexistent_name')
False
names() → Tuple[str, ...]
ESPnetMultiTaskDataset is the top-level Dataset object that manages multiple EspnetSpeechLMDataset objects, each serving a specific task and dataset. This object queries all these EspnetSpeechLMDataset instances and combines examples from different tasks for multi-task training. Typically, this dataset is used in ESPnet SpeechLM models.
See details in: <espnet>/egs2/TEMPLATE/speechlm1#data-loading-and-preprocessing
key_dict
A dictionary mapping example keys to None, used for filtering examples based on a key file.
- Type: dict
iterator_map
A mapping from example identifiers (with task prefixes) to their corresponding dataset instances.
- Type: dict
datasets
A list of EspnetSpeechLMDataset instances.
- Type: list
Parameters:
- path_name_type_list (Collection[Tuple[str, str, str]]) – A collection of tuples, each containing the path to a dataset, a name, and a type.
- key_file (str, optional) – A path to a key file for filtering examples. Defaults to None.
- **kwargs – Additional keyword arguments to pass to the dataset constructor.
Returns: None
######### Examples
>>> dataset = ESPnetMultiTaskDataset(
...     path_name_type_list=[
...         ("path/to/dataset1.json", "dataset1", "dataset_json"),
...         ("path/to/dataset2.json", "dataset2", "dataset_json"),
...     ],
...     key_file="path/to/key_file.txt"
... )
>>> uid, data = dataset["task1_example_id"]
>>> print(data)
- Raises:
- AssertionError – If a triplet in path_name_type_list is not of type “dataset_json”.
- FileNotFoundError – If the specified key file or dataset JSON file is not found.
NOTE
This class provides a single interface over multiple datasets, so that examples from different tasks can be retrieved through one object.