espnet2.train.dataset.ESPnetMultiTaskDataset

class espnet2.train.dataset.ESPnetMultiTaskDataset(path_name_type_list: Collection[Tuple[str, str, str]], key_file: str | None = None, **kwargs)

Bases: AbsDataset

ESPnetMultiTaskDataset is the top-level Dataset object that manages multiple EspnetSpeechLMDataset instances, each serving a specific task and dataset. This class queries all the EspnetSpeechLMDataset instances and combines examples from different tasks for multi-task training. It is typically used in ESPnet SpeechLM models. For detailed usage, refer to: <espnet>/egs2/TEMPLATE/speechlm1#data-loading-and-preprocessing.
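The dispatch idea described above can be pictured with a small, self-contained sketch. It is not the ESPnet implementation: `ToySubDataset` and `ToyMultiTaskDataset` are hypothetical stand-ins, and the `<task>_<id>` prefix format is an assumption made only for illustration.

```python
from typing import Any, Dict, List, Tuple


class ToySubDataset:
    """Hypothetical stand-in for one per-task dataset (not the ESPnet class)."""

    def __init__(self, task: str, examples: Dict[str, Any]):
        self.task = task
        # Assumption: prefix each example ID with the task name so that IDs
        # stay unique when several tasks share the same utterance IDs.
        self.examples = {f"{task}_{uid}": value for uid, value in examples.items()}

    def __getitem__(self, uid: str) -> Tuple[str, Any]:
        return uid, self.examples[uid]


class ToyMultiTaskDataset:
    """Hypothetical top-level dataset that dispatches lookups to sub-datasets."""

    def __init__(self, datasets: List[ToySubDataset]):
        self.datasets = datasets
        # Map every prefixed example ID to the sub-dataset that owns it.
        self.iterator_map: Dict[str, ToySubDataset] = {}
        for dataset in datasets:
            for uid in dataset.examples:
                self.iterator_map[uid] = dataset

    def __getitem__(self, uid: str) -> Tuple[str, Any]:
        # Find the owning sub-dataset, then delegate the lookup to it.
        return self.iterator_map[uid][uid]

    def __len__(self) -> int:
        return len(self.iterator_map)


asr = ToySubDataset("asr", {"utt1": "hello"})
tts = ToySubDataset("tts", {"utt1": "world"})
multi = ToyMultiTaskDataset([asr, tts])
uid, data = multi["tts_utt1"]
```

Both toy tasks contain an example called `utt1`, yet the prefixed IDs keep them distinct, which is the role the `iterator_map` attribute plays in the description above.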

key_dict

A dictionary mapping example IDs to None, used for filtering examples based on a key file.

Type: Optional[Dict[str, None]]

iterator_map

A mapping of example IDs to their respective datasets.

Type: Dict[str, EspnetSpeechLMDataset]

datasets

A list of dataset instances.

Type: List[EspnetSpeechLMDataset]
Parameters:
- path_name_type_list (Collection[Tuple[str, str, str]]) – A collection of tuples, each containing the path to the dataset, a name, and the type of dataset.
- key_file (str, optional) – A path to a file containing keys to filter examples. Defaults to None.
- **kwargs – Additional keyword arguments to pass to the dataset constructors.
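How a key file can restrict the set of examples is sketched below. This is a hedged illustration, not the library's code: `read_key_dict` is a hypothetical helper, and the file layout (one example ID as the first whitespace-separated field on each line) is an assumption.

```python
import io
from typing import Dict, Optional, TextIO


def read_key_dict(key_stream: Optional[TextIO]) -> Optional[Dict[str, None]]:
    """Build an {example_id: None} mapping from a key file, or None if no file."""
    if key_stream is None:
        return None
    key_dict: Dict[str, None] = {}
    for line in key_stream:
        fields = line.split()
        if fields:  # skip blank lines
            # Assumption: the example ID is the first whitespace-separated field.
            key_dict[fields[0]] = None
    return key_dict


# An in-memory stand-in for a key file on disk.
key_file = io.StringIO("asr_utt1 some/path.wav\n\ntts_utt7\n")
key_dict = read_key_dict(key_file)

all_ids = ["asr_utt1", "asr_utt2", "tts_utt7"]
# With no key file every example is kept; otherwise only the listed IDs survive.
kept = [uid for uid in all_ids if key_dict is None or uid in key_dict]
```

Storing the keys in a dictionary with None values gives constant-time membership tests while preserving insertion order, which matches the `Optional[Dict[str, None]]` type documented for `key_dict`.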

######### Examples

>>> dataset = ESPnetMultiTaskDataset(
...     path_name_type_list=[
...         ('dataset1.json', 'dataset1', 'dataset_json'),
...         ('dataset2.json', 'dataset2', 'dataset_json')
...     ],
...     key_file='keys.txt'
... )
>>> uid, data = dataset['task_example_id']

Raises: AssertionError – If a triplet in path_name_type_list is not of type "dataset_json".
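The kind of check that produces this error can be sketched as follows. `check_triplets` is a hypothetical helper written for illustration and the error message is invented; only the `dataset_json` type string comes from this page.

```python
from typing import Collection, Tuple


def check_triplets(path_name_type_list: Collection[Tuple[str, str, str]]) -> None:
    """Reject any (path, name, type) triplet whose type is not 'dataset_json'."""
    for path, name, data_type in path_name_type_list:
        assert data_type == "dataset_json", (
            f"unsupported type {data_type!r} for {name!r} ({path})"
        )


# Passes silently: every triplet uses the accepted type.
check_triplets(
    [
        ("dataset1.json", "dataset1", "dataset_json"),
        ("dataset2.json", "dataset2", "dataset_json"),
    ]
)
```

Running a check like this on your own triplets before constructing the dataset surfaces a mistyped entry early, with the offending name and path in the message.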

NOTE

The example_list is used for sub-datasets without a task prefix.

has_name(name) → bool

Checks if the given name is present in the dataset.

Parameters: name (str) – The name to check for existence in the dataset.
Returns: True if the name exists in the dataset, False otherwise.
Return type: bool

######### Examples

>>> dataset = ESPnetMultiTaskDataset(...)
>>> dataset.has_name('example_name')
True
>>> dataset.has_name('nonexistent_name')
False

names() → Tuple[str, ...]

ESPnetMultiTaskDataset is the top-level Dataset object that manages multiple EspnetSpeechLMDataset objects, each serving a specific task and dataset. This object queries all these EspnetSpeechLMDataset instances and combines examples from different tasks for multi-task training. Typically, this dataset is used in ESPnet SpeechLM models.

See details in: <espnet>/egs2/TEMPLATE/speechlm1#data-loading-and-preprocessing

key_dict

A dictionary mapping example keys to None, used for filtering examples based on a key file.

Type: dict

iterator_map

A mapping from example identifiers (with task prefixes) to their corresponding dataset instances.

Type: dict

datasets

A list of EspnetSpeechLMDataset instances.

Type: list
Parameters:
- path_name_type_list (Collection[Tuple[str, str, str]]) – A collection of tuples, each containing the path to a dataset, a name, and a type.
- key_file (str, optional) – A path to a key file for filtering examples. Defaults to None.
- **kwargs – Additional keyword arguments to pass to the dataset constructor.
Returns: None

######### Examples

>>> dataset = ESPnetMultiTaskDataset(
...     path_name_type_list=[
...         ("path/to/dataset1.json", "dataset1", "dataset_json"),
...         ("path/to/dataset2.json", "dataset2", "dataset_json"),
...     ],
...     key_file="path/to/key_file.txt"
... )
>>> uid, data = dataset["task1_example_id"]
>>> print(data)

Raises:
- AssertionError – If a triplet in path_name_type_list is not of type “dataset_json”.
- FileNotFoundError – If the specified key file or dataset JSON file is not found.

NOTE

This class provides an interface for managing multiple datasets in a structured manner, allowing for efficient data retrieval and processing across different tasks.