espnet2.train.iterable_dataset.SplicedIterableESPnetDataset

class espnet2.train.iterable_dataset.SplicedIterableESPnetDataset(path_name_type_list: Collection[Tuple[str, str, str]], preprocess: Callable[[str, Dict[str, ndarray]], Dict[str, ndarray]] | None = None, key_file: str | None = None, **kwargs)

Bases: IterableDataset

A data iterator spliced from multiple IterableESPnetDataset instances.

This class enables the combination of multiple IterableESPnetDataset instances into a single iterable dataset. It facilitates handling multiple data sources while maintaining a consistent interface.
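The idea of presenting several sources through one iterable can be illustrated with plain Python. This is only a conceptual sketch, not the ESPnet implementation: the `splice` helper and the two in-memory sources are hypothetical, and visiting the sources one after another is an assumption, since this page does not say whether the real class chains or interleaves them.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple

Example = Tuple[str, Dict[str, list]]


def splice(sources: List[Iterable[Example]]) -> Iterator[Example]:
    """Yield every (uid, data) pair from each source, one source after another."""
    return chain.from_iterable(sources)


# Two hypothetical per-task sources standing in for IterableESPnetDataset objects.
asr_source = [("asr_utt1", {"speech": [0.1, 0.2]}), ("asr_utt2", {"speech": [0.3]})]
tts_source = [("tts_utt1", {"text": [5, 6, 7]})]

# The consumer sees a single stream of (uid, data) pairs.
uids = [uid for uid, _ in splice([asr_source, tts_source])]
print(uids)
```

The consumer loop stays the same regardless of how many sources are combined, which is the "consistent interface" described above.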

data_iterators

List of data iterators from IterableESPnetDataset.

Type: List[IterableESPnetDataset]

task_map

Mapping of each dataset iterator to its associated task name.

Type: Dict[IterableESPnetDataset, str]

speaker_prompt_config

Configuration for speaker prompts, if applicable.

Type: Dict[IterableESPnetDataset, Dict]

Parameters:
- path_name_type_list (Collection[Tuple[str, str, str]]) – A collection of (path, name, type) tuples, where path points to a JSON file and type must be “json”.
- preprocess (Callable[[str, Dict[str, np.ndarray]], Dict[str, np.ndarray]], optional) – Preprocessing function that takes a string and a dictionary and returns a modified dictionary.
- key_file (str , optional) – An optional path to a key file containing valid keys for data examples.
- **kwargs – Additional keyword arguments to be passed to the IterableESPnetDataset.
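A function suitable for the preprocess argument only has to accept a string and a dictionary and return a dictionary. The sketch below builds one such function. The `make_field_filter` helper is hypothetical, the string argument is assumed to be the utterance ID, and plain lists stand in for the np.ndarray values so that the snippet has no dependencies.

```python
from typing import Any, Callable, Dict, Iterable


def make_field_filter(
    keep: Iterable[str],
) -> Callable[[str, Dict[str, Any]], Dict[str, Any]]:
    """Build a preprocess-style function that keeps only the named fields."""
    keep_set = frozenset(keep)

    def preprocess(uid: str, data: Dict[str, Any]) -> Dict[str, Any]:
        # Fail early, naming the example, if a required field is absent.
        missing = keep_set.difference(data)
        if missing:
            raise KeyError(f"{uid}: missing fields {sorted(missing)}")
        # Return a new dictionary; the input dictionary is left untouched.
        return {name: value for name, value in data.items() if name in keep_set}

    return preprocess


my_preprocess_function = make_field_filter(["speech"])
filtered = my_preprocess_function("utt1", {"speech": [0.1, 0.2], "text": [3, 4]})
print(filtered)
```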

########### Examples

>>> dataset = SplicedIterableESPnetDataset(
...     path_name_type_list=[('data.json', 'task_name', 'json')],
...     preprocess=my_preprocess_function,
...     key_file='key_file.txt'
... )
>>> for uid, data in dataset:
...     print(uid, data)

NOTE

The input JSON files must follow a specific structure, containing a “data_files” key that lists the paths to data files and their modalities and types. Additionally, the “examples” key should list valid keys for the dataset.
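As a rough illustration of the structure described in this note, the snippet below writes a JSON file with the two required keys and reads it back. The file paths, modality names, and example keys are hypothetical, and the comma-separated "path,modality,type" layout of each "data_files" entry is an assumption that should be checked against the ESPnet source.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical dataset description. The "path,modality,type" string layout
# of each "data_files" entry is an assumption, not confirmed by this page.
dataset_description = {
    "data_files": [
        "dump/raw/train/wav.scp,speech,sound",
        "dump/raw/train/token_int,text,text_int",
    ],
    "examples": ["utt_0001", "utt_0002"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "data.json"
    json_path.write_text(json.dumps(dataset_description, indent=2), encoding="utf-8")

    # Read the file back and check the two keys that the note requires.
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
    missing_keys = {"data_files", "examples"}.difference(loaded)

print(sorted(missing_keys))
```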

Raises:
- AssertionError – If any triplet in path_name_type_list is not of type “json”.
- RuntimeError – If there are issues reading the files or mismatched keys.

has_name(name) → bool

Check if a given name exists in the dataset.

This method checks if the specified name is present in the debug information dictionary of the dataset. This can be useful for verifying the presence of a particular data key before attempting to access it.

Parameters: name (str) – The name to check for existence in the dataset.
Returns: True if the name exists in the dataset, False otherwise.
Return type: bool

########### Examples

>>> dataset = SplicedIterableESPnetDataset([('path/to/data.json', 'data', 'json')])
>>> dataset.has_name('data')
True
>>> dataset.has_name('nonexistent_name')
False

names() → Tuple[str, ...]

Iterable dataset module.

This module provides the implementation of the IterableESPnetDataset and SplicedIterableESPnetDataset classes, which are designed for use with ESPnet, a toolkit for end-to-end speech processing.

The IterableESPnetDataset class represents an iterable dataset that can load data from various sources defined in a list of tuples containing path, name, and data type. The SplicedIterableESPnetDataset class allows for splicing multiple IterableESPnetDataset instances, enabling multi-task training.

DATA_TYPES

A dictionary mapping data types to their respective loading functions.

Type: dict
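A mapping from data type names to loading functions is a common dispatch pattern. The sketch below shows the pattern in isolation. The `LOADERS` registry, its two parser functions, and `load_value` are hypothetical stand-ins; the actual keys and loaders in DATA_TYPES may differ.

```python
from typing import Any, Callable, Dict, List


def parse_text(value: str) -> str:
    """Return the text with surrounding whitespace removed."""
    return value.strip()


def parse_text_int(value: str) -> List[int]:
    """Parse whitespace-separated integers, e.g. token IDs."""
    return [int(token) for token in value.split()]


# Hypothetical registry in the spirit of DATA_TYPES.
LOADERS: Dict[str, Callable[[str], Any]] = {
    "text": parse_text,
    "text_int": parse_text_int,
}


def load_value(data_type: str, value: str) -> Any:
    """Look up the loader registered for data_type and apply it to value."""
    try:
        loader = LOADERS[data_type]
    except KeyError:
        raise ValueError(f"Unsupported data type: {data_type!r}") from None
    return loader(value)


tokens = load_value("text_int", "3 14 15")
print(tokens)
```

Supporting a new data type then only requires registering another function under a new key.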

########### Examples

>>> dataset = IterableESPnetDataset([('wav.scp', 'input', 'sound'),
...                                  ('token_int', 'output', 'text_int')],
...                                )
>>> for uid, data in dataset:
...     data
{'input': per_utt_array, 'output': per_utt_array}

post_process(data: Dict, iterator: IterableESPnetDataset)

Post-processes the data after loading from the iterator.

This method modifies the input data dictionary by adding a speaker prompt if configured. It ensures that the speaker prompt is a dummy prompt with the correct length.

Parameters:
- data (Dict) – A dictionary containing the data loaded from the dataset.
- iterator (IterableESPnetDataset) – The iterator from which the data was loaded, used to access speaker prompt configuration.
Returns: The modified data dictionary with the added speaker prompt.
Return type: Dict

########### Examples

>>> dataset = SplicedIterableESPnetDataset([...])
>>> for uid, data in dataset:
...     processed_data = dataset.post_process(data, dataset.data_iterators[0])
...     print(processed_data)

NOTE

The speaker prompt configuration is expected to be present in the speaker_prompt_config attribute of the instance.
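The behavior described for post_process can be approximated by the standalone sketch below. It is not the ESPnet code: the `add_dummy_speaker_prompt` function, the "length" config key, and the "speaker_prompt" field name are all hypothetical, and a zero-filled list stands in for whatever dummy prompt the real method creates.

```python
from typing import Any, Dict, Optional


def add_dummy_speaker_prompt(
    data: Dict[str, Any], prompt_config: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Add a zero-filled placeholder prompt to data when a config is given."""
    if not prompt_config:
        # No speaker prompt configured for this iterator: leave data unchanged.
        return data
    length = int(prompt_config["length"])  # hypothetical config key
    data["speaker_prompt"] = [0.0] * length  # hypothetical field name
    return data


example = {"text": [12, 7, 3]}
processed = add_dummy_speaker_prompt(example, {"length": 4})
print(processed["speaker_prompt"])
```

Like the method described above, this sketch modifies the input dictionary in place and returns it.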