espnet2.iterators.chunk_iter_factory.ChunkIterFactory

class espnet2.iterators.chunk_iter_factory.ChunkIterFactory(dataset, batch_size: int, batches: AbsSampler | Sequence[Sequence[Any]], chunk_length: int | str, chunk_shift_ratio: float = 0.5, num_cache_chunks: int = 1024, num_samples_per_epoch: int | None = None, seed: int = 0, shuffle: bool = False, num_workers: int = 0, collate_fn=None, pin_memory: bool = False, excluded_key_prefixes: List[str] | None = None, discard_short_samples: bool = True, default_fs: int | None = None, chunk_max_abs_length: int | None = None)

Bases: AbsIterFactory

Creates chunks from a sequence.

This class implements a chunk iterator factory that generates chunks from a sequence dataset. It supports varying chunk lengths and can discard short samples based on user-defined parameters.

Examples

>>> batches = [["id1"], ["id2"], ...]
>>> batch_size = 128
>>> chunk_length = 1000
>>> iter_factory = ChunkIterFactory(dataset, batch_size=batch_size,
...                                  batches=batches,
...                                  chunk_length=chunk_length)
>>> it = iter_factory.build_iter(epoch)
>>> for ids, batch in it:
...     ...

Notes

The number of mini-batches varies in each epoch, and it is not possible to know the count in advance because the IterFactory does not receive length information.
Due to this, num_iters_per_epoch cannot be implemented for this iterator. Instead, num_samples_per_epoch is implemented.

batch_size

The size of each mini-batch.

Type: int

chunk_lengths

List of valid chunk lengths.

Type: List[int]

chunk_shift_ratio

The ratio for shifting chunks.

Type: float

chunk_max_abs_length

Maximum absolute length of chunks.

Type: int

num_cache_chunks

Number of cached chunks for processing.

Type: int

excluded_key_pattern

Regex pattern for excluding keys from length consistency checks.

Type: str

discard_short_samples

Whether to discard samples shorter than the shortest chunk length.

Type: bool

default_fs

Default sampling frequency used to decide the chunk length.

Type: Optional[int]

collate_fn

Function for collating batches.

Type: Optional[callable]

Parameters:
- dataset (Any) – The dataset to iterate over.
- batch_size (int) – The size of each mini-batch.
- batches (Union[AbsSampler, Sequence[Sequence[Any]]]) – Batches to process.
- chunk_length (Union[int, str]) – Length of chunks, or a string specifying multiple lengths.
- chunk_shift_ratio (float, optional) – The ratio for shifting chunks. Default is 0.5.
- num_cache_chunks (int, optional) – Number of cached chunks for processing. Default is 1024.
- num_samples_per_epoch (Optional[int], optional) – Default is None.
- seed (int, optional) – Default is 0.
- shuffle (bool, optional) – Default is False.
- num_workers (int, optional) – Default is 0.
- collate_fn (Optional[callable], optional) – Function for collating batches. Default is None.
- pin_memory (bool, optional) – Default is False.
- excluded_key_prefixes (Optional[List[str]], optional) – Key prefixes excluded from length consistency checks. Default is None.
- discard_short_samples (bool, optional) – Whether to discard samples shorter than the shortest chunk length. Default is True.
- default_fs (Optional[int], optional) – Default sampling frequency used to decide the chunk length. Default is None.
- chunk_max_abs_length (Optional[int], optional) – Maximum absolute length of chunks. Default is None.
Raises: ValueError – If chunk_length is empty or not in the expected format.
Yields: Iterator[Tuple[List[str], Dict[str, torch.Tensor]]] – A generator yielding tuples of IDs and batches of tensors.
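
The chunk_length argument accepts either a single integer or a string that names several candidate lengths. The following is a rough sketch of how such a string could be expanded into a chunk_lengths list. It assumes a comma-separated format in which "a-b" denotes an inclusive range (for example "5,8" or "3-5"); the helper name parse_chunk_length is hypothetical and this is not the library's own code.

```python
from typing import List, Union


def parse_chunk_length(chunk_length: Union[int, str]) -> List[int]:
    """Hypothetical helper: expand a chunk-length spec into candidate lengths."""
    if isinstance(chunk_length, int):
        # A single integer means one fixed chunk length.
        return [chunk_length]
    if len(chunk_length) == 0:
        raise ValueError("expected e.g. '5,8' or '3-5', but got an empty string")

    lengths: List[int] = []
    for part in chunk_length.split(","):
        try:
            bounds = [int(x) for x in part.split("-")]
        except ValueError:
            raise ValueError(
                f"expected e.g. '5,8' or '3-5', but got {chunk_length!r}"
            ) from None
        if len(bounds) == 1:
            lengths.append(bounds[0])
        elif len(bounds) == 2:
            # Assumption: "a-b" is an inclusive range of candidate lengths.
            lengths.extend(range(bounds[0], bounds[1] + 1))
        else:
            raise ValueError(
                f"expected e.g. '5,8' or '3-5', but got {chunk_length!r}"
            )
    return lengths


print(parse_chunk_length("3-5,8"))  # [3, 4, 5, 8]
```

An empty string or a malformed piece such as "3-4-5" raises ValueError in this sketch, which mirrors the Raises entry above.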

build_iter(epoch: int, shuffle: bool | None = None) → Iterator[Tuple[List[str], Dict[str, Tensor]]]

Builds an iterator that generates chunks from the dataset.

This method creates an iterator for the dataset, yielding chunks of data based on the specified parameters such as chunk length and shift ratio. It supports varying chunk lengths and can shuffle the data if required.

Parameters:
- epoch (int) – The current epoch number, used for random state initialization.
- shuffle (Optional[bool]) – If True, shuffles the data before yielding. If None, uses the default shuffle setting from the instance.
Yields: Iterator[Tuple[List[str], Dict[str, torch.Tensor]]] – A tuple containing a list of IDs and a dictionary of tensors for the corresponding batch.
Raises: RuntimeError – If the sequences in the batch do not have the same length, excluding those that match the excluded key patterns.
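
The length consistency check can be sketched as follows. This is an illustration only: the helper names build_excluded_pattern and check_lengths are hypothetical, and the sketch assumes that a key is excluded when it equals one of the prefixes or is a prefix followed by digits.

```python
import re
from typing import Dict, List, Optional, Sequence


def build_excluded_pattern(prefixes: Optional[List[str]]) -> Optional[str]:
    """Hypothetical helper: turn key prefixes into one regex pattern."""
    if not prefixes:
        return None
    # Assumption: "prefix" and "prefix<digits>" are both excluded.
    return "|".join(f"(?:{re.escape(p)}[0-9]*)" for p in prefixes)


def check_lengths(batch: Dict[str, Sequence], pattern: Optional[str]) -> None:
    """Raise RuntimeError if non-excluded sequences differ in length."""
    reference = None
    for key, value in batch.items():
        if pattern is not None and re.fullmatch(pattern, key):
            continue  # excluded keys are not compared
        if reference is None:
            reference = len(value)
        elif len(value) != reference:
            raise RuntimeError(
                f"length mismatch for key {key!r}: {len(value)} != {reference}"
            )


pattern = build_excluded_pattern(["text"])
# "text" is excluded, so its different length does not raise.
check_lengths({"speech": [0] * 8, "noise": [0] * 8, "text": [0] * 3}, pattern)
```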

Examples

>>> iter_factory = ChunkIterFactory(dataset, batch_size=128,
...                                  batches=batches, chunk_length=1000)
>>> for ids, batch in iter_factory.build_iter(epoch=1):
...     print(ids, batch)

NOTE

The iterator maintains a cache of chunks to efficiently yield mini-batches. It handles cases where chunks need to be generated with overlapping segments for data augmentation.
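
To make the overlap concrete, here is a simplified sketch of splitting one sequence into fixed-length chunks whose start positions advance by int(chunk_length * chunk_shift_ratio). The function split_into_chunks is hypothetical and works on a plain list; it leaves out the chunk cache, the random choice among several chunk lengths, and shuffling, and it assumes that trailing frames that do not fill a whole chunk are dropped.

```python
from typing import List, Sequence


def split_into_chunks(
    seq: Sequence[int], chunk_length: int, chunk_shift_ratio: float = 0.5
) -> List[List[int]]:
    """Hypothetical helper: cut `seq` into overlapping fixed-length chunks."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if len(seq) < chunk_length:
        # Too short to produce even one chunk.
        return []
    # Distance between the starts of consecutive chunks (at least 1).
    shift = max(1, int(chunk_length * chunk_shift_ratio))
    num_chunks = (len(seq) - chunk_length) // shift + 1
    return [
        list(seq[i * shift : i * shift + chunk_length])
        for i in range(num_chunks)
    ]


chunks = split_into_chunks(list(range(10)), chunk_length=4, chunk_shift_ratio=0.5)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With a ratio of 0.5 each chunk shares half of its frames with the previous one; a ratio of 1.0 would give non-overlapping chunks.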

prepare_for_collate(id_list, batches)

Prepares the data for collation by converting tensors to numpy arrays.

This method takes a list of IDs and corresponding batches, and converts the tensors in the batches to numpy arrays. It is typically used before batching the data together for further processing.

Parameters:
- id_list (List[str]) – A list of identifiers for the data samples.
- batches (Dict[str, List[torch.Tensor]]) – A dictionary where keys are the names of the data fields and values are lists of tensors corresponding to those fields.
Returns: A list of tuples where each tuple contains an identifier and a dictionary of numpy arrays representing the data for that identifier.
Return type: List[Tuple[str, Dict[str, np.ndarray]]]

Examples

>>> import torch
>>> id_list = ['id1', 'id2']
>>> batches = {
...     'feature1': [torch.tensor([[1, 2], [3, 4]]),
...                   torch.tensor([[5, 6], [7, 8]])],
...     'feature2': [torch.tensor([1]), torch.tensor([2])]
... }
>>> result = iter_factory.prepare_for_collate(id_list, batches)
>>> print(result)
[
    ('id1', {'feature1': array([[1, 2], [3, 4]]), 'feature2': array([1])}),
    ('id2', {'feature1': array([[5, 6], [7, 8]]), 'feature2': array([2])})
]
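
The regrouping implied by the output above, from a dictionary of per-field lists to a list of per-sample tuples, can be sketched without any tensor library. The function regroup_for_collate is a hypothetical stand-in that uses nested lists in place of tensors and skips the numpy conversion; it is not the method's actual implementation.

```python
from typing import Any, Dict, List, Tuple


def regroup_for_collate(
    id_list: List[str], batches: Dict[str, List[Any]]
) -> List[Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: pair each ID with its own slice of every field."""
    return [
        (id_, {key: values[i] for key, values in batches.items()})
        for i, id_ in enumerate(id_list)
    ]


id_list = ["id1", "id2"]
batches = {
    "feature1": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    "feature2": [[1], [2]],
}
result = regroup_for_collate(id_list, batches)
print(result[0])  # ('id1', {'feature1': [[1, 2], [3, 4]], 'feature2': [1]})
```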