espnet2.samplers.category_power_sampler.CategoryPowerSampler


class espnet2.samplers.category_power_sampler.CategoryPowerSampler(batch_bins: int, shape_files: Tuple[str, ...] | List[str], min_batch_size: int = 1, max_batch_size: int | None = None, upsampling_factor: float = 1.0, dataset_scaling_factor: float = 1.2, drop_last: bool = False, category2utt_file: str | None = None, epoch: int = 1, **kwargs)

Bases: AbsSampler

A category-balanced batch sampler with power-law sampling.

Reference: Scaling Speech Technology to 1,000+ Languages, https://arxiv.org/pdf/2305.13516

This sampler constructs mini-batches by balancing samples across categories (e.g., language IDs), using a power-law distribution to control the sampling frequency. Originally developed for language identification, it can be applied to any dataset that provides a mapping from category (e.g., language) to utterances.

Sampling Strategy:

Given:

l ∈ {1, 2, …, L}, the set of category labels
n_l: total duration (number of bins) of category l
N: total duration (number of bins) of all categories in the dataset
β: upsampling factor
k_l: the number of utterances in category l

We define:

Category-level sampling probability: P(l) = (n_l / N)^β
Utterance-level conditional sampling: P(x | l) = 1 / k_l
Combined sampling probability: P(x) = P(l) * P(x | l) = (n_l / N)^β * (1 / k_l)

Where β ∈ [0, 1] is the upsampling_factor:

β → 0 flattens the category distribution, emphasizing low-resource categories (strong upsampling)
β → 1 samples categories in proportion to their duration, which approximates uniform sampling over all utterances
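
The formulas above can be checked numerically with a small sketch. This is not the ESPnet implementation: the function name `category_power_weights`, the category labels, and the durations are made up for illustration. Note that for β < 1 the values of P(l) as written do not sum to one, so the sketch treats them as relative weights rather than normalized probabilities.

```python
from typing import Dict


def category_power_weights(
    durations: Dict[str, float],
    utt_counts: Dict[str, int],
    beta: float,
) -> Dict[str, Dict[str, float]]:
    """Return P(l) and the per-utterance P(x) for each category.

    Hypothetical helper, not part of espnet2.
    """
    total = sum(durations.values())  # N: total duration over all categories
    result = {}
    for cat, n_l in durations.items():
        p_cat = (n_l / total) ** beta    # P(l) = (n_l / N)^beta
        p_utt = p_cat / utt_counts[cat]  # P(x) = P(l) * (1 / k_l)
        result[cat] = {"p_category": p_cat, "p_utterance": p_utt}
    return result


# Illustrative data: a high-resource and a low-resource category
# with the same average utterance duration.
durations = {"eng": 900.0, "swa": 100.0}
utt_counts = {"eng": 90, "swa": 10}

natural = category_power_weights(durations, utt_counts, beta=1.0)
flattened = category_power_weights(durations, utt_counts, beta=0.5)
```

With β = 1 every utterance in this toy data receives the same weight, while β = 0.5 gives each low-resource utterance a larger weight than each high-resource one.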

Note:

Batches are constructed based on batch_bins, similar to LengthBatchSampler.
Set batch_type=catpow in your configuration to use this sampler.

Parameters:
- batch_bins – The approximate maximum number of bins (e.g., audio samples) in a batch.
- shape_files – A list or tuple of shape file paths. Only one shape file is supported, but the list format is retained for compatibility with other samplers.
- min_batch_size – Minimum number of utterances in a batch.
- max_batch_size – Maximum number of utterances in a batch (recommended for memory safety).
- upsampling_factor – β in the sampling formula; controls how strongly to upsample low-resource categories.
- dataset_scaling_factor – A multiplier that determines the total number of utterances sampled. Values > 1 simulate more frequent use of low-resource utterances across batches. Must be ≥ 1.
- drop_last – Whether to drop the final batch.
- category2utt_file – Path to a file mapping each category to its utterance IDs.
- epoch – Epoch number used to seed the random number generator, so that sampling is reproducible for a given epoch while varying across epochs.
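
How dataset_scaling_factor and epoch might interact can be illustrated with a minimal sketch. It assumes that utterances are drawn with replacement according to their weights and that the number of draws is the dataset size times the scaling factor; both are assumptions for illustration, and `draw_epoch_utterances` is a hypothetical helper, not an espnet2 function.

```python
import random
from typing import List, Sequence


def draw_epoch_utterances(
    utt_ids: Sequence[str],
    weights: Sequence[float],
    dataset_scaling_factor: float,
    epoch: int,
) -> List[str]:
    """Draw a weighted sample of utterance IDs for one epoch.

    Hypothetical sketch; the real sampler may differ in detail.
    """
    if dataset_scaling_factor < 1:
        raise ValueError("dataset_scaling_factor must be >= 1")
    # Seeding with the epoch makes a given epoch reproducible
    # while letting different epochs see different draws.
    rng = random.Random(epoch)
    # Assumption: total draws = dataset size * scaling factor.
    num_draws = round(len(utt_ids) * dataset_scaling_factor)
    # Sampling with replacement, proportional to the given weights.
    return rng.choices(list(utt_ids), weights=list(weights), k=num_draws)


utt_ids = [f"utt{i}" for i in range(10)]
weights = [1.0] * 8 + [3.0] * 2  # last two utterances are upsampled
first = draw_epoch_utterances(utt_ids, weights, 1.2, epoch=1)
again = draw_epoch_utterances(utt_ids, weights, 1.2, epoch=1)
```

Calling the function twice with the same epoch returns the same list, which mirrors the reproducibility described for the epoch parameter.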