espnet2.samplers.category_power_sampler.CategoryPowerSampler
class espnet2.samplers.category_power_sampler.CategoryPowerSampler(batch_bins: int, shape_files: Tuple[str, ...] | List[str], min_batch_size: int = 1, max_batch_size: int | None = None, upsampling_factor: float = 1.0, dataset_scaling_factor: float = 1.2, drop_last: bool = False, category2utt_file: str | None = None, epoch: int = 1, **kwargs)
Bases: AbsSampler
A category-balanced batch sampler with power-law sampling.
Reference: Scaling Speech Technology to 1,000+ Languages, https://arxiv.org/pdf/2305.13516
This sampler constructs mini-batches by balancing samples across categories (e.g., language IDs), using a power-law distribution to control the sampling frequency. Originally developed for language identification, it can be applied to any dataset that provides a mapping from category (e.g., language) to utterances.
Sampling Strategy:
Given:
- l ∈ {1, 2, …, L}: the set of category labels
- n_l: total duration (number of bins) of category l
- N: total duration (number of bins) of all categories in the dataset
- β: upsampling factor
- k_l: the number of utterances in category l
We define:
- Category-level sampling probability: P(l) = (n_l / N)^β
- Utterance-level conditional sampling: P(x | l) = 1 / k_l
- Combined sampling probability: P(x) = P(l) * P(x | l) = (n_l / N)^β * (1 / k_l)
Where β ∈ [0, 1] is the upsampling_factor:
- β → 0 emphasizes low-resource categories (strong upsampling)
- β → 1 approximates uniform sampling over all utterances
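The formulas above can be illustrated with a small standalone sketch. It is not the ESPnet implementation: the function names and the toy data are hypothetical, and the weights (n_l / N)^β are explicitly normalized so that they sum to 1, which the formula as written leaves implicit.

```python
from typing import Dict, List


def category_power_probs(durations: Dict[str, float], beta: float) -> Dict[str, float]:
    """Return category probabilities proportional to (n_l / N) ** beta.

    Normalizing the weights so they sum to 1 is an assumption of this sketch.
    """
    total = sum(durations.values())  # N: total duration over all categories
    weights = {cat: (n / total) ** beta for cat, n in durations.items()}
    norm = sum(weights.values())
    return {cat: w / norm for cat, w in weights.items()}


def utterance_probs(
    category2utt: Dict[str, List[str]], cat_probs: Dict[str, float]
) -> Dict[str, float]:
    """Combine P(l) with the uniform P(x | l) = 1 / k_l."""
    return {
        utt: cat_probs[cat] / len(utts)
        for cat, utts in category2utt.items()
        for utt in utts
    }


# Hypothetical toy data: one high-resource and one low-resource category.
durations = {"eng": 900.0, "swa": 100.0}
category2utt = {
    "eng": [f"eng_{i}" for i in range(9)],
    "swa": ["swa_0"],
}

for beta in (0.0, 0.5, 1.0):
    probs = category_power_probs(durations, beta)
    print(beta, {cat: round(p, 3) for cat, p in probs.items()})
```

With this toy data, β = 1 keeps the original 90/10 split, β = 0 gives both categories equal weight, and β = 0.5 falls in between.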
Note:
- Batches are constructed based on batch_bins, similar to LengthBatchSampler.
- Set batch_type=catpow in your configuration to use this sampler.
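As a rough illustration of bin-based batching, the sketch below greedily packs utterances into batches whose total size stays within batch_bins. It is a simplified assumption of how such packing could work, not the actual logic of this sampler or of LengthBatchSampler; the helper batch_by_bins and the utt2bins mapping are hypothetical, and drop_last is not modeled.

```python
from typing import Dict, List, Optional


def batch_by_bins(
    utt_ids: List[str],
    utt2bins: Dict[str, int],
    batch_bins: int,
    min_batch_size: int = 1,
    max_batch_size: Optional[int] = None,
) -> List[List[str]]:
    """Greedily group utterances so each batch holds roughly batch_bins bins."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bins = 0
    for utt in utt_ids:
        bins = utt2bins[utt]
        # Close the batch if adding this utterance would exceed the budget,
        # but only once the minimum batch size has been reached.
        over_budget = (
            current_bins + bins > batch_bins and len(current) >= min_batch_size
        )
        # Also close the batch once it holds max_batch_size utterances.
        at_cap = max_batch_size is not None and len(current) >= max_batch_size
        if current and (over_budget or at_cap):
            batches.append(current)
            current, current_bins = [], 0
        current.append(utt)
        current_bins += bins
    if current:
        batches.append(current)  # keep the final, possibly smaller, batch
    return batches


# Hypothetical utterance sizes (in bins).
utt2bins = {"a": 40, "b": 50, "c": 30, "d": 80, "e": 10}
print(batch_by_bins(list(utt2bins), utt2bins, batch_bins=100))
```

Because min_batch_size takes priority over the bin budget in this sketch, a batch can exceed batch_bins when min_batch_size is large, which is why the budget is only approximate.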
Parameters:
- batch_bins – The approximate maximum number of bins (e.g., audio samples) in a batch.
- shape_files – A list or tuple of shape file paths. Only one shape file is supported, but the list format is retained for compatibility with other samplers.
- min_batch_size – Minimum number of utterances in a batch.
- max_batch_size – Maximum number of utterances in a batch (recommended for memory safety).
- upsampling_factor – β in the sampling formula; controls how strongly to upsample low-resource categories.
- dataset_scaling_factor – A multiplier that determines the total number of utterances sampled. Values > 1 simulate more frequent use of low-resource utterances across batches. Must be ≥ 1.
- drop_last – Whether to drop the final batch.
- category2utt_file – Path to a file mapping each category to its utterance IDs.
- epoch – Epoch number used to seed the random number generator, so that sampling is reproducible while still varying across epochs.
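How dataset_scaling_factor and epoch might interact can be sketched as follows. This is an illustration under stated assumptions, not the sampler's actual code: the function sample_utterances is hypothetical, utterances are assumed to be drawn with replacement, and the category probabilities are supplied directly as toy values.

```python
import random
from typing import Dict, List


def sample_utterances(
    category2utt: Dict[str, List[str]],
    cat_probs: Dict[str, float],
    dataset_scaling_factor: float = 1.2,
    epoch: int = 1,
) -> List[str]:
    """Draw round(num_utts * dataset_scaling_factor) utterances with replacement."""
    if dataset_scaling_factor < 1:
        raise ValueError("dataset_scaling_factor must be >= 1")
    rng = random.Random(epoch)  # seeding with the epoch makes each epoch reproducible
    num_utts = sum(len(utts) for utts in category2utt.values())
    num_samples = round(num_utts * dataset_scaling_factor)
    categories = list(category2utt)
    weights = [cat_probs[cat] for cat in categories]
    sampled = []
    # First pick a category by its probability, then an utterance uniformly within it.
    for cat in rng.choices(categories, weights=weights, k=num_samples):
        sampled.append(rng.choice(category2utt[cat]))
    return sampled


# Hypothetical toy data and category probabilities.
category2utt = {
    "eng": [f"eng_{i}" for i in range(9)],
    "swa": ["swa_0"],
}
cat_probs = {"eng": 0.75, "swa": 0.25}

epoch_1 = sample_utterances(category2utt, cat_probs, epoch=1)
print(len(epoch_1), epoch_1.count("swa_0"))
```

Calling the function again with the same epoch returns the same sequence, while a different epoch uses a different seed and will generally give a different draw.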
