espnet2.samplers.category_power_sampler.CategoryDatasetPowerSampler
class espnet2.samplers.category_power_sampler.CategoryDatasetPowerSampler(batch_bins: int, shape_files: Tuple[str, ...] | List[str], min_batch_size: int = 1, max_batch_size: int | None = None, category_upsampling_factor: float = 1.0, dataset_upsampling_factor: float = 1.0, dataset_scaling_factor: float = 1.2, drop_last: bool = False, category2utt_file: str | None = None, dataset2utt_file: str | None = None, utt2dataset_file: str | None = None, epoch: int = 1, **kwargs)
Bases: AbsSampler
A category- and dataset-balanced batch sampler with power-law sampling.
Reference: Scaling Speech Technology to 1,000+ Languages, https://arxiv.org/pdf/2305.13516
This sampler is designed for multi-category, multi-dataset training where both category imbalance and dataset imbalance exist. It performs hierarchical sampling: (1) balancing categories (e.g., languages) within each dataset, and (2) balancing datasets themselves.
Sampling Strategy:
Let:
- d ∈ {1, 2, …, D}: the dataset index
- l ∈ {1, 2, …, L_d}: the category index in dataset d
- n_ld: total duration (number of bins) of category l in dataset d
- k_ld: the number of utterances in category l in dataset d
- N_d = ∑_l n_ld: total duration (number of bins) of all categories in dataset d
- M = ∑_d N_d: total duration (number of bins) of all categories across all datasets
Step 1 – Category-level sampling within each dataset:
P(l | d) ∝ (n_ld / N_d)^β_L
where β_L (category_upsampling_factor) controls how strongly low-resource categories (e.g., languages) are upsampled within each dataset. The normalized probability becomes:
P(l | d) = (n_ld / N_d)^β_L / ∑_l′ (n_l′d / N_d)^β_L
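Step 1 can be sketched directly from the per-category bin counts. The snippet below is illustrative only, not the ESPnet implementation; the dictionary of counts and the language labels are hypothetical:

```python
# Illustrative sketch of Step 1 (not the ESPnet implementation):
# power-law category probabilities within a single dataset.

def category_probs(n_ld, beta_L):
    """P(l | d) proportional to (n_ld / N_d)^beta_L, normalized over l."""
    N_d = sum(n_ld.values())
    weights = {l: (n / N_d) ** beta_L for l, n in n_ld.items()}
    Z = sum(weights.values())
    return {l: w / Z for l, w in weights.items()}

# Hypothetical example: one high-resource and one low-resource language.
probs = category_probs({"en": 900, "sw": 100}, beta_L=0.5)
# With beta_L < 1, "sw" receives a larger share than its raw 10% of bins.
```

With β_L = 0.5 the low-resource share rises from 0.10 to 0.25, which is exactly the flattening effect the factor is meant to provide.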
Step 2 – Dataset-level sampling based on the resampled category distributions:
For each dataset d, the resampled number of bins for category l is:
n_ld′ = N_d × P(l | d)
Since the category probabilities sum to 1 within each dataset (∑_l P(l | d) = 1), the total resampled bins N_d′ for dataset d is:
N_d′ = ∑_l n_ld′ = N_d
The probability of sampling dataset d is then:
P(d) = (N_d / M)^β_D / ∑_d (N_d / M)^β_D
where β_D is the dataset_upsampling_factor.
Final utterance sampling probability:
P(x) = P(d) × P(l | d) × P(x | l, d), where P(x | l, d) = 1 / k_ld
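The full hierarchy can be combined into a single per-utterance probability. The following is a minimal sketch mirroring the formulas above, not the sampler's actual code; `bins[d][l]` plays the role of n_ld and `counts[d][l]` of k_ld, and both are assumed inputs rather than part of the ESPnet API:

```python
# Minimal sketch of P(x) = P(d) * P(l | d) * P(x | l, d).
# `bins` and `counts` are hypothetical nested dicts, not ESPnet structures.

def power_normalize(shares, beta):
    """Raise each share to the power beta and renormalize to sum to 1."""
    weights = {k: s ** beta for k, s in shares.items()}
    Z = sum(weights.values())
    return {k: w / Z for k, w in weights.items()}

def utterance_probs(bins, counts, beta_L, beta_D):
    """Per-(dataset, category) probability of drawing any single utterance."""
    N = {d: sum(cats.values()) for d, cats in bins.items()}  # N_d
    M = sum(N.values())
    P_d = power_normalize({d: N_d / M for d, N_d in N.items()}, beta_D)
    probs = {}
    for d, cats in bins.items():
        P_l = power_normalize({l: n / N[d] for l, n in cats.items()}, beta_L)
        for l, k_ld in counts[d].items():
            # Each of the k_ld utterances in (l, d) is equally likely.
            probs[(d, l)] = P_d[d] * P_l[l] / k_ld
    return probs
```

Summing `probs[(d, l)] * k_ld` over all categories and datasets recovers 1, confirming the three factors form a proper distribution over utterances.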
Note:
- Batches are constructed based on batch_bins, similar to LengthBatchSampler.
- Set batch_type=catpow_balance_dataset to enable this sampler.
- This sampler is particularly useful when combining heterogeneous datasets (e.g., FLEURS + VoxLingua107 + BABEL) with highly imbalanced language and size distributions.
- Parameters:
- batch_bins – The approximate maximum number of bins (e.g., audio samples) in a batch.
- shape_files – A list or tuple of shape file paths. Only one shape file is supported; the list format is retained for compatibility with other samplers.
- min_batch_size – Minimum number of utterances in a batch.
- max_batch_size – Maximum number of utterances in a batch (recommended for memory safety).
- category_upsampling_factor – β_L in the formula; controls per-dataset category balancing.
- dataset_upsampling_factor – β_D in the formula; controls balancing between datasets.
- dataset_scaling_factor – A multiplier that determines the total number of utterances sampled. Values > 1 simulate more frequent use of low-resource utterances across batches. Must be ≥ 1.
- drop_last – Whether to drop the final batch.
- category2utt_file – Path to a file mapping each category to its utterance IDs.
- dataset2utt_file – Path to a file mapping each dataset to its utterance IDs.
- utt2dataset_file – Path to a file mapping each utterance ID to its dataset label.
- epoch – The random seed is derived from the epoch to ensure reproducibility with variation across epochs.
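As a sanity check on the two upsampling factors: β = 1 reproduces the raw proportions, β = 0 flattens them to uniform, and intermediate values interpolate between the two. A small illustration with hypothetical shares (not tied to any real dataset):

```python
# Effect of an upsampling factor beta on a skewed distribution.
# beta = 1 -> proportional sampling; beta = 0 -> uniform sampling.

def power_normalize(shares, beta):
    weights = [s ** beta for s in shares]
    Z = sum(weights)
    return [w / Z for w in weights]

shares = [0.9, 0.1]                           # a hypothetical 90/10 split
proportional = power_normalize(shares, 1.0)   # [0.9, 0.1]
uniform = power_normalize(shares, 0.0)        # [0.5, 0.5]
upsampled = power_normalize(shares, 0.5)      # [0.75, 0.25]
```

This is why values of category_upsampling_factor and dataset_upsampling_factor below 1 boost low-resource categories and datasets without discarding the size signal entirely.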
