Estimation of Large Zipfian Distributions with Sort and Snap

Abstract

We study the estimation of Zipfian distributions under $L_1$ loss, and provide near minimax optimal bounds in several regimes. Specifically, we assume observations arrive from a known alphabet, and with a known decay rate parametrizing the Zipfian, but we do not know a priori which alphabet elements have larger probability than others. We present a novel Sort and Snap estimator, which uses the empirical proportions to sort the alphabet, and then snaps them to the associated term from the Zipfian distribution. For arbitrary decay rates and smaller alphabet sizes, as well as for large decay rates and large alphabet sizes, we show an exact or minor variant of this estimator is near minimax optimal and has exponential improvement over the standard empirical proportion estimator. However, for small decay rates and larger alphabet sizes a simulation study indicates the standard empirical proportion estimator is competitive with Sort and Snap procedures. In addition to providing nearly tight bounds for important high-dimensional estimation problems, we believe the Sort and Snap estimator, and its analysis, will have independent interest.

Cite

Text

Jacobs et al. "Estimation of Large Zipfian Distributions with Sort and Snap." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.

Markdown

[Jacobs et al. "Estimation of Large Zipfian Distributions with Sort and Snap." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.](https://mlanthology.org/aistats/2025/jacobs2025aistats-estimation/)

BibTeX

@inproceedings{jacobs2025aistats-estimation,
  title     = {{Estimation of Large Zipfian Distributions with Sort and Snap}},
  author    = {Jacobs, Peter Matthew and Bhattacharya, Anirban and Pati, Debdeep and Patel, Lekha and Phillips, Jeff M.},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  year      = {2025},
  pages     = {190-198},
  volume    = {258},
  url       = {https://mlanthology.org/aistats/2025/jacobs2025aistats-estimation/}
}