On the Null Distribution of the Precision and Recall Curve

Abstract

Precision-recall curves (pr-curves) and the associated area under the curve (AUPRC) are commonly used to assess the accuracy of information retrieval (IR) algorithms. An informative baseline is random selection: the associated probability distribution makes it possible to assess the significance of a pr-curve (as a p-value relative to the null hypothesis of random selection). To our knowledge, no analytical expression of the null distribution of empirical pr-curves is available, and the only measure of significance used in the literature relies on non-parametric Monte Carlo simulations. In this paper, we analytically derive the expected null pr-curve and AUPRC for different interpolation strategies. We also derive the AUPRC variance and use it to propose a continuous approximation to the null AUPRC distribution, based on the beta distribution. Properties of the empirical pr-curve and common interpolation strategies are also discussed.
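For orientation, the Monte Carlo baseline mentioned above can be sketched as follows. This is a minimal illustration, not the paper's analytical derivation. It assumes the non-interpolated average-precision estimator as the AUPRC (only one of several possible interpolation strategies), and the function names `average_precision`, `null_auprc_samples`, and `empirical_p_value` are hypothetical.

```python
import random


def average_precision(labels):
    """AUPRC estimate for a ranked list of 0/1 relevance labels (best first).

    Assumption: uses the non-interpolated average-precision estimator,
    i.e. the mean of the precision values at each retrieved positive.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def null_auprc_samples(n_items, n_pos, n_draws=2000, seed=0):
    """Draw AUPRC values under the null of random selection.

    Each draw shuffles the labels uniformly at random, which mimics a
    ranking that carries no information about relevance.
    """
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * (n_items - n_pos)
    samples = []
    for _ in range(n_draws):
        rng.shuffle(labels)
        samples.append(average_precision(labels))
    return samples


def empirical_p_value(observed, samples):
    """One-sided Monte Carlo p-value with the usual add-one correction."""
    exceed = sum(1 for s in samples if s >= observed)
    return (exceed + 1) / (len(samples) + 1)


if __name__ == "__main__":
    null = null_auprc_samples(n_items=100, n_pos=10)
    observed = average_precision([1, 1, 0, 1] + [0] * 90 + [1] * 6)
    print("observed AUPRC:", round(observed, 3))
    print("Monte Carlo p-value:", empirical_p_value(observed, null))
```

The resolution of such a p-value is limited by the number of draws (it can never fall below `1 / (n_draws + 1)`), which is one practical motivation for an analytical or continuous approximation to the null distribution.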

Cite

Text

Lopes and Bontempi. "On the Null Distribution of the Precision and Recall Curve." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2014. doi:10.1007/978-3-662-44851-9_21

Markdown

[Lopes and Bontempi. "On the Null Distribution of the Precision and Recall Curve." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2014.](https://mlanthology.org/ecmlpkdd/2014/lopes2014ecmlpkdd-null/) doi:10.1007/978-3-662-44851-9_21

BibTeX

@inproceedings{lopes2014ecmlpkdd-null,
  title     = {{On the Null Distribution of the Precision and Recall Curve}},
  author    = {Lopes, Miguel and Bontempi, Gianluca},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2014},
  pages     = {322-337},
  doi       = {10.1007/978-3-662-44851-9_21},
  url       = {https://mlanthology.org/ecmlpkdd/2014/lopes2014ecmlpkdd-null/}
}