Experts Don't Cheat: Learning What You Don't Know by Predicting Pairs
Abstract
Identifying how much a model $\hat{p}_{\scriptscriptstyle{Y|X}}^{\theta}$ knows about the stochastic real-world process $p_{\scriptscriptstyle{Y|X}}$ it was trained on is important to ensure it avoids producing "hallucinated" answers or taking unsafe actions, but this is difficult for generative models because probabilistic predictions do not distinguish between per-response noise (aleatoric uncertainty) and lack of knowledge about the process (epistemic uncertainty). We propose a general strategy for decomposing these: train a model to predict *pairs* of independent responses drawn from the true distribution, allow it to "cheat" by observing one response while predicting the other, then measure how much it cheats. We prove that this strategy incentivizes models to become *second-order calibrated*, which allows you to both accurately estimate the gap between $\hat{p}_{\scriptscriptstyle{Y|X}}^{\theta}$ and $p_{\scriptscriptstyle{Y|X}}$ and construct decoding algorithms with bounded probability of generating an incorrect statement. Empirically, we show that our strategy outperforms other filtering methods on a synthetic language modeling task (describing digits of $\pi$).
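To make the "measure how much it cheats" step concrete, here is a minimal sketch, not the paper's exact estimator: it assumes a pair predictor that, for a fixed query $x$, outputs a joint probability table over two discrete responses, and the function name `cheating_score` and the choice of total-variation distance are illustrative assumptions.

```python
import numpy as np

def cheating_score(joint: np.ndarray) -> float:
    """Estimate epistemic uncertainty from a learned joint over response pairs.

    `joint[i, j]` is the model's probability of (Y1 = i, Y2 = j) for a fixed
    query x. The name and the distance used below are illustrative assumptions,
    not the paper's exact estimator.
    """
    marginal_y1 = joint.sum(axis=1)   # model's p(Y1 = i | x)
    marginal_y2 = joint.sum(axis=0)   # model's p(Y2 = j | x)

    score = 0.0
    for j, p_y2 in enumerate(marginal_y2):
        if p_y2 == 0.0:
            continue
        cond_y1 = joint[:, j] / p_y2  # "cheating" prediction p(Y1 = i | x, Y2 = j)
        # Total-variation distance between the cheating conditional and the
        # marginal, weighted by how likely the observed second response is.
        score += p_y2 * 0.5 * np.abs(cond_y1 - marginal_y1).sum()
    return score

# Example: a pair predictor that knows the answer vs. one that copies the peeked response.
confident = np.outer([0.9, 0.1], [0.9, 0.1])   # Y1 and Y2 independent given x
unsure = np.array([[0.5, 0.0], [0.0, 0.5]])    # prediction for Y1 tracks the observed Y2
print(cheating_score(confident))  # ~0.0: no cheating, low epistemic uncertainty
print(cheating_score(unsure))     # 0.5: heavy cheating, high epistemic uncertainty
```

If the pair predictor has truly learned $p_{\scriptscriptstyle{Y|X}}$, the two responses are conditionally independent given $x$, so observing $Y_2$ does not shift its prediction for $Y_1$ and the score is zero; a large score indicates the model relies on the peeked-at response and therefore lacks knowledge about the process.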
Cite
Text
Johnson et al. "Experts Don't Cheat: Learning What You Don't Know by Predicting Pairs." ICLR 2024 Workshops: R2-FM, 2024.
Markdown
[Johnson et al. "Experts Don't Cheat: Learning What You Don't Know by Predicting Pairs." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/johnson2024iclrw-experts/)
BibTeX
@inproceedings{johnson2024iclrw-experts,
  title     = {{Experts Don't Cheat: Learning What You Don't Know by Predicting Pairs}},
  author    = {Johnson, Daniel D. and Tarlow, Daniel and Duvenaud, David and Maddison, Chris J.},
  booktitle = {ICLR 2024 Workshops: R2-FM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/johnson2024iclrw-experts/}
}