Implementability of Information Elicitation Mechanisms with Pre-Trained Language Models

Abstract

As language models become increasingly sophisticated, ensuring the faithfulness of their outputs to the input and the consistency of their reasoning across outputs is a critical challenge. To address the scalability issues in overseeing these aspects, we propose a novel approach based on information-theoretic measures for detecting manipulated or unfaithful reasoning. Specifically, we introduce a Difference of Entropies (DoE) estimator to quantify the difference in mutual information between outputs, providing a principled way to identify low-quality or inconsistent content. We theoretically analyze the DoE estimator, proving its incentive-compatibility properties and deriving scaling laws for f-mutual information as a function of sample size. Motivated by the theory, we implement the estimator using an LLM on a simple machine translation task and on a dataset of peer reviews from ICLR 2023, considering various manipulation types. Across these scenarios, the DoE estimator consistently assigns higher scores to unmodified reviews than to manipulated ones and correlates with BLEU, demonstrating its effectiveness in ensuring the reliability of language model reasoning. These results highlight the potential of information-theoretic approaches for scalable oversight of advanced AI systems.
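The core idea of a Difference-of-Entropies estimator can be illustrated on discrete data via the identity I(X; Y) = H(Y) − H(Y|X). The sketch below is a toy empirical version for illustration only; the paper's actual estimator is implemented with LLM log-probabilities, and the function names here are hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in nats) of a list of hashable samples."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def doe_mutual_information(xs, ys):
    """Toy Difference-of-Entropies estimate of I(X; Y) = H(Y) - H(Y|X).

    H(Y|X) is estimated by averaging the entropy of Y within each
    X-bucket, weighted by the bucket's empirical probability.
    (Illustrative sketch; the paper uses LLM-based entropy estimates.)
    """
    n = len(xs)
    buckets = {}
    for x, y in zip(xs, ys):
        buckets.setdefault(x, []).append(y)
    h_y_given_x = sum((len(b) / n) * entropy(b) for b in buckets.values())
    return entropy(ys) - h_y_given_x


# When Y is a deterministic copy of X, the estimate recovers H(X):
xs = [0, 0, 1, 1]
print(round(doe_mutual_information(xs, xs), 4))  # ln 2 ≈ 0.6931
```

In the paper's setting, higher DoE scores indicate outputs that carry more mutual information with the input, which is why unmodified reviews score above manipulated ones.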

Cite

Text

Robertson et al. "Implementability of Information Elicitation Mechanisms with Pre-Trained Language Models." ICML 2024 Workshops: TF2M, 2024.

Markdown

[Robertson et al. "Implementability of Information Elicitation Mechanisms with Pre-Trained Language Models." ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/robertson2024icmlw-implementability/)

BibTeX

@inproceedings{robertson2024icmlw-implementability,
  title     = {{Implementability of Information Elicitation Mechanisms with Pre-Trained Language Models}},
  author    = {Robertson, Zachary and Cha, Hannah and Sheha, Andrew and Koyejo, Sanmi},
  booktitle = {ICML 2024 Workshops: TF2M},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/robertson2024icmlw-implementability/}
}