Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Abstract

Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
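
The following is a minimal sketch (not the authors' released evaluation code) of how compositional consistency, as defined in the abstract, might be probed: query the model for an intermediate sub-step, splice the model's own answer back into the composed question, and check whether the final answer matches the one it gives for the original question. The `query_model` callable, the prompt strings, and the `toy_model` stub are all hypothetical stand-ins for an actual LLM API call.

```python
from typing import Callable


def compositional_consistency(
    query_model: Callable[[str], str],
    sub_step_prompt: str,
    compose_prompt: Callable[[str], str],
    direct_prompt: str,
) -> bool:
    """Return True if the model's answer to the composed question is unchanged
    when its own sub-step output is substituted into the prompt."""
    sub_answer = query_model(sub_step_prompt)                 # model's intermediate output
    recomposed_answer = query_model(compose_prompt(sub_answer))  # final answer with substitution
    direct_answer = query_model(direct_prompt)                # final answer to the original question
    return recomposed_answer.strip() == direct_answer.strip()


if __name__ == "__main__":
    # Toy stand-in "model" that answers a two-step arithmetic problem from a lookup
    # table, so the check trivially passes; a real LLM would be queried here instead.
    def toy_model(prompt: str) -> str:
        answers = {
            "What is 17 + 25?": "42",
            "What is 42 * 2?": "84",
            "What is (17 + 25) * 2?": "84",
        }
        return answers.get(prompt, "unknown")

    consistent = compositional_consistency(
        toy_model,
        sub_step_prompt="What is 17 + 25?",
        compose_prompt=lambda sub: f"What is {sub} * 2?",
        direct_prompt="What is (17 + 25) * 2?",
    )
    print("compositionally consistent:", consistent)
```

Hypothetical consistency could be probed analogously, by comparing the model's prediction of what it would answer in another context against its actual answer when placed in that context.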

Cite

Text

Chen et al. "Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs." Transactions on Machine Learning Research, 2024.

Markdown

[Chen et al. "Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/chen2024tmlr-two/)

BibTeX

@article{chen2024tmlr-two,
  title     = {{Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs}},
  author    = {Chen, Angelica and Phang, Jason and Parrish, Alicia and Padmakumar, Vishakh and Zhao, Chen and Bowman, Samuel R. and Cho, Kyunghyun},
  journal   = {Transactions on Machine Learning Research},
  year      = {2024},
  url       = {https://mlanthology.org/tmlr/2024/chen2024tmlr-two/}
}