On Reward Functions for Self-Improving Chain-of-Thought Reasoning Without Supervised Datasets (Abridged Version)

Abstract

Prompting a Large Language Model (LLM) to output Chain-of-Thought (CoT) reasoning improves performance on complex problem-solving tasks. Moreover, several popular approaches exist to "self-improve" the CoT reasoning abilities of LLMs on tasks where supervised (question, answer) datasets are already available. An emerging line of work explores whether self-improvement is possible without these supervised datasets, instead utilizing the same large, unstructured text corpora as used during pretraining. This would overcome the data availability bottleneck present in current self-improvement methods, and open the door towards *compute-only scaling* of language model reasoning ability. We investigate a fundamental question in this line of work: What constitutes a suitable reward function for learning to reason during general language model pretraining? We empirically demonstrate how different functions affect what reasoning is learnt and where reasoning is rewarded. Using these insights, we introduce a novel reward function called Reasoning Advantage (RA) that facilitates self-improving CoT reasoning on free-form question-answering (QA) data, where answers are unstructured and difficult to verify. We also explore the optimization of RA on general unstructured text using offline RL, and our analysis indicates that future work should investigate more powerful optimization algorithms, potentially moving towards more online algorithms that better explore the space of CoT generations.
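The abstract does not spell out how such a reward is computed, but one common way to instantiate a CoT reward on unstructured text is to score a sampled chain-of-thought by how much it raises the model's likelihood of the subsequent answer (or continuation) relative to predicting it without the CoT. The sketch below illustrates that general idea only; it is an assumption for exposition, not the paper's exact definition of Reasoning Advantage, and all function and variable names are hypothetical.

```python
# Illustrative sketch only: scoring a sampled CoT by the per-token
# log-likelihood gain it gives the answer. The paper's precise
# Reasoning Advantage (RA) formulation may differ.

from typing import Sequence
import math


def reasoning_advantage(
    answer_logprobs_with_cot: Sequence[float],
    answer_logprobs_without_cot: Sequence[float],
) -> float:
    """Average log-likelihood gain on the answer tokens from conditioning on the CoT.

    Positive values mean the chain-of-thought made the answer more likely
    under the model; zero or negative values mean it did not help.
    """
    assert len(answer_logprobs_with_cot) == len(answer_logprobs_without_cot)
    n = len(answer_logprobs_with_cot)
    gain = sum(answer_logprobs_with_cot) - sum(answer_logprobs_without_cot)
    return gain / n


# Toy usage: log-probs of a 3-token answer, with and without the CoT prefix.
with_cot = [math.log(0.6), math.log(0.5), math.log(0.7)]
without_cot = [math.log(0.2), math.log(0.3), math.log(0.4)]
print(reasoning_advantage(with_cot, without_cot))  # > 0: the CoT helped
```

Under this kind of formulation, the reward requires no verifiable ground-truth label, which is what makes it applicable to free-form QA data and general pretraining text as described above.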

Cite

Text

Foster et al. "On Reward Functions for Self-Improving Chain-of-Thought Reasoning Without Supervised Datasets (Abridged Version)." NeurIPS 2024 Workshops: LanGame, 2024.

Markdown

[Foster et al. "On Reward Functions for Self-Improving Chain-of-Thought Reasoning Without Supervised Datasets (Abridged Version)." NeurIPS 2024 Workshops: LanGame, 2024.](https://mlanthology.org/neuripsw/2024/foster2024neuripsw-reward/)

BibTeX

@inproceedings{foster2024neuripsw-reward,
  title     = {{On Reward Functions for Self-Improving Chain-of-Thought Reasoning Without Supervised Datasets (Abridged Version)}},
  author    = {Foster, Thomas and Ahmed, Eltayeb and Cook, Jonathan and Lifshitz, Shalev and Rocktäschel, Tim and Foerster, Jakob Nicolaus},
  booktitle = {NeurIPS 2024 Workshops: LanGame},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/foster2024neuripsw-reward/}
}