A Theoretical Understanding of Self-Correction Through In-Context Alignment
Abstract
Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely through self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in-context. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step makes a large difference. We believe these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.
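To make the mechanism in the abstract concrete, here is a minimal sketch of an in-context self-correction loop. This code is not from the paper: `query_model`, `self_evaluate`, the prompt format, and the reward scale are all illustrative assumptions about one way self-examination rewards could drive in-context refinement.

```python
# Hypothetical sketch: the model critiques its own previous response
# (a scalar "reward" from self-examination) and conditions the next
# attempt on all (response, reward) pairs placed in its context.
from typing import Callable, List, Tuple

def self_correct(
    query_model: Callable[[str], str],           # hypothetical LLM completion interface
    self_evaluate: Callable[[str, str], float],  # self-examination reward (assumed in [0, 1])
    prompt: str,
    rounds: int = 3,
) -> str:
    """Iteratively refine a response via in-context self-correction."""
    history: List[Tuple[str, float]] = []
    context = prompt
    for _ in range(rounds):
        response = query_model(context)
        reward = self_evaluate(prompt, response)  # model grades its own output
        history.append((response, reward))
        # Append each attempt with its self-assessed reward so the next
        # generation can align, in-context, toward better-rated responses.
        context = prompt + "".join(
            f"\n[Attempt] {r}\n[Self-check reward] {s:.2f}" for r, s in history
        ) + "\n[Revise, improving on the best attempt]"
    # Return the attempt the model itself rated highest.
    return max(history, key=lambda pair: pair[1])[0]
```

Under the paper's analysis, the in-context step of this loop is exactly what a realistic transformer (softmax attention, multiple heads, and an MLP block) can implement internally when the self-examination rewards are reasonably accurate.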
Cite
Text
Wang et al. "A Theoretical Understanding of Self-Correction Through In-Context Alignment." ICML 2024 Workshops: TF2M, 2024.
Markdown
[Wang et al. "A Theoretical Understanding of Self-Correction Through In-Context Alignment." ICML 2024 Workshops: TF2M, 2024.](https://mlanthology.org/icmlw/2024/wang2024icmlw-theoretical-a/)
BibTeX
@inproceedings{wang2024icmlw-theoretical-a,
  title = {{A Theoretical Understanding of Self-Correction Through In-Context Alignment}},
  author = {Wang, Yifei and Wu, Yuyang and Wei, Zeming and Jegelka, Stefanie and Wang, Yisen},
  booktitle = {ICML 2024 Workshops: TF2M},
  year = {2024},
  url = {https://mlanthology.org/icmlw/2024/wang2024icmlw-theoretical-a/}
}