Do LLMs Internally “know” When They Follow Instructions?
Abstract
Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided guidelines. However, LLMs often fail to follow even simple instructions. To improve instruction-following behavior and prevent undesirable outputs, we need a deeper understanding of how LLMs' internal states relate to these outcomes. Our analysis of LLM internal states revealed a dimension in the input embedding space linked to successful instruction-following. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. This work provides insights into the internal workings of LLMs' instruction-following, paving the way for reliable LLM agents.
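As a rough illustration of the idea described in the abstract — estimating a direction in representation space associated with successful instruction-following and nudging representations along it — here is a minimal, self-contained sketch on synthetic data. It uses a hypothetical linear probe and steering strength `alpha`; it is not the authors' implementation, and all names and values are illustrative assumptions.

```python
# Minimal sketch (synthetic data, NOT the paper's code): fit a linear probe to
# separate "instruction followed" from "instruction ignored" internal states,
# then steer a failing representation along the learned direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim = 64          # stand-in for the model's embedding dimension
n_samples = 500

# Synthetic "internal states": the positive class is shifted along a hidden
# ground-truth direction relative to the negative class.
true_direction = rng.normal(size=hidden_dim)
true_direction /= np.linalg.norm(true_direction)
labels = rng.integers(0, 2, size=n_samples)
states = rng.normal(size=(n_samples, hidden_dim)) + 1.5 * labels[:, None] * true_direction

# The probe's weight vector estimates the "instruction-following" direction.
probe = LogisticRegression(max_iter=1000).fit(states, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Steer one failing representation by moving it along the learned direction.
alpha = 2.0              # steering strength; a hyperparameter in practice
failing_state = states[labels == 0][0]
steered_state = failing_state + alpha * direction

print("P(follow) before steering:", probe.predict_proba(failing_state[None])[0, 1])
print("P(follow) after  steering:", probe.predict_proba(steered_state[None])[0, 1])
```

In the actual setting, the representations would come from an LLM's hidden states rather than synthetic samples, and response quality would need to be checked alongside the instruction-following success rate, as the abstract notes.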
Cite
Text
Heo et al. "Do LLMs Internally “know” When They Follow Instructions?" NeurIPS 2024 Workshops: MINT, 2024.
Markdown
[Heo et al. "Do LLMs Internally “know” When They Follow Instructions?" NeurIPS 2024 Workshops: MINT, 2024.](https://mlanthology.org/neuripsw/2024/heo2024neuripsw-llms/)
BibTeX
@inproceedings{heo2024neuripsw-llms,
  title     = {{Do LLMs Internally ``know'' When They Follow Instructions?}},
  author    = {Heo, Juyeon and Heinze-Deml, Christina and Elachqar, Oussama and Ren, Shirley You and Chan, Kwan Ho Ryan and Nallasamy, Udhyakumar and Miller, Andrew and Narain, Jaya},
  booktitle = {NeurIPS 2024 Workshops: MINT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/heo2024neuripsw-llms/}
}