Language Models Linearly Represent Sentiment

Abstract

Sentiment is a pervasive feature of natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space largely captures the feature across a range of tasks, with one extreme corresponding to positive sentiment and the other to negative. In a causal analysis, we isolate this direction using interventions and show it is causally active in both toy tasks and real-world datasets such as the Stanford Sentiment Treebank. Analyzing the mechanisms that involve this direction, we discover a phenomenon we term the summarization motif: sentiment is not only represented on emotionally charged words but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, ablating the sentiment direction across all tokens drops accuracy from 100% to 62% (vs. a 50% random baseline), while ablating the summarized sentiment direction at comma positions alone produces close to half this effect (reducing accuracy to 82%).
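The directional ablation described above amounts to removing the component of each activation vector along a candidate sentiment direction. Below is a minimal PyTorch sketch of that operation, assuming residual-stream activations of shape `[..., d_model]` and a direction obtained elsewhere (e.g. from a probe or a difference of class means); the function name and the zero-ablation variant are illustrative, not the authors' exact procedure.

```python
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `acts` along `direction`.

    acts:      [..., d_model] activation vectors (e.g. residual stream)
    direction: [d_model] candidate sentiment direction (any norm)
    """
    d = direction / direction.norm()        # normalize to a unit vector
    coeff = acts @ d                        # [...] projection coefficients
    return acts - coeff.unsqueeze(-1) * d   # subtract the component along d

# Illustrative usage with random tensors standing in for model activations.
acts = torch.randn(2, 8, 512)               # (batch, seq, d_model)
direction = torch.randn(512)                # hypothetical sentiment direction
ablated = ablate_direction(acts, direction)

# After ablation, activations have (numerically) no component along d.
assert torch.allclose(ablated @ (direction / direction.norm()),
                      torch.zeros(2, 8), atol=1e-5)
```

In practice such an intervention would be applied inside the forward pass (e.g. via activation hooks) at chosen token positions, such as all tokens or only commas, matching the ablation settings reported in the abstract.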

Cite

Text

Tigges et al. "Language Models Linearly Represent Sentiment." ICML 2024 Workshops: MI, 2024.

Markdown

[Tigges et al. "Language Models Linearly Represent Sentiment." ICML 2024 Workshops: MI, 2024.](https://mlanthology.org/icmlw/2024/tigges2024icmlw-language/)

BibTeX

@inproceedings{tigges2024icmlw-language,
  title     = {{Language Models Linearly Represent Sentiment}},
  author    = {Tigges, Curt and Hollinsworth, Oskar John and Geiger, Atticus and Nanda, Neel},
  booktitle = {ICML 2024 Workshops: MI},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/tigges2024icmlw-language/}
}