Finding Sparse Autoencoder Representations of Errors in CoT Prompting
Abstract
Current large language models often suffer from subtle, hard-to-detect reasoning errors in their intermediate chain-of-thought (CoT) steps. These errors include logical inconsistencies, factual hallucinations, and arithmetic mistakes, all of which compromise trust and reliability. While previous research has applied mechanistic interpretability to improving model outputs, understanding and categorizing the internal reasoning errors themselves remains challenging. The complexity and non-linear nature of CoT sequences call for methods that can uncover the structured patterns hidden within them. As an initial step, we evaluate Sparse Autoencoder (SAE) activations within neural networks to investigate how specific neurons contribute to different types of errors.
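The final sentence describes the core technique: training a sparse autoencoder on a model's internal activations and reading off which learned features fire on erroneous reasoning steps. The minimal PyTorch sketch below illustrates that setup; the dimensions, L1 coefficient, and all variable names are hypothetical illustration choices, not the authors' implementation.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal SAE: an overcomplete ReLU encoder and a linear decoder.
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)  # reconstruct input activations
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity.
    mse = torch.mean((reconstruction - x) ** 2)
    return mse + l1_coeff * features.abs().mean()

# Hypothetical usage: `acts` stands in for activations cached from CoT traces.
sae = SparseAutoencoder(input_dim=768, hidden_dim=4096)
acts = torch.randn(32, 768)
reconstruction, features = sae(acts)
loss = sae_loss(acts, reconstruction, features)
loss.backward()

Once such an SAE is trained, feature activations can be compared between correct and erroneous CoT steps to surface features that correlate with specific error types.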
Cite
Text
Theodorus et al. "Finding Sparse Autoencoder Representations of Errors in CoT Prompting." ICLR 2025 Workshops: BuildingTrust, 2025.
Markdown
[Theodorus et al. "Finding Sparse Autoencoder Representations of Errors in CoT Prompting." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/theodorus2025iclrw-finding/)
BibTeX
@inproceedings{theodorus2025iclrw-finding,
title = {{Finding Sparse Autoencoder Representations of Errors in CoT Prompting}},
author = {Theodorus, Justin and Swaytha, V and Gautam, Shivani and Ward, Adam and Shah, Mahir and Blondin, Cole and Zhu, Kevin},
booktitle = {ICLR 2025 Workshops: BuildingTrust},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/theodorus2025iclrw-finding/}
}