Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Abstract
Pervasive polysemanticity in large language models (LLMs) undermines discrete neuron–concept attribution, posing a significant challenge for model interpretation and control. We systematically analyze both encoder- and decoder-based LLMs across diverse datasets and observe that even the neurons most salient for a specific semantic concept consistently exhibit polysemantic behavior. Importantly, we uncover a consistent pattern: the concept-conditioned activation magnitudes of a neuron form distinct, often Gaussian-like distributions with minimal overlap. Building on this observation, we hypothesize that interpreting and intervening on concept-specific activation ranges can enable more precise interpretability and targeted manipulation in LLMs. To this end, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that localizes concept attribution to activation ranges within a neuron. Extensive empirical evaluations show that range-based interventions enable effective manipulation of target concepts while causing substantially less collateral degradation to auxiliary concepts and overall model performance than neuron-level masking.
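To make the contrast between range-based and neuron-level intervention concrete, here is a minimal sketch in plain Python. It is not the NeuronLens implementation: the range rule (mean ± k standard deviations of the concept-conditioned activations), the zero fill value, the helper names `concept_range`, `range_mask`, and `neuron_mask`, and the activation values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def concept_range(activations, k=2.0):
    """Assumed range rule: mean +/- k * sample std of one neuron's
    activations on a single concept. The abstract does not specify this."""
    mu = mean(activations)
    sigma = stdev(activations)
    return mu - k * sigma, mu + k * sigma


def range_mask(activation, low, high, fill=0.0):
    """Range-based intervention: suppress the activation only when it
    falls inside the target concept's range; leave it untouched otherwise."""
    return fill if low <= activation <= high else activation


def neuron_mask(activation, fill=0.0):
    """Neuron-level baseline: suppress the neuron for every input."""
    return fill


# Hypothetical activations of one polysemantic neuron on two concepts
# (made-up numbers, chosen so the two distributions do not overlap).
concept_a = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
concept_b = [2.9, 3.0, 3.1, 3.0, 2.95, 3.05]

low, high = concept_range(concept_a)

# Range masking removes concept A's activations and preserves concept B's.
range_masked_a = [range_mask(x, low, high) for x in concept_a]
range_masked_b = [range_mask(x, low, high) for x in concept_b]

# Neuron masking removes both, which is the collateral damage at issue.
neuron_masked_b = [neuron_mask(x) for x in concept_b]
```

In this toy setting, the range-based mask zeroes every concept A activation while leaving concept B unchanged, whereas the neuron-level mask zeroes both. With overlapping distributions the separation would be only partial.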
Cite
Text
Haider et al. "Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution." Transactions on Machine Learning Research, 2026.
Markdown
[Haider et al. "Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/haider2026tmlr-neurons/)
BibTeX
@article{haider2026tmlr-neurons,
title = {{Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution}},
author = {Haider, Muhammad Umair and Rizwan, Hammad and Sajjad, Hassan and Ju, Peizhong and Siddique, A.B.},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/haider2026tmlr-neurons/}
}