Understanding Task Representations in Neural Networks via Bayesian Ablation
Abstract
Neural networks are powerful tools for cognitive modeling due to their flexibility and emergent properties. However, interpreting their learned representations remains challenging because of their sub-symbolic semantics. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units to infer their causal contributions to task performance. Using ideas from information theory, we propose a suite of tools and metrics to illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.
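To make the abstract's core idea concrete, here is a minimal sketch of distribution-based ablation: sample Bernoulli masks over a layer's units, measure task performance under each ablation, and contrast performance with and without each unit. This is an illustration under stated assumptions, not the authors' exact formulation; the Bernoulli prior, the `score` function, and the toy `true_importance` vector are all hypothetical stand-ins for a real network and task.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 64        # hidden units in the layer being probed
n_samples = 5000    # Monte Carlo ablation samples
keep_prob = 0.5     # prior probability of keeping each unit

# Toy stand-in for task performance: a sparse set of "important" units drives
# the score. In a real analysis, this would run the network with the mask
# applied to the layer's activations and return accuracy (or negative loss)
# on held-out task data.
true_importance = np.abs(rng.normal(size=n_units)) * (rng.random(n_units) < 0.2)

def score(mask: np.ndarray) -> float:
    """Task performance with units ablated (zeroed) wherever mask is False."""
    return float(true_importance[mask].sum())

# Sample ablation masks from a Bernoulli prior over representational units.
masks = rng.random((n_samples, n_units)) < keep_prob
perf = np.array([score(m) for m in masks])

# Per-unit contribution: expected performance when the unit is kept minus
# expected performance when it is ablated (a simple Monte Carlo contrast).
contrib = np.array([
    perf[masks[:, j]].mean() - perf[~masks[:, j]].mean()
    for j in range(n_units)
])

# A few strongly contributing units suggests a localized code; a flat profile
# across many units suggests a more distributed representation.
top_units = np.argsort(contrib)[::-1][:5]
print("top contributing units:", top_units)
for j in top_units:
    print(f"  unit {j:2d}: est={contrib[j]:.3f}  true={true_importance[j]:.3f}")
```

Under these assumptions, units with large estimated contributions should align with the toy ground-truth importances; how the paper formalizes the inference and the derived information-theoretic metrics is described in the full text.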
Cite
Text
Nam et al. "Understanding Task Representations in Neural Networks via Bayesian Ablation." ICLR 2025 Workshops: Re-Align, 2025.

Markdown
[Nam et al. "Understanding Task Representations in Neural Networks via Bayesian Ablation." ICLR 2025 Workshops: Re-Align, 2025.](https://mlanthology.org/iclrw/2025/nam2025iclrw-understanding/)

BibTeX
@inproceedings{nam2025iclrw-understanding,
title = {{Understanding Task Representations in Neural Networks via Bayesian Ablation}},
author = {Nam, Andrew Joohun and Campbell, Declan Iain and Griffiths, Thomas L. and Cohen, Jonathan D. and Leslie, Sarah-Jane},
booktitle = {ICLR 2025 Workshops: Re-Align},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/nam2025iclrw-understanding/}
}