Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract)
Abstract
Probing classifiers are a technique for understanding and modifying the operation of neural networks: a smaller classifier is trained on the model's internal representations to perform a related probing task. Much as a neural electrode array supports both recording and stimulation, probing classifiers let researchers both read out and edit the internal representation of a neural network. This paper presents an evaluation of the use of probing classifiers to modify the internal hidden state of a chess-playing transformer. We demonstrate that the scale of the intervention vector should decay as a negative exponential of the input length to ensure model outputs remain semantically valid after editing the residual stream activations.
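As a rough illustration of the intervention the abstract describes, the sketch below adds a probe-derived edit direction to a transformer block's residual stream, scaled by a negative exponential of the input length. This is a minimal sketch, assuming a Hugging Face style GPT-2 model; the layer index, the decay constants ALPHA and BETA, and the probe direction are illustrative placeholders, not the authors' exact setup.

# A minimal sketch, assuming a Hugging Face GPT-2 model whose blocks
# expose the residual stream via forward hooks. The edit direction
# `probe_dir` (e.g., a linear probe's weight vector), the layer index,
# and the constants ALPHA and BETA are assumptions for illustration.
import math
import torch

ALPHA = 8.0   # base intervention strength (assumed)
BETA = 0.05   # per-token decay rate (assumed)

def intervention_scale(seq_len: int) -> float:
    """Scale the edit by a negative exponential of the input length."""
    return ALPHA * math.exp(-BETA * seq_len)

def make_residual_edit_hook(probe_dir: torch.Tensor):
    """Return a forward hook that adds a length-scaled probe direction
    to the residual stream output of a transformer block."""
    probe_dir = probe_dir / probe_dir.norm()  # unit-norm edit direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        seq_len = hidden.shape[1]  # (batch, seq, d_model)
        hidden = hidden + intervention_scale(seq_len) * probe_dir
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage (illustrative): attach the hook to one block, generate, detach.
# model = transformers.GPT2LMHeadModel.from_pretrained("gpt2")
# probe_dir = torch.randn(model.config.n_embd)  # stand-in for a trained probe
# handle = model.transformer.h[6].register_forward_hook(
#     make_residual_edit_hook(probe_dir))
# ... run generation, then handle.remove()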
Cite
Text
Davis and Sukthankar. "Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I28.35245
Markdown
[Davis and Sukthankar. "Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract)." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/davis2025aaai-scaling/) doi:10.1609/AAAI.V39I28.35245
BibTeX
@inproceedings{davis2025aaai-scaling,
title = {{Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract)}},
author = {Davis, Austin L. and Sukthankar, Gita},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {29343-29344},
doi = {10.1609/AAAI.V39I28.35245},
url = {https://mlanthology.org/aaai/2025/davis2025aaai-scaling/}
}