Angell, Rico
9 publications
ICLR
2026
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
NeurIPS
2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models