D'Oosterlinck, Karel

3 publications

ICLR 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger

NeurIPSW 2024
Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment
Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, Shikib Mehri

NeurIPS 2022
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior
Eldar D Abraham, Karel D'Oosterlinck, Amir Feder, Yair Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu