ML Anthology
D'Oosterlinck, Karel
3 publications
ICLR 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
NeurIPSW 2024
Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment
Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, Shikib Mehri
NeurIPS 2022
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior
Eldar D Abraham, Karel D'Oosterlinck, Amir Feder, Yair Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu