Menke, Maluna

1 publication

TMLR 2026. Compromising Honesty and Harmlessness in Language Models via Covert Deception Attacks. Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff.