The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics

Abstract

We introduce the \emph{Alignment Trilemma} as a theoretical framework to explain the recursive misalignment observed in contemporary AI alignment methods. Our formulation decomposes misalignment into three interdependent components (direct alignment, capability preservation, and meta-alignment) whose conflicting optimization objectives can trigger cycles of drift. Building on recent work on human-AI adaptation dynamics \citep{Shen2024Bidirectional, Carroll2024DRMDP, Harland2024MORL} and adaptive teaming architectures \citep{Ni2021Adaptive, Mahmood2024Behavior}, we propose a holistic approach that includes a novel metric, the \emph{Alignment Performance Score (APS)}, which captures the overall quality of alignment across these three dimensions. Our insights aim to guide the development of AI systems that co-evolve safely with human partners.

Cite

Text

Raheja and Pochhi. "The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics." ICLR 2025 Workshops: Bi-Align, 2025.

Markdown

[Raheja and Pochhi. "The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics." ICLR 2025 Workshops: Bi-Align, 2025.](https://mlanthology.org/iclrw/2025/raheja2025iclrw-alignment/)

BibTeX

@inproceedings{raheja2025iclrw-alignment,
  title     = {{The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics}},
  author    = {Raheja, Tarun and Pochhi, Nilay},
  booktitle = {ICLR 2025 Workshops: Bi-Align},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/raheja2025iclrw-alignment/}
}