On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Abstract

LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-experts models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
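
To make the mechanism concrete, below is a minimal PyTorch sketch of zero-initialized attention in the LLaMA-Adapter style referenced in the abstract: learnable adaptation prompts are attended to through a separate softmax whose contribution is scaled by a gating factor initialized to zero, so training starts from the pretrained (vanilla) attention. The module name `ZeroInitAttention`, the single-head layout, and parameters such as `num_prompts` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroInitAttention(nn.Module):
    """Single-head attention with a zero-initialized gate on learnable prompts.

    Hypothetical sketch of zero-initialized attention (LLaMA-Adapter style);
    names and shapes are illustrative, not the paper's implementation.
    The causal mask is omitted for brevity.
    """

    def __init__(self, dim: int, num_prompts: int = 10):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        # Learnable adaptation prompt (a "linear" prompt, i.e. free parameters).
        self.prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Gating factor initialized to zero: the prompt has no effect at step 0.
        self.gate = nn.Parameter(torch.zeros(1))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b = x.size(0)
        q = self.wq(x)                                   # (b, n, d)
        k_tok, v_tok = self.wk(x), self.wv(x)            # (b, n, d)
        p = self.prompt.unsqueeze(0).expand(b, -1, -1)   # (b, m, d)
        k_pr, v_pr = self.wk(p), self.wv(p)              # (b, m, d)

        s_tok = q @ k_tok.transpose(-2, -1) * self.scale  # (b, n, n)
        s_pr = q @ k_pr.transpose(-2, -1) * self.scale    # (b, n, m)

        # Softmax is taken separately over token and prompt scores; the prompt
        # branch is rescaled by the zero-initialized gate, so the module
        # reduces to vanilla attention at initialization.
        a_tok = F.softmax(s_tok, dim=-1)
        a_pr = self.gate * F.softmax(s_pr, dim=-1)

        return a_tok @ v_tok + a_pr @ v_pr


# Usage sketch: with the gate at zero, the output equals vanilla
# single-head attention over x.
attn = ZeroInitAttention(dim=64)
out = attn(torch.randn(2, 16, 64))  # (2, 16, 64)
```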

Cite

Text

Diep et al. "On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Diep et al. "On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/diep2025icml-zeroinitialized/)

BibTeX

@inproceedings{diep2025icml-zeroinitialized,
  title     = {{On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation}},
  author    = {Diep, Nghiem Tuong and Nguyen, Huy and Nguyen, Chau and Le, Minh and Nguyen, Duy Minh Ho and Sonntag, Daniel and Niepert, Mathias and Ho, Nhat},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {13713--13745},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/diep2025icml-zeroinitialized/}
}