FZOO: Fast Zeroth-Order Optimizer for Fine‑Tuning Large Language Models Towards Adam‑Scale Speed

Sizhe Dang, yangyangGuo, Yanjun Zhao, Xiaodong Zheng, Guang Dai, Ivor Tsang, Haishan Ye

ICLR 2026

Abstract

Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually need tens of times more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD, for instance, demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer towards Adam-Scale Speed. On the one hand, FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step-sizes based on the standard deviation of batch losses. On the other hand, it accelerates per-batch computation through the use of Rademacher random vector (±1) perturbations, which also enables further speedups through batched evaluation. Extensive experiments on diverse models (including RoBERTa-large, the OPT family (350M-66B), Phi-2, and Llama3) across 11 varied downstream tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by +3% in accuracy while requiring 3$\times$ fewer forward passes. Notably, for the RoBERTa-large model, FZOO achieves average improvements of +5.6% in accuracy and an 18$\times$ reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO’s formal equivalence to a normalized-SGD update rule and establishing its convergence guarantees. Beyond full-parameter tuning, FZOO plugs smoothly into PEFT techniques, unlocking even larger memory savings. Taken together, our results make single-GPU, high-speed, full-parameter fine-tuning realistic today and point toward future work on memory-efficient pre-training. Code: https://github.com/DKmiyan/FZOO
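The abstract's core idea (one-sided estimates from a batch of Rademacher perturbations, with the step scaled by the standard deviation of the batch losses) can be illustrated with a small sketch. This is not the authors' implementation: the function name `zo_normalized_step`, its parameters, the toy quadratic loss, and the exact form of the normalization are assumptions made for illustration, and the released code should be consulted for the actual update rule.

```python
import random
import statistics


def zo_normalized_step(loss_fn, theta, rng, lr=0.05, eps=1e-3, n_perturb=16):
    """One forward-only update on a list of floats (hypothetical sketch).

    Uses n_perturb + 1 loss evaluations and no backward pass.
    """
    base_loss = loss_fn(theta)  # single unperturbed forward pass
    # Rademacher directions: every coordinate is +1 or -1 with equal probability.
    directions = [[rng.choice((-1.0, 1.0)) for _ in theta] for _ in range(n_perturb)]
    # One-sided evaluation: only theta + eps * u is evaluated, never theta - eps * u.
    perturbed = [
        loss_fn([t + eps * u for t, u in zip(theta, d)]) for d in directions
    ]
    # Standard deviation of the perturbed losses sets the step scale (assumed form).
    sigma = statistics.pstdev(perturbed)
    if sigma == 0.0:
        return list(theta)  # no signal in this batch; skip the update
    new_theta = []
    for j, t in enumerate(theta):
        # Average of (loss difference) * direction; the eps factor cancels with sigma.
        g_j = sum(
            (l - base_loss) * d[j] for l, d in zip(perturbed, directions)
        ) / n_perturb
        new_theta.append(t - lr * g_j / sigma)
    return new_theta


def quadratic(theta):
    """Toy stand-in for a model loss (an assumption, not from the paper)."""
    return 0.5 * sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0 + 0.2 * j for j in range(10)]
start_loss = quadratic(theta)
for _ in range(300):
    theta = zo_normalized_step(quadratic, theta, rng)
final_loss = quadratic(theta)
print(start_loss, final_loss)
```

For a smooth loss and small `eps`, the standard deviation of the perturbed losses is roughly `eps` times the gradient norm, so dividing by it makes the step resemble a normalized-SGD update, which is consistent with the equivalence the abstract mentions. In this sketch the perturbed losses are computed in a Python loop; the batched evaluation described in the abstract would instead compute them together.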

Cite

Text

Dang et al. "FZOO: Fast Zeroth-Order Optimizer for Fine‑Tuning Large Language Models Towards Adam‑Scale Speed." International Conference on Learning Representations, 2026.

Markdown

[Dang et al. "FZOO: Fast Zeroth-Order Optimizer for Fine‑Tuning Large Language Models Towards Adam‑Scale Speed." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/dang2026iclr-fzoo/)

BibTeX

@inproceedings{dang2026iclr-fzoo,
  title     = {{FZOO: Fast Zeroth-Order Optimizer for Fine‑Tuning Large Language Models Towards Adam‑Scale Speed}},
  author    = {Dang, Sizhe and yangyangGuo and Zhao, Yanjun and Zheng, Xiaodong and Dai, Guang and Tsang, Ivor and Ye, Haishan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/dang2026iclr-fzoo/}
}