Continual SFT Matches Multimodal RLHF with Negative Supervision

Abstract

Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logits of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits this information. Our nSFT disentangles the negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously demonstrated by comparing it with various multimodal RLHF approaches across different dataset sources, base VLMs, and evaluation metrics. In addition, extensive ablations are provided to support our hypothesis. Code is available at https://github.com/Kevinz-code/nSFT/.
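
To make the memory argument concrete, the minimal sketch below contrasts a single-model SFT-style update (plain token-level cross-entropy) with a DPO-style objective that additionally keeps a frozen reference model resident in memory. How nSFT actually constructs training targets from the rejected responses is not specified in the abstract, so `target_ids` is treated here as a hypothetical, already-constructed input; the model is assumed to follow a HuggingFace-style interface returning `.logits`.

```python
import torch
import torch.nn.functional as F

def sft_style_step(model, input_ids, target_ids, optimizer):
    """One update with a plain token-level cross-entropy (SFT) loss.

    Only ONE VLM is kept in memory. `target_ids` is assumed to already
    encode whatever supervision is derived from rejected responses; the
    paper's actual nSFT construction is not reproduced here.
    """
    logits = model(input_ids).logits                       # (B, T, V)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target_ids.view(-1),
        ignore_index=-100,                                 # mask non-supervised tokens
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def dpo_style_loss(chosen_lp_policy, rejected_lp_policy,
                   chosen_lp_ref, rejected_lp_ref, beta=0.1):
    """Illustrative DPO objective: it needs sequence log-probs from both the
    trainable policy and a frozen reference model, i.e. two full VLMs."""
    margin = beta * (
        (chosen_lp_policy - rejected_lp_policy)
        - (chosen_lp_ref - rejected_lp_ref)
    )
    return -F.logsigmoid(margin).mean()
```

The contrast is only about the number of models the optimizer must hold: the SFT-style step touches one model, while the DPO-style loss requires forward passes through a policy and a reference model (and PPO adds reward and value models on top).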

Cite

Text

Zhu et al. "Continual SFT Matches Multimodal RLHF with Negative Supervision." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01362

Markdown

[Zhu et al. "Continual SFT Matches Multimodal RLHF with Negative Supervision." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhu2025cvpr-continual/) doi:10.1109/CVPR52734.2025.01362

BibTeX

@inproceedings{zhu2025cvpr-continual,
  title     = {{Continual SFT Matches Multimodal RLHF with Negative Supervision}},
  author    = {Zhu, Ke and Wang, Yu and Sun, Yanpeng and Chen, Qiang and Liu, Jiangjiang and Zhang, Gang and Wang, Jingdong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14615--14624},
  doi       = {10.1109/CVPR52734.2025.01362},
  url       = {https://mlanthology.org/cvpr/2025/zhu2025cvpr-continual/}
}