Learning to Instruct for Visual Instruction Tuning

Abstract

We propose L2T, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, current design choices for VIT often lead to overfitting and shortcut learning, which can degrade performance. This gap arises from overemphasizing instruction-following ability while neglecting the proactive understanding of visual information. Motivated by this, L2T adopts a simple yet effective approach: it applies the loss function to both the instruction and the response sequences. This seamlessly expands the effective training data and regularizes MLLMs against over-reliance on language priors. On this basis, L2T achieves a significant relative improvement of up to 9% on comprehensive multimodal benchmarks, requiring no additional training data and incurring negligible computational overhead. Surprisingly, L2T also attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance while alleviating hallucination in MLLMs. Code: https://github.com/Feng-Hong/L2T
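
Below is a minimal sketch, not the authors' implementation, of the core idea stated above: standard VIT masks instruction tokens out of the loss and supervises only the response, whereas an L2T-style objective keeps the loss on both instruction and response tokens. It assumes a HuggingFace-style causal LM setup where ignored label positions are marked with -100; the helper names (`vit_labels`, `l2t_labels`, `next_token_loss`) are illustrative.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # common convention for positions excluded from the loss


def vit_labels(input_ids: torch.Tensor, instruction_mask: torch.Tensor) -> torch.Tensor:
    """Standard VIT: supervise only response tokens; instruction tokens are masked out."""
    labels = input_ids.clone()
    labels[instruction_mask] = IGNORE_INDEX
    return labels


def l2t_labels(input_ids: torch.Tensor) -> torch.Tensor:
    """L2T-style: keep the loss on both the instruction and the response tokens."""
    return input_ids.clone()


def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Shifted cross-entropy over every position whose label is not IGNORE_INDEX."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```

In practice, visual tokens and padding would still be excluded from the labels; the sketch only illustrates how unmasking the instruction tokens enlarges the supervised portion of each sample without adding any new training data.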

Cite

Text

Zhou et al. "Learning to Instruct for Visual Instruction Tuning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhou et al. "Learning to Instruct for Visual Instruction Tuning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhou2025neurips-learning/)

BibTeX

@inproceedings{zhou2025neurips-learning,
  title     = {{Learning to Instruct for Visual Instruction Tuning}},
  author    = {Zhou, Zhihan and Hong, Feng and Luo, Jiaan and Ye, Yushi and Yao, Jiangchao and Li, Dongsheng and Han, Bo and Zhang, Ya and Wang, Yanfeng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhou2025neurips-learning/}
}