LLaVA-OneVision: Easy Visual Task Transfer

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs across three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Cite

Text

Li et al. "LLaVA-OneVision: Easy Visual Task Transfer." Transactions on Machine Learning Research, 2025.

Markdown

[Li et al. "LLaVA-OneVision: Easy Visual Task Transfer." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/li2025tmlr-llavaonevision/)

BibTeX

@article{li2025tmlr-llavaonevision,
  title     = {{LLaVA-OneVision: Easy Visual Task Transfer}},
  author    = {Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/li2025tmlr-llavaonevision/}
}