LLaVA-Video: Video Instruction Tuning with Synthetic Data
Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
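For readers who want a concrete sense of how an instruction-following corpus of this kind is typically organized and consumed, the sketch below loads one subset with the Hugging Face datasets library and prints a single conversation. This is a minimal illustration only: the repository id, configuration name, split name, and field names are assumptions and may not match the official release.

# Minimal sketch (Python): peeking at a video instruction-tuning record.
# The repo id, config, split, and field names below are assumed for
# illustration and may differ from the released dataset.
from datasets import load_dataset

ds = load_dataset(
    "lmms-lab/LLaVA-Video-178K",     # hypothetical Hub repo id
    "0_30_s_academic_v0_1",          # hypothetical config (clip-length subset)
    split="open_ended",              # hypothetical split: open-ended QA
)

sample = ds[0]
print(sample["video"])                # identifier of the source video clip
for turn in sample["conversations"]:  # alternating instruction/response turns
    print(f'{turn["from"]}: {turn["value"]}')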
Cite
Text
Zhang et al. "LLaVA-Video: Video Instruction Tuning with Synthetic Data." Transactions on Machine Learning Research, 2025.
Markdown
[Zhang et al. "LLaVA-Video: Video Instruction Tuning with Synthetic Data." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/zhang2025tmlr-llavavideo/)
BibTeX
@article{zhang2025tmlr-llavavideo,
title = {{LLaVA-Video: Video Instruction Tuning with Synthetic Data}},
author = {Zhang, Yuanhan and Wu, Jinming and Li, Wei and Li, Bo and Ma, Zejun and Liu, Ziwei and Li, Chunyuan},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/zhang2025tmlr-llavavideo/}
}