PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Abstract
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results at the cost of measurable scientific progress: without knowing the details of the teacher model and its data sources, that progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks, focusing on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.
Cite
Text
Cho et al. "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding." Advances in Neural Information Processing Systems, 2025.
Markdown
[Cho et al. "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/cho2025neurips-perceptionlm/)
BibTeX
@inproceedings{cho2025neurips-perceptionlm,
title = {{PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding}},
author = {Cho, Jang Hyun and Madotto, Andrea and Mavroudi, Effrosyni and Afouras, Triantafyllos and Nagarajan, Tushar and Maaz, Muhammad and Song, Yale and Ma, Tengyu and Hu, Shuming and Jain, Suyog and Martin, Miguel and Wang, Huiyu and Rasheed, Hanoona Abdul and Sun, Peize and Huang, Po-Yao and Bolya, Daniel and Ravi, Nikhila and Jain, Shashank and Stark, Tammy and Moon, Seungwhan and Damavandi, Babak and Lee, Vivian and Westbury, Andrew and Khan, Salman and Kraehenbuehl, Philipp and Dollar, Piotr and Torresani, Lorenzo and Grauman, Kristen and Feichtenhofer, Christoph},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/cho2025neurips-perceptionlm/}
}