VideoAgentTrek: Computer-Use Pretraining from Unlabeled Videos

Abstract

Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
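The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers, not part of the authors' pipeline; the function and constant names are illustrative assumptions, and the per-video average is a derived quantity that the abstract does not state.

```python
# Sanity check of the figures quoted in the abstract.
# All names here are illustrative; only the numeric inputs come from the abstract.

def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline` as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Reported OSWorld-Verified task success rates (percent).
OSWORLD_SFT_ONLY = 9.3
OSWORLD_WITH_VIDEO_PRETRAINING = 15.8

# Reported AgentNetBench step accuracies (percent).
AGENTNET_BASELINE = 64.1
AGENTNET_WITH_VIDEO_PRETRAINING = 69.3

# Reported corpus size.
NUM_VIDEOS = 39_000
NUM_INTERACTION_STEPS = 1_520_000

osworld_gain = relative_improvement(OSWORLD_SFT_ONLY, OSWORLD_WITH_VIDEO_PRETRAINING)
agentnet_gain = relative_improvement(AGENTNET_BASELINE, AGENTNET_WITH_VIDEO_PRETRAINING)

# Derived average, assuming steps are counted over all 39,000 videos.
steps_per_video = NUM_INTERACTION_STEPS / NUM_VIDEOS

if __name__ == "__main__":
    print(f"OSWorld-Verified relative gain: {osworld_gain:.1%}")
    print(f"AgentNetBench relative gain:    {agentnet_gain:.1%}")
    print(f"Average steps per video:        {steps_per_video:.1f}")
```

The OSWorld-Verified gain comes out just under 70%, consistent with the rounded figure in the abstract. The AgentNetBench gain of 5.2 percentage points corresponds to a relative gain of roughly 8%.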

Cite

Text

Lu et al. "VideoAgentTrek: Computer-Use Pretraining from Unlabeled Videos." International Conference on Learning Representations, 2026.

Markdown

[Lu et al. "VideoAgentTrek: Computer-Use Pretraining from Unlabeled Videos." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lu2026iclr-videoagenttrek/)

BibTeX

@inproceedings{lu2026iclr-videoagenttrek,
  title     = {{VideoAgentTrek: Computer-Use Pretraining from Unlabeled Videos}},
  author    = {Lu, Dunjie and Xu, Yiheng and Wang, Junli and Wu, Haoyuan and Wang, Xinyuan and Wang, Zekun and Yang, Junlin and Su, Hongjin and Chen, Jixuan and Chen, Junda and Mao, Yuchen and Lin, Junyang and Hui, Binyuan and Yu, Tao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lu2026iclr-videoagenttrek/}
}