D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Choi, Suhwan; Jung, Jaeyoon; Seong, Haebin; Kim, Minchan; Kim, Minyeong; Cho, Yongjun; Kim, Yoonshik; Been, Park Yu; Yu, Youngjae; Lee, Yunsung

ICLR 2026

Abstract

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments, particularly gaming, offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework showing that desktop interactions can serve as an effective pretraining substrate for robotic embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152× compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), our 1B-parameter model achieves 96.6% success on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models up to 7× larger, such as $\pi_0$ (3.3B) and OpenVLA (7B). These results demonstrate that sensorimotor primitives learned from digital interactions transfer effectively to real-world physical tasks, establishing desktop pretraining as a practical paradigm for embodied AI. All resources are publicly available at https://worv-ai.github.io/d2e.

Cite

Text

Choi et al. "D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI." International Conference on Learning Representations, 2026.

Markdown

[Choi et al. "D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/choi2026iclr-d2e/)

BibTeX

@inproceedings{choi2026iclr-d2e,
  title     = {{D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI}},
  author    = {Choi, Suhwan and Jung, Jaeyoon and Seong, Haebin and Kim, Minchan and Kim, Minyeong and Cho, Yongjun and Kim, Yoonshik and Been, Park Yu and Yu, Youngjae and Lee, Yunsung},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/choi2026iclr-d2e/}
}