AppVLM: A Lightweight Vision Language Model for Online App Control

Abstract

The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results show that AppVLM achieves the highest offline action prediction accuracy in AndroidControl, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate on AndroidWorld, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
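The abstract describes a two-stage recipe: offline supervised fine-tuning on AndroidControl-style (observation, action) data, followed by online refinement, in which the current policy is rolled out in an AndroidWorld-style environment and further training iterations are run on the collected data. The sketch below illustrates that loop in Python under loose assumptions; the class names, the act/fine_tune interfaces, and the "keep successful trajectories" filtering are illustrative placeholders, not the authors' actual implementation or API.

# Hedged sketch of the two-stage training recipe summarised in the abstract:
# (1) offline fine-tuning on AndroidControl-style (observation, action) pairs,
# (2) online refinement: roll out the current policy in an AndroidWorld-style
#     environment, keep data from successful episodes, and fine-tune again.
# Policy, Environment, fine_tune, etc. are placeholders, not the paper's code.

from dataclasses import dataclass, field
import random

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)   # (observation, action) pairs
    success: bool = False

class Policy:
    """Stand-in for a lightweight VLM policy mapping screenshots + text to actions."""
    def fine_tune(self, dataset):
        # In practice: supervised fine-tuning on (instruction, screenshot) -> action.
        print(f"fine-tuning on {len(dataset)} examples")

    def act(self, observation):
        # Placeholder action head; a real policy would decode a structured action.
        return random.choice(["click", "scroll", "input_text", "finish"])

class Environment:
    """Stand-in for an AndroidWorld-style online environment with a task checker."""
    def rollout(self, policy, task, max_steps=10):
        traj = Trajectory()
        obs = f"start:{task}"
        for _ in range(max_steps):
            action = policy.act(obs)
            traj.steps.append((obs, action))
            if action == "finish":
                break
            obs = f"{task}:{action}"
        traj.success = random.random() < 0.5      # placeholder for the real success check
        return traj

def train(offline_data, tasks, n_iterations=3):
    policy = Policy()
    policy.fine_tune(offline_data)                # stage 1: offline fine-tuning
    replay = list(offline_data)
    env = Environment()
    for _ in range(n_iterations):                 # stage 2: iterative online refinement
        trajectories = [env.rollout(policy, t) for t in tasks]
        kept = [s for t in trajectories if t.success for s in t.steps]
        policy.fine_tune(replay + kept)           # retrain on offline + successful online data
    return policy

if __name__ == "__main__":
    offline = [("obs", "action")] * 100           # stand-in for AndroidControl examples
    train(offline, tasks=[f"task-{i}" for i in range(5)])

The skeleton only captures the control flow (offline training, bounded online rollouts, filtering, repeated fine-tuning); model architecture, action parsing, and the success criterion are all abstracted away.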

Cite

Text

Papoudakis et al. "AppVLM: A Lightweight Vision Language Model for Online App Control." ICLR 2025 Workshops: FM-Wild, 2025.

Markdown

[Papoudakis et al. "AppVLM: A Lightweight Vision Language Model for Online App Control." ICLR 2025 Workshops: FM-Wild, 2025.](https://mlanthology.org/iclrw/2025/papoudakis2025iclrw-appvlm/)

BibTeX

@inproceedings{papoudakis2025iclrw-appvlm,
  title     = {{AppVLM: A Lightweight Vision Language Model for Online App Control}},
  author    = {Papoudakis, Georgios and Coste, Thomas and Wu, Zhihao and Hao, Jianye and Wang, Jun and Shao, Kun},
  booktitle = {ICLR 2025 Workshops: FM-Wild},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/papoudakis2025iclrw-appvlm/}
}