Lightweight Neural App Control

Abstract

This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interaction and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. LiMAC also significantly outperforms prompt-engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and by up to 42% compared to prompt-engineering baselines.
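
The input/output contract described above (a textual goal plus a history of screenshots and UI trees in, one action out) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how AcT and the VLM divide the work, so the routing rule here (the small model decides first, and the VLM is called only when free-form text is needed) is an assumption, and every name (`Observation`, `Action`, `TEXT_KINDS`, `select_action`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """One past phone observation: a screenshot plus its UI tree."""
    screenshot: bytes  # raw image bytes (placeholder type)
    ui_tree: dict      # parsed UI hierarchy (placeholder type)


@dataclass
class Action:
    """A predicted action; field names are illustrative, not from the paper."""
    kind: str                     # e.g. "click" or "input-text" (hypothetical labels)
    target: Optional[int] = None  # index of a UI element, if any
    text: Optional[str] = None    # generated text, if any


# Assumption: the action kinds that require free-form text generation.
TEXT_KINDS = {"input-text"}

Policy = Callable[[str, List[Observation]], Action]
TextModel = Callable[[str, List[Observation]], str]


def select_action(goal: str, history: List[Observation],
                  small_model: Policy, vlm: TextModel) -> Action:
    """Query the small model first; call the larger VLM only for text."""
    action = small_model(goal, history)
    if action.kind in TEXT_KINDS and action.text is None:
        # Assumed division of labour: the VLM fills in the text content.
        action.text = vlm(goal, history)
    return action
```

Under this assumed design, the more expensive VLM call is skipped for actions such as clicks, which is one plausible way a small model could keep per-step cost low on a phone.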

Cite

Text

Christianos et al. "Lightweight Neural App Control." International Conference on Learning Representations, 2025.

Markdown

[Christianos et al. "Lightweight Neural App Control." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/christianos2025iclr-lightweight/)

BibTeX

@inproceedings{christianos2025iclr-lightweight,
  title     = {{Lightweight Neural App Control}},
  author    = {Christianos, Filippos and Papoudakis, Georgios and Coste, Thomas and Hao, Jianye and Wang, Jun and Shao, Kun},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/christianos2025iclr-lightweight/}
}