Spa-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation

Chen, Jingxuan; Yuen, Derek; Xie, Bin; Yang, Yuhao; Chen, Gongwei; Wu, Zhihao; Yixing, Li; Zhou, Xurui; Liu, Weiwen; Wang, Shuai; Shao, Rui; Nie, Liqiang; Wang, Yasheng; Hao, Jianye; Wang, Jun; Shao, Kun

Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

NeurIPSW 2024

Abstract

Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based agents emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a diverse task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-BENCH, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an end-to-end setting. SPA-BENCH offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over 10 agents with the flexibility to add more, regardless of their underlying models or how they interact with the environment; (3) A novel evaluation pipeline that assesses agent performance across multiple dimensions, using coarse-to-fine success detection alongside completion- and consumption-related metrics. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and resource consumption. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications.

Cite

Text

Chen et al. "Spa-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation." NeurIPS 2024 Workshops: OWA, 2024.

Markdown

[Chen et al. "Spa-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation." NeurIPS 2024 Workshops: OWA, 2024.](https://mlanthology.org/neuripsw/2024/chen2024neuripsw-spabench/)

BibTeX

@inproceedings{chen2024neuripsw-spabench,
  title     = {{Spa-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation}},
  author    = {Chen, Jingxuan and Yuen, Derek and Xie, Bin and Yang, Yuhao and Chen, Gongwei and Wu, Zhihao and Yixing, Li and Zhou, Xurui and Liu, Weiwen and Wang, Shuai and Shao, Rui and Nie, Liqiang and Wang, Yasheng and Hao, Jianye and Wang, Jun and Shao, Kun},
  booktitle = {NeurIPS 2024 Workshops: OWA},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/chen2024neuripsw-spabench/}
}