Mobile OS Task Procedure Extraction from YouTube

Abstract

We present MOTIFY, a novel approach for predicting scene transitions and actions from mobile operating system (OS) task videos. By leveraging pretrained Vision-Language Models (VLMs), MOTIFY extracts task sequences from real-world YouTube videos without manual annotation. Our method addresses the limitations of existing approaches, which rely on manual data annotation or simulation environments. We demonstrate MOTIFY's effectiveness on a diverse set of mobile OS tasks across multiple platforms, outperforming baseline methods in both scene transition detection and action prediction. This approach opens new possibilities for scalable, real-world mobile agent development and video understanding research.
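The abstract describes a pipeline that segments a task video at scene transitions before extracting actions. As a minimal, hypothetical sketch of the scene-transition step only: the snippet below flags a transition whenever the mean absolute pixel difference between consecutive frames exceeds a threshold. This is a simple baseline for illustration, not the paper's VLM-based method; the function name, frame representation, and threshold are all assumptions.

```python
from typing import List, Sequence

def detect_transitions(frames: Sequence[Sequence[float]],
                       threshold: float = 0.3) -> List[int]:
    """Return indices i where frame i differs enough from frame i-1
    to count as a scene transition (hypothetical pixel-difference baseline).

    Each frame is a flat sequence of pixel intensities in [0, 1].
    """
    transitions = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        # Mean absolute difference between corresponding pixels.
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            transitions.append(i)
    return transitions

# Two identical dark frames, then two identical bright frames:
# the only transition is at index 2.
frames = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]
print(detect_transitions(frames))  # [2]
```

In the paper's setting, such per-segment boundaries would then be passed to a pretrained VLM to describe the UI action taken in each segment; here the thresholding merely stands in for that boundary-detection stage.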

Cite

Text

Jang et al. "Mobile OS Task Procedure Extraction from YouTube." NeurIPS 2024 Workshops: Video-Language Models, 2024.

Markdown

[Jang et al. "Mobile OS Task Procedure Extraction from YouTube." NeurIPS 2024 Workshops: Video-Language Models, 2024.](https://mlanthology.org/neuripsw/2024/jang2024neuripsw-mobile/)

BibTeX

@inproceedings{jang2024neuripsw-mobile,
  title     = {{Mobile OS Task Procedure Extraction from YouTube}},
  author    = {Jang, Yunseok and Song, Yeda and Sohn, Sungryull and Logeswaran, Lajanugen and Luo, Tiange and Lee, Honglak},
  booktitle = {NeurIPS 2024 Workshops: Video-Language Models},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/jang2024neuripsw-mobile/}
}