Mobile OS Task Procedure Extraction from YouTube
Abstract
We present MOTIFY, a novel approach for predicting scene transitions and actions from mobile operating system (OS) task videos. By leveraging pretrained Vision-Language Models (VLMs), MOTIFY extracts task sequences from real-world YouTube videos without manual annotation. Our method addresses the limitations of existing approaches, which rely on manual data annotation or simulation environments. We demonstrate MOTIFY's effectiveness on a diverse set of mobile OS tasks across multiple platforms, outperforming baseline methods in scene transition detection and action prediction. This approach opens new possibilities for scalable, real-world mobile agent development and video understanding research.
Cite
Text

Jang et al. "Mobile OS Task Procedure Extraction from YouTube." NeurIPS 2024 Workshops: Video-Language Models, 2024.

Markdown

[Jang et al. "Mobile OS Task Procedure Extraction from YouTube." NeurIPS 2024 Workshops: Video-Language Models, 2024.](https://mlanthology.org/neuripsw/2024/jang2024neuripsw-mobile/)

BibTeX
@inproceedings{jang2024neuripsw-mobile,
title = {{Mobile OS Task Procedure Extraction from YouTube}},
author = {Jang, Yunseok and Song, Yeda and Sohn, Sungryull and Logeswaran, Lajanugen and Luo, Tiange and Lee, Honglak},
booktitle = {NeurIPS 2024 Workshops: Video-Language Models},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/jang2024neuripsw-mobile/}
}