EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Abstract

Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate twenty-one popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
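
The abstract mentions using GPT-4 as an automatic judge for single-answer grading of open-ended responses. The sketch below is illustrative only, not the authors' released evaluation code: it assumes the OpenAI Python client, a hypothetical grading prompt, and an assumed 0/0.5/1 scoring rubric, just to show what such a judging call could look like.

```python
# Minimal illustrative sketch of GPT-4 single-answer grading.
# NOT the EgoThink authors' code; prompt wording, model name, and
# score scale (0.0 / 0.5 / 1.0) are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an answer to a first-person visual question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Give a single score: 1.0 if correct, 0.5 if partially correct, 0.0 if wrong.
Reply with the number only."""


def judge_single_answer(question: str, reference: str, prediction: str) -> float:
    """Ask GPT-4 to grade one open-ended answer against the reference answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
    )
    return float(response.choices[0].message.content.strip())


# Example usage (hypothetical data):
# score = judge_single_answer("What am I holding?", "a red mug", "a coffee mug")
```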

Cite

Text

Cheng et al. "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01355

Markdown

[Cheng et al. "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/cheng2024cvpr-egothink/) doi:10.1109/CVPR52733.2024.01355

BibTeX

@inproceedings{cheng2024cvpr-egothink,
  title     = {{EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models}},
  author    = {Cheng, Sijie and Guo, Zhicheng and Wu, Jingwen and Fang, Kechen and Li, Peng and Liu, Huaping and Liu, Yang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14291--14302},
  doi       = {10.1109/CVPR52733.2024.01355},
  url       = {https://mlanthology.org/cvpr/2024/cheng2024cvpr-egothink/}
}