Take a Step Back: Rethinking the Two Stages in Visual Reasoning

Abstract

As a prominent research area, visual reasoning plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to data bias fitting. In this paper, we revisit visual reasoning with a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage is better at generalization than symbolization. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks following the separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning. Our code is publicly available at https://mybearyzhang.github.io/projects/ TwoStageReason.

Cite

Text

Zhang et al. "Take a Step Back: Rethinking the Two Stages in Visual Reasoning." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72775-7_8

Markdown

[Zhang et al. "Take a Step Back: Rethinking the Two Stages in Visual Reasoning." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhang2024eccv-take/) doi:10.1007/978-3-031-72775-7_8

BibTeX

@inproceedings{zhang2024eccv-take,
  title     = {{Take a Step Back: Rethinking the Two Stages in Visual Reasoning}},
  author    = {Zhang, Mingyu and Cai, Jiting and Liu, Mingyu and Xu, Yue and Lu, Cewu and Li, Yong-Lu},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72775-7_8},
  url       = {https://mlanthology.org/eccv/2024/zhang2024eccv-take/}
}