Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding

Abstract

Unsupervised visual grounding methods alleviate the issue of expensive manual annotation of image-query pairs by generating pseudo-queries. However, existing methods are prone to confusing the spatial relationships between objects and rely on designing complex prompt modules to generate query texts, which severely impedes the ability to generate accurate and comprehensive queries due to ambiguous spatial relationships and manually defined fixed templates. To tackle these challenges, we propose an omni-directional language query generation approach for unsupervised visual grounding, named Omni-Q. Specifically, we develop a 3D spatial relation module that extends the 2D spatial representation to 3D, thereby utilizing 3D location information to accurately determine the spatial positions among objects. In addition, we introduce a spatial graph module that leverages the power of graph structures to establish accurate and diverse object relationships, thus enhancing the flexibility of query generation. Extensive experiments on five public benchmark datasets demonstrate that our method significantly outperforms existing state-of-the-art unsupervised methods by up to 16.17%. Moreover, when applied in the supervised setting, our method can save up to 60% of human annotations without a loss of performance.
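To give a concrete sense of the idea of lifting 2D spatial relations into 3D, the sketch below shows how 2D box centers combined with a per-object depth estimate could be mapped to coarse omni-directional relations ("in front of", "behind", "left of", ...) for composing pseudo-queries. This is a minimal illustrative sketch only, not the paper's implementation; the names (`Object3D`, `spatial_relation`), the thresholds, and the use of monocular depth are assumptions for illustration.

```python
# Illustrative sketch (hypothetical, not the Omni-Q implementation):
# derive a coarse 3D spatial relation between two detected objects from
# their 2D box centers plus an estimated depth, for pseudo-query text.

from dataclasses import dataclass


@dataclass
class Object3D:
    label: str
    cx: float      # 2D box center x (pixels)
    cy: float      # 2D box center y (pixels)
    depth: float   # estimated distance from the camera (e.g., monocular depth, meters)


def spatial_relation(a: Object3D, b: Object3D,
                     xy_margin: float = 20.0, z_margin: float = 0.5) -> str:
    """Return a coarse relation of object `a` with respect to object `b`."""
    dx, dy, dz = a.cx - b.cx, a.cy - b.cy, a.depth - b.depth
    # Prefer the depth axis when the depth gap is significant; a purely 2D
    # formulation would have to fall back to the image-plane axes only.
    if abs(dz) > z_margin:
        return "behind" if dz > 0 else "in front of"
    if abs(dx) > xy_margin and abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    if abs(dy) > xy_margin:
        return "below" if dy > 0 else "above"
    return "next to"


if __name__ == "__main__":
    cup = Object3D("cup", cx=320, cy=240, depth=1.2)
    laptop = Object3D("laptop", cx=300, cy=250, depth=2.0)
    # e.g. prints "the cup in front of the laptop"
    print(f"the {cup.label} {spatial_relation(cup, laptop)} the {laptop.label}")
```

Pairwise relations like these could then serve as edges in a relation graph over detected objects, from which varied query sentences are composed; the actual graph construction and query templating in Omni-Q are described in the paper itself.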

Cite

Text

Wang et al. "Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01352

Markdown

[Wang et al. "Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wang2024cvpr-omniq/) doi:10.1109/CVPR52733.2024.01352

BibTeX

@inproceedings{wang2024cvpr-omniq,
  title     = {{Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding}},
  author    = {Wang, Sai and Lin, Yutian and Wu, Yu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {14261--14270},
  doi       = {10.1109/CVPR52733.2024.01352},
  url       = {https://mlanthology.org/cvpr/2024/wang2024cvpr-omniq/}
}