Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
Abstract
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer from a long-tailed predicate distribution: tail predicates are more costly to train and harder to distinguish because they have far fewer annotations than frequent predicates. Existing re-balancing strategies try to handle this via prior rules, but they remain confined to pre-defined conditions and do not scale across models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, in which a visually-prompted language model is trained to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tailed problem. Building on this, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction. The data and code for this paper are publicly available.
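For readers unfamiliar with the triple format mentioned in the abstract, the sketch below illustrates how a scene graph can be represented as a list of <subject, predicate, object> triples. It is a minimal illustration only; the class and field names are assumptions for exposition and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triple:
    """A single <subject, predicate, object> relationship in a scene graph."""
    subject: str    # detected object category acting as the subject, e.g. "person"
    predicate: str  # relationship label, e.g. a fine-grained predicate like "riding"
    obj: str        # the object of the relation ("object" shadows a Python builtin)


# A toy scene graph for one image, expressed as a list of triples.
scene_graph: List[Triple] = [
    Triple("person", "riding", "horse"),
    Triple("horse", "standing on", "grass"),
]

for t in scene_graph:
    print(f"<{t.subject}, {t.predicate}, {t.obj}>")
```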
Cite
Text
Yu et al. "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01971
Markdown
[Yu et al. "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/yu2023iccv-visuallyprompted/) doi:10.1109/ICCV51070.2023.01971
BibTeX
@inproceedings{yu2023iccv-visuallyprompted,
title = {{Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World}},
author = {Yu, Qifan and Li, Juncheng and Wu, Yu and Tang, Siliang and Ji, Wei and Zhuang, Yueting},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {21560-21571},
doi = {10.1109/ICCV51070.2023.01971},
url = {https://mlanthology.org/iccv/2023/yu2023iccv-visuallyprompted/}
}