X-DETR: A Versatile Architecture for Instance-Wise Vision-Language Tasks

Abstract

In this paper, we study challenging instance-wise vision-language tasks, in which free-form language must be aligned with individual objects rather than with the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams remain independent until the end, where they are aligned using an efficient dot-product operation. The whole network is trained end-to-end, so that the detector is optimized for the vision-language tasks rather than used as an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other, weaker types of supervision to expand the knowledge coverage. This simple yet effective architecture shows good accuracy and fast speed on multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at 20 frames per second without using any LVIS annotation during training. The code is available at https://github.com/amazon-research/cross-modal-detr.
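
The dot-product alignment between the two independent streams can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`alignment_scores`, `best_match`), the toy embedding values, and the use of a plain unnormalized dot product are all assumptions made for illustration. It only shows how per-object embeddings from a detector and per-query embeddings from a language encoder could be scored against each other once both streams have finished.

```python
def alignment_scores(object_embeddings, text_embeddings):
    """Return a matrix S where S[i][j] is the dot product of object i and text query j.

    Both inputs are lists of equal-length vectors. In this sketch they stand in
    for the outputs of an object detector and a language encoder (hypothetical).
    """
    return [
        [sum(o * t for o, t in zip(obj, txt)) for txt in text_embeddings]
        for obj in object_embeddings
    ]


def best_match(scores):
    """For each object (row), return the index of the highest-scoring text query."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy embeddings (hypothetical values, 3-dimensional for readability).
object_embeddings = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # two detected objects
text_embeddings = [[0.0, 3.0, 0.0], [4.0, 0.0, 0.0]]    # two free-form queries

scores = alignment_scores(object_embeddings, text_embeddings)
matches = best_match(scores)
```

Because neither stream depends on the other before this step, the text embeddings for a fixed set of queries (for example, a list of category names) could in principle be computed once and reused across images, which is one plausible reading of why such a design can be fast.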

Cite

Text

Cai et al. "X-DETR: A Versatile Architecture for Instance-Wise Vision-Language Tasks." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20059-5_17

Markdown

[Cai et al. "X-DETR: A Versatile Architecture for Instance-Wise Vision-Language Tasks." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/cai2022eccv-xdetr/) doi:10.1007/978-3-031-20059-5_17

BibTeX

@inproceedings{cai2022eccv-xdetr,
  title     = {{X-DETR: A Versatile Architecture for Instance-Wise Vision-Language Tasks}},
  author    = {Cai, Zhaowei and Kwon, Gukyeong and Ravichandran, Avinash and Bas, Erhan and Tu, Zhuowen and Bhotika, Rahul and Soatto, Stefano},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-20059-5_17},
  url       = {https://mlanthology.org/eccv/2022/cai2022eccv-xdetr/}
}