TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection

Chen, Hanning; Huang, Wenjun; Ni, Yang; Yun, Sanggeon; Liu, Yezi; Wen, Fei; Velasquez, Alvaro; Latapie, Hugo; Imani, Mohsen

doi:10.1007/978-3-031-91907-7_24

Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Yezi Liu, Fei Wen, Alvaro Velasquez, Hugo Latapie, Mohsen Imani

ECCVW 2024 pp. 401-418

Abstract

Task-oriented object detection aims to find suitable objects for performing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-reasoning object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts. Nevertheless, the naive application of VLMs leads to suboptimal quality, due to the misalignment between embeddings of object images and their visual attributes, which are mainly adjective phrases. To this end, we design a transformer-based aligner after the pre-trained VLMs to recalibrate both embeddings. Finally, we employ a trainable score function to post-process the VLM matching results for object selection. Experimental results demonstrate that our TaskCLIP outperforms the DETR-based model TOIST in both accuracy ($+6.2\%$) and efficiency.
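The second stage described in the abstract (matching detected objects against task-relevant visual attributes in a shared embedding space, then scoring objects for selection) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the object-crop and attribute-phrase embeddings have already been produced by some VLM, it omits the transformer-based aligner entirely, and it replaces the trainable score function with a fixed mean-and-threshold rule. The function names, the toy vectors, and the `threshold` value are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def cosine_matrix(object_embs, attribute_embs):
    """Return a (num_objects x num_attributes) matrix of cosine similarities."""
    objs = [l2_normalize(o) for o in object_embs]
    attrs = [l2_normalize(a) for a in attribute_embs]
    return [[sum(x * y for x, y in zip(o, a)) for a in attrs] for o in objs]

def select_objects(similarity, threshold=0.5):
    """Hypothetical stand-in for a trainable score function:
    average each object's similarity over all attributes, then threshold."""
    scores = [sum(row) / len(row) for row in similarity]
    selected = [i for i, s in enumerate(scores) if s >= threshold]
    return scores, selected

# Toy embeddings (assumed, not from the paper): two attribute phrases
# and three detected object crops in a 3-dimensional shared space.
attribute_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
object_embs = [
    [2.0, 2.0, 0.0],  # points toward both attributes
    [0.0, 0.0, 3.0],  # orthogonal to both attributes
    [1.0, 0.0, 1.0],  # points toward only the first attribute
]

similarity = cosine_matrix(object_embs, attribute_embs)
scores, selected = select_objects(similarity, threshold=0.5)
print(selected)
```

In this toy setup only the first object clears the threshold. In the abstract's design, an aligner would first recalibrate both sets of embeddings, and the selection rule would be learned instead of fixed.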

Cite

Text

Chen et al. "TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91907-7_24

Markdown

[Chen et al. "TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/chen2024eccvw-taskclip/) doi:10.1007/978-3-031-91907-7_24

BibTeX

@inproceedings{chen2024eccvw-taskclip,
  title     = {{TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection}},
  author    = {Chen, Hanning and Huang, Wenjun and Ni, Yang and Yun, Sanggeon and Liu, Yezi and Wen, Fei and Velasquez, Alvaro and Latapie, Hugo and Imani, Mohsen},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {401-418},
  doi       = {10.1007/978-3-031-91907-7_24},
  url       = {https://mlanthology.org/eccvw/2024/chen2024eccvw-taskclip/}
}