UP-DETR: Unsupervised Pre-Training for Object Detection with Transformers
Abstract
Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to Unsupervisedly Pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder. The model is pre-trained to detect these query patches in the original image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off classification and localization preferences in the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch that is jointly optimized with patch detection. (2) To perform multi-query localization, we start UP-DETR with single-query patch detection and extend it to multi-query patches with object query shuffle and an attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation. Code and pre-trained models: https://github.com/dddzg/up-detr.
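The pretext task's data preparation can be illustrated with a short sketch. This is a hypothetical simplification (not the authors' code): it crops random patches from an unlabeled image and records their boxes, which serve as the free supervision signal — the patches would be embedded and fed as decoder queries, and the model trained to predict the recorded boxes. The function name `random_query_patches` and all parameters are illustrative assumptions.

```python
import numpy as np

def random_query_patches(image, num_queries=3, patch_size=16, rng=None):
    """Crop random patches from `image` (H, W, C) and return them with
    their ground-truth boxes (x, y, w, h).

    Illustrative sketch of UP-DETR-style pretext-task data preparation:
    the patches act as decoder queries, and the boxes are the targets
    the model is pre-trained to localize -- no human labels required.
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    patches, boxes = [], []
    for _ in range(num_queries):
        # Sample a top-left corner so the patch stays inside the image.
        x = int(rng.integers(0, w - patch_size + 1))
        y = int(rng.integers(0, h - patch_size + 1))
        patches.append(image[y:y + patch_size, x:x + patch_size])
        boxes.append((x, y, patch_size, patch_size))
    return patches, boxes

# Usage: the returned boxes supervise localization during pre-training.
img = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
patches, boxes = random_query_patches(img, num_queries=2)
```

In the full method, each patch would additionally pass through the frozen CNN backbone to produce a query embedding, and the reconstruction branch would regress that patch feature alongside box prediction.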
Cite
Text
Dai et al. "UP-DETR: Unsupervised Pre-Training for Object Detection with Transformers." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00165
Markdown
[Dai et al. "UP-DETR: Unsupervised Pre-Training for Object Detection with Transformers." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/dai2021cvpr-updetr/) doi:10.1109/CVPR46437.2021.00165
BibTeX
@inproceedings{dai2021cvpr-updetr,
title = {{UP-DETR: Unsupervised Pre-Training for Object Detection with Transformers}},
author = {Dai, Zhigang and Cai, Bolun and Lin, Yugeng and Chen, Junying},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {1601-1610},
doi = {10.1109/CVPR46437.2021.00165},
url = {https://mlanthology.org/cvpr/2021/dai2021cvpr-updetr/}
}