DESTR: Object Detection with Split Transformer

Abstract

Self- and cross-attention in Transformers provide high model capacity, making them viable models for object detection. However, Transformers still lag behind CNN-based detectors in performance. We believe this is because: (a) cross-attention is used for both the classification and bounding-box regression tasks; (b) the Transformer's decoder poorly initializes content queries; and (c) self-attention poorly accounts for certain prior knowledge that could improve inductive bias. We address these limitations with three corresponding contributions. First, we propose a new Detection Split Transformer (DESTR) that separates the estimation of cross-attention into two independent branches -- one tailored for classification and the other for box regression. Second, we use a mini-detector to initialize the content queries in the decoder with the classification and regression embeddings from the respective heads of the mini-detector. Third, we augment self-attention in the decoder to additionally account for pairs of adjacent object queries. Our experiments on the MS-COCO dataset show that DESTR outperforms DETR and its successors.
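The core idea of the first contribution -- running two independent cross-attention branches, one for classification and one for box regression -- can be illustrated with a minimal NumPy sketch. This is only a toy illustration of split cross-attention under assumed shapes, not the paper's implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Standard scaled dot-product attention:
    # queries: (num_queries, d), keys/values: (num_tokens, d).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

def split_cross_attention(cls_queries, reg_queries, keys, values):
    # Split design: the classification and regression branches each
    # compute their own attention over the encoder tokens, so each
    # branch can specialize; outputs are concatenated per query.
    cls_out = cross_attention(cls_queries, keys, values)
    reg_out = cross_attention(reg_queries, keys, values)
    return np.concatenate([cls_out, reg_out], axis=-1)
```

In contrast, a DETR-style decoder would use a single shared cross-attention whose output feeds both the classification and the box-regression heads.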

Cite

Text

He and Todorovic. "DESTR: Object Detection with Split Transformer." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00916

Markdown

[He and Todorovic. "DESTR: Object Detection with Split Transformer." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/he2022cvpr-destr/) doi:10.1109/CVPR52688.2022.00916

BibTeX

@inproceedings{he2022cvpr-destr,
  title     = {{DESTR: Object Detection with Split Transformer}},
  author    = {He, Liqiang and Todorovic, Sinisa},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {9377--9386},
  doi       = {10.1109/CVPR52688.2022.00916},
  url       = {https://mlanthology.org/cvpr/2022/he2022cvpr-destr/}
}