Efficient Decoder-Free Object Detection with Transformers
Abstract
Vision transformers (ViTs) are changing the landscape of object detection tasks. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is simple yet brings an enormous computation burden during inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that demands an extra-long time to converge. As a result, transformer-based object detection has not prevailed in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, achieving high efficiency in both training and inference stages for the first time. We simplify object detection to an encoder-only single-level anchor-based dense prediction problem by centering around two entry points: 1) Eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) Explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT{SMALL} outperforms DETR by 2.5% AP with 28% computation cost reduction and more than 10X fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT{SMALL} obtains over 5.5% AP gain while cutting down 70% computation cost.
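To make the "single-level anchor-based dense prediction" formulation concrete, the following is a minimal, hypothetical sketch of anchor generation over one feature level (the names, scales, and ratios are illustrative defaults, not the authors' code or configuration):

```python
def make_anchors(feat_h, feat_w, stride, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchors for every cell of ONE feature map.

    In a single-level detector, one call over the final feature map
    covers the whole image; multi-level FPN-style detectors would
    repeat this per pyramid level.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Map the feature-map cell back to its center in image pixels.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r is the width/height aspect ratio; area stays ~s*s.
                    w = s * (r ** 0.5)
                    h = s / (r ** 0.5)
                    anchors.append((cx, cy, w, h))
    return anchors

# A 2x2 feature map with stride 16 yields 2 * 2 * 2 scales * 3 ratios = 24 anchors.
boxes = make_anchors(2, 2, 16)
print(len(boxes))  # 24
```

A dense prediction head would then emit one classification score and one box regression per anchor, with no decoder or learned object queries involved.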
Cite
Text

Chen et al. "Efficient Decoder-Free Object Detection with Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20080-9_5

Markdown

[Chen et al. "Efficient Decoder-Free Object Detection with Transformers." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/chen2022eccv-efficient-a/) doi:10.1007/978-3-031-20080-9_5

BibTeX
@inproceedings{chen2022eccv-efficient-a,
title = {{Efficient Decoder-Free Object Detection with Transformers}},
author = {Chen, Peixian and Zhang, Mengdan and Shen, Yunhang and Sheng, Kekai and Gao, Yuting and Sun, Xing and Li, Ke and Shen, Chunhua},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-20080-9_5},
url = {https://mlanthology.org/eccv/2022/chen2022eccv-efficient-a/}
}