Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Abstract

DETR-like methods have significantly improved detection performance in an end-to-end manner. The mainstream two-stage variants perform dense self-attention and select a fraction of queries for sparse cross-attention, which has proven effective for improving performance but also introduces a heavy computational burden and a strong dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects in two-stage initialization. To address these issues, we propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries, we introduce elaborate query refinement modules for stable two-stage initialization. Based on the above improvements, the proposed Salience DETR achieves improvements of +4.0% AP, +0.2% AP, and +4.4% AP on three challenging task-specific detection datasets, as well as 49.2% AP on COCO 2017 with fewer FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR.
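The idea of encoding only filtered queries can be illustrated with a minimal NumPy sketch. It assumes each flattened feature token already carries a predicted salience score, keeps the top fraction of tokens, and runs a toy single-head self-attention over just those tokens while the rest pass through unchanged. The names salience_filter_encode, toy_self_attention, and keep_ratio are hypothetical, and a single fixed keep ratio is a simplification of the hierarchical filtering described above; this is not the Salience DETR implementation.

```python
import numpy as np


def toy_self_attention(x):
    """Single-head dot-product self-attention with identity projections (toy encoder layer)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ x  # residual connection


def salience_filter_encode(tokens, salience, keep_ratio, encode_fn):
    """Encode only the most salient tokens; the rest pass through unchanged.

    tokens:     (N, C) flattened feature tokens
    salience:   (N,) predicted salience score per token (assumed given)
    keep_ratio: fraction of tokens to encode (hypothetical hyperparameter)
    encode_fn:  maps (K, C) -> (K, C); attention cost scales with K**2, not N**2
    """
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    keep = np.argsort(-salience, kind="stable")[:k]  # indices of the top-k scores
    out = tokens.copy()
    out[keep] = encode_fn(tokens[keep])
    return out, keep


# Usage: 10 tokens with 4 channels; only the 3 highest-salience tokens are encoded.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 4))
salience = np.arange(10, dtype=float)
out, keep = salience_filter_encode(tokens, salience, 0.3, toy_self_attention)
print(keep.tolist())  # [9, 8, 7]
```

Because the attention term grows quadratically with the number of tokens, encoding K of N tokens shrinks that term by roughly (K/N)^2, which is the intuition behind trading a small amount of precision for a large saving in computation.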
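Scale-independent supervision can be sketched by making each pixel's salience target depend only on its position relative to its box, not on the box's absolute size. The formula used here (1 minus the distance to the box center, with offsets normalized by box width and height, and 0 outside every box) is an assumption for illustration; the paper's exact definition may differ. The function name salience_target and the (x1, y1, x2, y2) box format are hypothetical.

```python
import numpy as np


def salience_target(height, width, boxes):
    """Per-pixel salience target whose shape does not depend on object size.

    boxes: iterable of (x1, y1, x2, y2) in grid coordinates (assumed format).
    Inside a box the target is 1 - ||((x - cx) / w, (y - cy) / h)||, so offsets
    are measured relative to the box size; pixels outside every box get 0.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    target = np.zeros((height, width))
    for x1, y1, x2, y2 in boxes:
        w, h = x2 - x1, y2 - y1
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        inside = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
        dist = np.sqrt(((xs - cx) / w) ** 2 + ((ys - cy) / h) ** 2)
        score = np.where(inside, 1.0 - dist, 0.0)
        target = np.maximum(target, score)  # overlapping boxes: keep the larger value
    return target


# Usage: a 4x4 box and an 8x8 box (on a grid twice as dense) give the same pattern.
small = salience_target(5, 5, [(0, 0, 4, 4)])
large = salience_target(9, 9, [(0, 0, 8, 8)])
print(np.allclose(large[::2, ::2], small))  # True
```

With this normalization every object receives the same peak value at its center and the same falloff toward its edges regardless of its size, so small objects are not under-weighted in the target relative to large ones.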

Cite

Text

Hou et al. "Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01664

Markdown

[Hou et al. "Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/hou2024cvpr-salience/) doi:10.1109/CVPR52733.2024.01664

BibTeX

@inproceedings{hou2024cvpr-salience,
  title     = {{Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement}},
  author    = {Hou, Xiuquan and Liu, Meiqin and Zhang, Senlin and Wei, Ping and Chen, Badong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {17574-17583},
  doi       = {10.1109/CVPR52733.2024.01664},
  url       = {https://mlanthology.org/cvpr/2024/hou2024cvpr-salience/}
}