Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection
Abstract
Vision transformers have demonstrated remarkable performance on a variety of computer vision tasks. In this paper, we illustrate the effectiveness of the deformable vision transformer for single-stage pedestrian detection and propose a spatial and multi-scale feature enhancement module, which aims to achieve an optimal balance between speed and accuracy. We demonstrate performance improvements from vision transformers on various commonly used single-stage structures, and investigate the design of the proposed architecture in depth. Comprehensive comparisons with state-of-the-art single- and two-stage detectors are performed on different pedestrian datasets. The proposed detector achieves leading performance on the Caltech and Citypersons datasets among single- and two-stage methods while using fewer parameters than the baseline. The log-average miss rates on the Reasonable and Heavy subsets are decreased to 2.6% and 28.0% on the Caltech test set, and to 10.9% and 38.6% on the Citypersons validation set, respectively. The proposed method outperforms state-of-the-art two-stage detectors on the Heavy subset of the Citypersons validation set with considerably faster inference speed.
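The abstract reports results as log-average miss rates but does not define the metric. As a minimal sketch, assuming the convention commonly used with the Caltech benchmark (miss rate sampled at nine false-positives-per-image values evenly spaced in log space between 10^-2 and 10^0, then averaged geometrically), the computation could look like the following. The function name, its arguments, and the choice to count a missing curve point as a miss rate of 1.0 are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI values.

    Assumes `fppi` is sorted in ascending order and `miss_rate[i]` is the
    miss rate of the detector at `fppi[i]` false positives per image.
    """
    if len(fppi) != len(miss_rate):
        raise ValueError("fppi and miss_rate must have the same length")

    # Reference FPPI values: 10^-2 ... 10^0, evenly spaced in log space.
    refs = [10.0 ** (-2.0 + 2.0 * i / (num_points - 1)) for i in range(num_points)]

    sampled = []
    for ref in refs:
        # Take the miss rate at the largest FPPI not exceeding the reference.
        # If the curve never gets that low, fall back to a miss rate of 1.0.
        value = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                value = m
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        sampled.append(max(value, 1e-10))

    # Averaging in log space is the same as taking the geometric mean.
    return math.exp(sum(math.log(s) for s in sampled) / num_points)


if __name__ == "__main__":
    # Hypothetical two-point curve, for illustration only.
    print(log_average_miss_rate([0.001, 0.5], [0.4, 0.1]))
```

Because the average is taken in log space, the metric rewards low miss rates across the whole range of false-positive budgets rather than at a single operating point.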
Cite
Text
Yuan et al. "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection." Neural Information Processing Systems, 2022.
Markdown
[Yuan et al. "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/yuan2022neurips-effectiveness/)
BibTeX
@inproceedings{yuan2022neurips-effectiveness,
title = {{Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection}},
author = {Yuan, Jing and Barmpoutis, Panagiotis and Stathaki, Tania},
booktitle = {Neural Information Processing Systems},
year = {2022},
url = {https://mlanthology.org/neurips/2022/yuan2022neurips-effectiveness/}
}