Cascade Transformers for End-to-End Person Search
Abstract
The goal of person search is to localize a target person from a gallery set of scene images, which is extremely challenging due to large scale variations, pose/viewpoint changes, and occlusions. In this paper, we propose the Cascade Occluded Attention Transformer (COAT) for end-to-end person search. Our three-stage cascade design focuses on detecting people in the first stage, while later stages simultaneously and progressively refine the representation for person detection and re-identification. At each stage, the occluded attention transformer applies tighter intersection over union thresholds, forcing the network to learn coarse-to-fine pose/scale invariant features. Meanwhile, we calculate each detection's occluded attention to differentiate a person's tokens from other people or the background. In this way, we simulate the effect of other objects occluding a person of interest at the token level. Through comprehensive experiments, we demonstrate the benefits of our method by achieving state-of-the-art performance on two benchmark datasets.
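The sketch below is not the authors' implementation; it is a minimal illustration of two ideas stated in the abstract, under assumed values: (i) a cascade whose per-stage IoU thresholds tighten from coarse to fine, and (ii) token-level occlusion simulation that swaps a fraction of one detection's tokens with tokens from another detection in the mini-batch. The threshold values, the `mix_ratio`, and the helper names (`assign_positives`, `occluded_token_mixing`) are all hypothetical.

```python
# Minimal sketch (assumptions noted above), not the official COAT code.
import torch

# Hypothetical per-stage IoU thresholds, tightening across the three cascade stages.
STAGE_IOU_THRESHOLDS = [0.5, 0.6, 0.7]

def assign_positives(ious: torch.Tensor, stage: int) -> torch.Tensor:
    """Label proposals as positives when their IoU with a ground-truth box
    exceeds the (tighter) threshold of the current cascade stage."""
    return ious >= STAGE_IOU_THRESHOLDS[stage]

def occluded_token_mixing(tokens: torch.Tensor, mix_ratio: float = 0.3) -> torch.Tensor:
    """Simulate occlusion at the token level: for each detection, replace a random
    subset of its tokens with tokens taken from another detection in the batch.

    tokens: (num_detections, num_tokens, dim)
    """
    n, t, _ = tokens.shape
    mixed = tokens.clone()
    perm = torch.randperm(n)                      # pair each detection with an "occluder"
    swap_mask = torch.rand(n, t) < mix_ratio      # which token positions get exchanged
    mixed[swap_mask] = tokens[perm][swap_mask]    # copy occluder tokens into place
    return mixed

if __name__ == "__main__":
    ious = torch.tensor([0.55, 0.65, 0.72, 0.40])
    for s in range(3):
        print(f"stage {s}: positives = {assign_positives(ious, s).tolist()}")
    feats = torch.randn(4, 16, 256)               # 4 detections, 16 tokens, 256-dim features
    print(occluded_token_mixing(feats).shape)     # torch.Size([4, 16, 256])
```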
Cite
Text
Yu et al. "Cascade Transformers for End-to-End Person Search." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00712
Markdown
[Yu et al. "Cascade Transformers for End-to-End Person Search." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/yu2022cvpr-cascade/) doi:10.1109/CVPR52688.2022.00712
BibTeX
@inproceedings{yu2022cvpr-cascade,
title = {{Cascade Transformers for End-to-End Person Search}},
author = {Yu, Rui and Du, Dawei and LaLonde, Rodney and Davila, Daniel and Funk, Christopher and Hoogs, Anthony and Clipp, Brian},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {7267--7276},
doi = {10.1109/CVPR52688.2022.00712},
url = {https://mlanthology.org/cvpr/2022/yu2022cvpr-cascade/}
}