ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang

ECCV 2020

doi:10.1007/978-3-030-58610-2_24

Abstract

Person search by natural language aims to retrieve, from a large-scale image pool, the specific person who matches a given textual description. While most current methods treat the task as holistic matching between visual and textual features, we approach it from an attribute-alignment perspective that grounds specific attribute phrases in the corresponding visual regions. This yields robust feature learning, in which the referred identity is accurately characterized by multiple attribute cues, and a corresponding performance boost. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle a person's feature space into sub-spaces corresponding to attributes, using a lightweight auxiliary attribute segmentation layer. It then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. We validate ViTAA through extensive experiments on person search by natural language and by attribute-phrase queries, on which it achieves state-of-the-art performance.
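The abstract does not spell out the form of the contrastive loss, so the following is only a rough illustration of the general idea of aligning per-attribute visual features with per-attribute phrase embeddings. It uses a standard InfoNCE loss with in-batch negatives, computed separately in each attribute sub-space and then averaged. That choice of loss, the temperature of 0.1, the toy feature values, and the names cosine, info_nce and attribute_alignment_loss are all assumptions made for this sketch. It is not the loss defined in the ViTAA paper.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(visual: List[Vector], textual: List[Vector],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss for a single attribute (e.g. "upper clothes").

    Generic InfoNCE, assumed for illustration (not the paper's loss).
    visual[i] and textual[i] describe the same person (the positive pair);
    every textual[j] with j != i serves as an in-batch negative for visual[i].
    """
    if len(visual) != len(textual) or not visual:
        raise ValueError("need equal, non-empty batches")
    total = 0.0
    for i, v in enumerate(visual):
        logits = [cosine(v, t) / temperature for t in textual]
        peak = max(logits)  # subtract the max before exp for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[i]  # equals -log softmax(logits)[i]
    return total / len(visual)


def attribute_alignment_loss(visual_attrs: List[List[Vector]],
                             textual_attrs: List[List[Vector]],
                             temperature: float = 0.1) -> float:
    """Average the per-attribute InfoNCE losses over K attribute sub-spaces.

    visual_attrs[k][i]: person i's visual feature in attribute sub-space k.
    textual_attrs[k][i]: embedding of person i's phrase for attribute k.
    """
    losses = [info_nce(v, t, temperature)
              for v, t in zip(visual_attrs, textual_attrs)]
    return sum(losses) / len(losses)


# Toy batch (made-up numbers): 2 persons, 2 attributes, 2-D features.
visual = [[[1.0, 0.0], [0.0, 1.0]],
          [[1.0, 1.0], [1.0, -1.0]]]
matched_text = visual                               # phrases paired correctly
swapped_text = [list(reversed(a)) for a in visual]  # phrases paired wrongly

aligned = attribute_alignment_loss(visual, matched_text)
misaligned = attribute_alignment_loss(visual, swapped_text)
print(f"aligned={aligned:.4f} misaligned={misaligned:.4f}")
```

In this toy batch, the correctly paired phrases give a loss close to zero. Swapping the phrases between the two persons gives a loss close to 10, which is 1/temperature. Computing the loss per attribute, rather than once on a single holistic embedding, mirrors the abstract's point that each attribute phrase is matched to its own visual sub-space.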

Cite

Text

Wang et al. "ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58610-2_24

Markdown

[Wang et al. "ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/wang2020eccv-vitaa/) doi:10.1007/978-3-030-58610-2_24

BibTeX

@inproceedings{wang2020eccv-vitaa,
  title     = {{ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language}},
  author    = {Wang, Zhe and Fang, Zhiyuan and Wang, Jun and Yang, Yezhou},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020},
  doi       = {10.1007/978-3-030-58610-2_24},
  url       = {https://mlanthology.org/eccv/2020/wang2020eccv-vitaa/}
}