See Finer, See More: Implicit Modality Alignment for Text-Based Person Retrieval

Shu, Xiujun; Wen, Wei; Wu, Haoqian; Chen, Keyu; Song, Yiran; Qiao, Ruizhi; Ren, Bo; Wang, Xiao

doi:10.1007/978-3-031-25072-9_42

ECCVW 2022 pp. 624-641

Abstract

Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a mapping of the visual and textual modalities into a common latent space. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Unlike previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at the sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
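
As an illustration of the retrieval setting the abstract describes, the following minimal sketch ranks gallery images against a text query once both have been embedded in a common latent space. It is not the IVT model: the embeddings are made-up toy vectors, cosine similarity is an assumed scoring function, and the names `cosine`, `rank_gallery`, `gallery`, and `query` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(text_emb, gallery):
    """Return (image_name, score) pairs sorted from best to worst match.

    `text_emb` is the embedding of the textual description and `gallery`
    maps image names to image embeddings in the same latent space.
    """
    scores = [(name, cosine(text_emb, emb)) for name, emb in gallery.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings (assumed values, for illustration only).
gallery = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.1, 0.8, 0.3],
    "img_c": [0.5, 0.5, 0.5],
}
query = [1.0, 0.2, 0.0]

ranking = rank_gallery(query, gallery)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real system the vectors would come from the trained visual and textual encoders, and the top-ranked images would be returned as the retrieval result.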

Cite

Text

Shu et al. "See Finer, See More: Implicit Modality Alignment for Text-Based Person Retrieval." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25072-9_42

Markdown

[Shu et al. "See Finer, See More: Implicit Modality Alignment for Text-Based Person Retrieval." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/shu2022eccvw-see/) doi:10.1007/978-3-031-25072-9_42

BibTeX

@inproceedings{shu2022eccvw-see,
  title     = {{See Finer, See More: Implicit Modality Alignment for Text-Based Person Retrieval}},
  author    = {Shu, Xiujun and Wen, Wei and Wu, Haoqian and Chen, Keyu and Song, Yiran and Qiao, Ruizhi and Ren, Bo and Wang, Xiao},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2022},
  pages     = {624-641},
  doi       = {10.1007/978-3-031-25072-9_42},
  url       = {https://mlanthology.org/eccvw/2022/shu2022eccvw-see/}
}