PLIP: Language-Image Pre-Training for Person Representation Learning

Zuo, Jialong; Hong, Jiahao; Zhang, Feng; Yu, Changqian; Zhou, Hanyu; Gao, Changxin; Sang, Nong; Wang, Jingdong

doi:10.52202/079017-1452

NeurIPS 2024

Abstract

Language-image pre-training is an effective technique for learning powerful representations in general domains. However, when applied directly to person representation learning, these general pre-training methods perform unsatisfactorily. The reason is that they neglect critical person-related characteristics, i.e., fine-grained attributes and identities. To address this issue, we propose a novel language-image pre-training framework for person representation learning, termed PLIP. Specifically, we design three pretext tasks: 1) Text-guided Image Colorization, which aims to establish the correspondence between person-related image regions and fine-grained color-part textual phrases; 2) Image-guided Attributes Prediction, which aims to mine fine-grained attribute information about the person's body in the image; and 3) Identity-based Vision-Language Contrast, which aims to correlate the cross-modal representations at the identity level rather than the instance level. Moreover, to implement our pre-training framework, we construct a large-scale person dataset of image-text pairs, named SYNTH-PEDES, by automatically generating textual annotations. We pre-train PLIP on SYNTH-PEDES and evaluate our models across a range of downstream person-centric tasks. PLIP not only significantly improves existing methods on all these tasks, but also shows strong ability in the zero-shot and domain generalization settings. The code, dataset and weights will be made publicly available.
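
To illustrate what contrasting "at the identity level rather than the instance level" can mean, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `identity_contrastive_loss`, the temperature value, and the choice to average the log-probability over all same-identity pairs are assumptions made for illustration only. The only idea taken from the abstract is that every image-text pair sharing a person identity counts as a positive, not just the pair from the same sample.

```python
import numpy as np


def identity_contrastive_loss(img_emb, txt_emb, ids, temperature=0.07):
    """Symmetric image-text contrastive loss with identity-level positives.

    Hypothetical sketch: a text is a positive for an image whenever both
    carry the same identity label, not only when they come from the same pair.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature           # shape (N, N)
    ids = np.asarray(ids)
    positives = ids[:, None] == ids[None, :]     # True where identities match

    def one_direction(lg, mask):
        # Row-wise log-softmax, shifted by the row maximum for stability.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # Average the log-probability over each row's positives.
        return -(log_prob * mask).sum(axis=1) / mask.sum(axis=1)

    image_to_text = one_direction(logits, positives)
    text_to_image = one_direction(logits.T, positives.T)
    return float((image_to_text.mean() + text_to_image.mean()) / 2)
```

If every sample has a distinct identity, the mask becomes the identity matrix and this sketch reduces to the usual instance-level image-text contrastive loss.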

Cite

Text

Zuo et al. "PLIP: Language-Image Pre-Training for Person Representation Learning." Neural Information Processing Systems, 2024. doi:10.52202/079017-1452

Markdown

[Zuo et al. "PLIP: Language-Image Pre-Training for Person Representation Learning." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zuo2024neurips-plip/) doi:10.52202/079017-1452

BibTeX

@inproceedings{zuo2024neurips-plip,
  title     = {{PLIP: Language-Image Pre-Training for Person Representation Learning}},
  author    = {Zuo, Jialong and Hong, Jiahao and Zhang, Feng and Yu, Changqian and Zhou, Hanyu and Gao, Changxin and Sang, Nong and Wang, Jingdong},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-1452},
  url       = {https://mlanthology.org/neurips/2024/zuo2024neurips-plip/}
}