Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-Identification

Abstract

Text-to-image person re-identification (TI-ReID) aims to retrieve the images of a target identity according to a given textual description. Existing TI-ReID methods focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited expression of image-text relationships and weaker semantic alignment. To address this problem, we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of each pedestrian as Gaussian distributions, where the multi-granularity uncertainty of the distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. This multi-modal uncertainty modeling acts as a feature augmentation and provides richer image-text semantic relationships. We then present a bi-directional cross-modal circle loss to align the probabilistic image and text features more effectively in a self-paced manner. To further promote comprehensive image-text semantic alignment, we design a task that complements masked language modeling, focusing on the cross-modality semantic recovery of the global masked token after cross-modal interaction. Extensive experiments on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-art approaches.
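To make the uncertainty-modeling idea in the abstract more concrete, the sketch below illustrates one plausible way to estimate batch-level and identity-level feature variances and to sample augmented probabilistic features via the reparameterization trick. It is a minimal sketch based only on the abstract; the function names, the equal weighting of the two variance granularities, and the fallback for singleton identities are illustrative assumptions, not the authors' actual implementation.

```python
import torch


def estimate_uncertainty(features, labels, eps=1e-6):
    """Estimate per-dimension variance at two granularities:
    the whole batch (batch-level) and each identity (identity-level).
    Assumption: the two levels are simply averaged with equal weight."""
    batch_var = features.var(dim=0, unbiased=False) + eps   # (D,)
    id_var = torch.zeros_like(features)                     # (N, D)
    for pid in labels.unique():
        mask = labels == pid
        if mask.sum() > 1:
            id_var[mask] = features[mask].var(dim=0, unbiased=False) + eps
        else:
            id_var[mask] = batch_var                         # singleton identity: fall back to batch variance
    return 0.5 * (batch_var.unsqueeze(0) + id_var)           # (N, D) per-sample variance


def sample_probabilistic_features(mu, var, num_samples=1):
    """Treat each embedding as a Gaussian N(mu, var) and draw samples with the
    reparameterization trick, acting as a feature-level augmentation."""
    std = var.sqrt()
    eps = torch.randn(num_samples, *mu.shape, device=mu.device)
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)           # (S, N, D)


# Toy usage with a batch of image (or text) embeddings and identity labels.
feats = torch.randn(8, 512)
pids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
var = estimate_uncertainty(feats, pids)
aug = sample_probabilistic_features(feats, var, num_samples=4)
print(aug.shape)  # torch.Size([4, 8, 512])
```

The sampled features can then be fed to the cross-modal alignment losses in place of (or alongside) the deterministic embeddings, which is how the abstract frames uncertainty modeling as a feature augmentation.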

Cite

Text

Zhao et al. "Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-Identification." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I7.28585

Markdown

[Zhao et al. "Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-Identification." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/zhao2024aaai-unifying/) doi:10.1609/AAAI.V38I7.28585

BibTeX

@inproceedings{zhao2024aaai-unifying,
  title     = {{Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-Identification}},
  author    = {Zhao, Zhiwei and Liu, Bin and Lu, Yan and Chu, Qi and Yu, Nenghai},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {7534--7542},
  doi       = {10.1609/AAAI.V38I7.28585},
  url       = {https://mlanthology.org/aaai/2024/zhao2024aaai-unifying/}
}