Enhancing Cross-Modal Completion and Alignment for Unsupervised Incomplete Text-to-Image Person Retrieval

Gong, Tiantian; Wang, Junsheng; Zhang, Liyan

doi:10.24963/ijcai.2024/88

IJCAI 2024 pp. 794-802

Abstract

Egocentric object-interaction anticipation is critical for applications such as augmented reality and robotics, but existing methods struggle with misaligned egocentric encoding, insufficient supervision, and underutilized historical context. These limitations stem from a lack of focus on retention, i.e., retaining long-term object-centric interactions, and on prediction, i.e., future-centric encoding and future uncertainty modeling. We introduce EgoAnticipator, a novel Retentive and Predictive Learning framework that addresses these challenges. Our approach combines retentive pre-training for domain-specific encoding, predictive pre-training for future uncertainty modeling, and mirror distillation to transfer future-informed knowledge. Additionally, we propose long-term memory prompting to integrate historical interaction cues. We evaluate the framework on the Ego4D short-term object interaction anticipation benchmark, covering both STAv1 and STAv2. Extensive experiments demonstrate that our framework outperforms existing methods, while ablation studies highlight the effectiveness of each component of the framework.

Cite

Text

Gong et al. "Enhancing Cross-Modal Completion and Alignment for Unsupervised Incomplete Text-to-Image Person Retrieval." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/88

Markdown

[Gong et al. "Enhancing Cross-Modal Completion and Alignment for Unsupervised Incomplete Text-to-Image Person Retrieval." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/gong2024ijcai-enhancing/) doi:10.24963/ijcai.2024/88

BibTeX

@inproceedings{gong2024ijcai-enhancing,
  title     = {{Enhancing Cross-Modal Completion and Alignment for Unsupervised Incomplete Text-to-Image Person Retrieval}},
  author    = {Gong, Tiantian and Wang, Junsheng and Zhang, Liyan},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {794--802},
  doi       = {10.24963/ijcai.2024/88},
  url       = {https://mlanthology.org/ijcai/2024/gong2024ijcai-enhancing/}
}