FIRM: Fusion-Injected Residual Memory Brings Token-Level Alignment to Unsupervised VI-ReID

Abstract

Unsupervised visible-infrared person re-identification (VI-ReID) presents unique challenges due to severe modality discrepancies, including heterogeneous appearance gaps, semantic granularity mismatches, and pseudo-label noise amplification intrinsic to label-free scenarios. We distill these challenges into two core problems: fine-grained semantic alignment, which necessitates explicit token-level cross-modal feature fusion, and memory fragmentation caused by noisy pseudo-label propagation. To address these issues, we propose Fusion-Injected Residual Memory (FIRM), a unified framework that integrates two modules. Vision–Semantic Prompt Fusion (VSPF) injects multi-scale textual cues derived from CLIP and large language models into multiple layers of a vision backbone for token-wise semantic alignment. Evolving Multi-view Cluster Memory (EMCM) employs optimal transport–guided clustering and dynamic prototype maintenance to ensure long-term identity consistency. The framework is optimized end-to-end using an optimal transport–weighted InfoNCE loss, a multi-layer alignment regularizer, and geometric cluster regularization, all without reliance on manual annotations. Extensive experiments on benchmark VI-ReID datasets demonstrate that the proposed method substantially advances unsupervised cross-modal retrieval performance, achieving new state-of-the-art results. Ablation studies further verify the independent and synergistic effectiveness of both modules in overcoming the identified core challenges.
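
The abstract does not spell out the optimal transport–weighted InfoNCE loss, so the following is only an illustrative sketch of one plausible reading, not the authors' formulation. It assumes Sinkhorn iterations with uniform marginals produce soft sample-to-prototype assignments, and that those assignments weight a prototype-contrastive (InfoNCE-style) log-likelihood. The function names `sinkhorn` and `ot_weighted_info_nce`, and the values of `eps`, `n_iters`, and `tau`, are hypothetical choices made for this example.

```python
import numpy as np


def sinkhorn(scores, eps=0.05, n_iters=50):
    """Entropic OT plan with uniform marginals (samples x prototypes).

    Assumes `scores` are bounded (e.g. cosine similarities) so that
    exp(scores / eps) does not underflow to an all-zero row or column.
    Returns a matrix whose rows each sum to 1.
    """
    n, k = scores.shape
    q = np.exp((scores - scores.max()) / eps)
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True) * k  # each column sums to 1/k
        q /= q.sum(axis=1, keepdims=True) * n  # each row sums to 1/n
    return q * n


def ot_weighted_info_nce(features, prototypes, tau=0.07):
    """InfoNCE over cluster prototypes with OT soft assignments as weights.

    Assumption for this sketch: the OT plan is treated as a fixed target,
    so the loss is a soft-label cross-entropy over prototype logits.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = f @ p.T                                # cosine similarities
    weights = sinkhorn(sim)                      # soft pseudo-labels
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(weights * log_prob).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))    # 8 samples, 16-dim features
    protos = rng.normal(size=(4, 16))   # 4 cluster prototypes
    print(ot_weighted_info_nce(feats, protos))
```

The uniform column marginal pushes the assignment to spread samples evenly across prototypes, which is one common way to keep noisy pseudo-labels from collapsing into a few clusters. Whether FIRM uses this particular constraint is not stated in the abstract.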

Cite

Text

Rong et al. "FIRM: Fusion-Injected Residual Memory Brings Token-Level Alignment to Unsupervised VI-ReID." Proceedings of the 17th Asian Conference on Machine Learning, 2025.

Markdown

[Rong et al. "FIRM: Fusion-Injected Residual Memory Brings Token-Level Alignment to Unsupervised VI-ReID." Proceedings of the 17th Asian Conference on Machine Learning, 2025.](https://mlanthology.org/acml/2025/rong2025acml-firm/)

BibTeX

@inproceedings{rong2025acml-firm,
  title     = {{FIRM: Fusion-Injected Residual Memory Brings Token-Level Alignment to Unsupervised VI-ReID}},
  author    = {Rong, Ze and Shen, Xiaofeng and Qin, Haoyang and Xu, Yue and Li, Hongjun and Ma, Lei},
  booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
  year      = {2025},
  pages     = {1134--1149},
  volume    = {304},
  url       = {https://mlanthology.org/acml/2025/rong2025acml-firm/}
}