Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment
Abstract
Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that efficiently learns the semantic similarities between images and texts. Traditional fine-grained alignment methods rely heavily on pre-trained object detectors to extract region features for subsequent region-word alignment, thereby incurring substantial computational costs for region detection and error-propagation issues from two-stage training. In this paper, we focus on the mainstream vision transformer, incorporating patch features for patch-word alignment while addressing the resulting issues of visual patch redundancy and patch ambiguity in semantic alignment. We propose a novel Linguistic-Aware Patch Slimming (LAPS) framework for fine-grained alignment, which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patch-word alignment. Extensive experiments on various evaluation benchmarks and model backbones show that LAPS outperforms state-of-the-art fine-grained alignment methods by 5%-15% rSum. Our code is available at https://github.com/CrossmodalGroup/LAPS
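The core idea in the abstract, identifying redundant visual patches under language supervision and keeping only those relevant for patch-word alignment, can be sketched as follows. This is a minimal illustrative example, not the official LAPS implementation: the scoring rule (each patch's best cosine similarity to any word) and the `keep_ratio` parameter are assumptions for demonstration.

```python
import numpy as np

def slim_patches(patch_feats, word_feats, keep_ratio=0.5):
    """Hypothetical sketch of language-guided patch slimming: score each
    visual patch by its best cosine similarity to any word embedding,
    then keep only the top-scoring fraction of patches."""
    # L2-normalize so dot products become cosine similarities
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    sim = p @ w.T                     # (num_patches, num_words) similarities
    scores = sim.max(axis=1)          # each patch's best-matching word score
    k = max(1, int(len(p) * keep_ratio))
    keep = np.argsort(-scores)[:k]    # indices of the k most relevant patches
    return np.sort(keep)

# Usage: 8 patches and 4 words with 16-dim features; half the patches survive.
rng = np.random.default_rng(0)
patches = rng.standard_normal((8, 16))
words = rng.standard_normal((4, 16))
kept = slim_patches(patches, words, keep_ratio=0.5)
```

In the paper the kept patches are further rectified semantically and spatially before patch-word alignment; this sketch covers only the language-supervised selection step.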
Cite
Text
Fu et al. "Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02485
Markdown
[Fu et al. "Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/fu2024cvpr-linguisticaware/) doi:10.1109/CVPR52733.2024.02485
BibTeX
@inproceedings{fu2024cvpr-linguisticaware,
title = {{Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment}},
author = {Fu, Zheren and Zhang, Lei and Xia, Hou and Mao, Zhendong},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {26307-26316},
doi = {10.1109/CVPR52733.2024.02485},
url = {https://mlanthology.org/cvpr/2024/fu2024cvpr-linguisticaware/}
}