Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment
Abstract
Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that efficiently learns the semantic similarities between images and texts. Traditional fine-grained alignment methods rely heavily on pre-trained object detectors to extract region features for subsequent region-word alignment, thereby incurring substantial computational costs for region detection and error-propagation issues from two-stage training. In this paper, we focus on the mainstream vision transformer, incorporating patch features for patch-word alignment while addressing the resulting issues of visual patch redundancy and patch ambiguity in semantic alignment. We propose a novel Linguistic-Aware Patch Slimming (LAPS) framework for fine-grained alignment, which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patch-word alignment. Extensive experiments on various evaluation benchmarks and model backbones show that LAPS outperforms state-of-the-art fine-grained alignment methods by 5%-15% rSum. Our code is available at https://github.com/CrossmodalGroup/LAPS
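The core idea in the abstract, identifying redundant visual patches under language supervision and keeping only those relevant for patch-word alignment, can be sketched as follows. This is a minimal illustrative example, not the official LAPS implementation: the scoring rule (each patch's best cosine similarity to any word) and the `keep_ratio` parameter are assumptions for demonstration.

```python
import numpy as np

def slim_patches(patch_feats, word_feats, keep_ratio=0.5):
    """Hypothetical sketch of language-guided patch slimming: score each
    visual patch by its best cosine similarity to any word embedding,
    then keep only the top-scoring fraction of patches."""
    # L2-normalize so dot products become cosine similarities
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    sim = p @ w.T                     # (num_patches, num_words) similarities
    scores = sim.max(axis=1)          # each patch's best-matching word score
    k = max(1, int(len(p) * keep_ratio))
    keep = np.argsort(-scores)[:k]    # indices of the k most relevant patches
    return np.sort(keep)

# Usage: 8 patches and 4 words with 16-dim features; half the patches survive.
rng = np.random.default_rng(0)
patches = rng.standard_normal((8, 16))
words = rng.standard_normal((4, 16))
kept = slim_patches(patches, words, keep_ratio=0.5)
```

In the paper the kept patches are further rectified semantically and spatially before patch-word alignment; this sketch covers only the language-supervised selection step.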
Cite
Text
Fu et al. "Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02485
Markdown
[Fu et al. "Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/fu2024cvpr-linguisticaware/) doi:10.1109/CVPR52733.2024.02485
BibTeX
@inproceedings{fu2024cvpr-linguisticaware,
title = {{Linguistic-Aware Patch Slimming Framework for Fine-Grained Cross-Modal Alignment}},
author = {Fu, Zheren and Zhang, Lei and Xia, Hou and Mao, Zhendong},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {26307-26316},
doi = {10.1109/CVPR52733.2024.02485},
url = {https://mlanthology.org/cvpr/2024/fu2024cvpr-linguisticaware/}
}