Vision-Guided Text Mining for Unsupervised Cross-Modal Hashing with Community Similarity Quantization
Abstract
Cross-modal retrieval, an emerging field within multimedia research, has gained significant attention in recent years. Unsupervised cross-modal hashing methods are attractive because they capture latent relationships within the data without label supervision and produce compact hash codes for high search efficiency. However, the text modality has weaker representation ability than the image modality, providing only weak guidance for constructing the joint similarity matrix. Moreover, most unsupervised cross-modal hashing methods train on pairwise similarities, resulting in a non-aggregating data distribution in the hash space. In this paper, we propose Vision-guided Text Mining for Unsupervised Cross-modal Hashing via Community Similarity Quantization (VTM-UCH). Specifically, we first establish a one-to-one correspondence between each word and a vision (an image or object) using the Contrastive Language-Image Pre-training (CLIP) model and compute text similarities from the clustering of the corresponding visions. We then define fine-grained object-level image similarities and construct the joint similarity matrix from the text and image similarities. Finally, we build an undirected graph, compute its communities as pseudo-centers, and adjust the pairwise similarities to improve the hash code distribution. Experimental results on two common datasets demonstrate accuracy improvements over state-of-the-art baselines.
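The pipeline in the abstract — a joint image-text similarity matrix, an undirected graph over samples, and communities used as pseudo-centers to adjust pairwise similarities — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fusion weight `alpha`, the edge `threshold`, the `boost` applied to intra-community pairs, and the choice of Louvain community detection are all assumptions, since the abstract does not specify them.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community


def joint_similarity(img_feats, txt_feats, alpha=0.5):
    """Fuse image and text cosine similarities into one joint matrix.

    Hypothetical stand-in: the paper derives text similarities from
    vision clustering and fine-grained object-level image similarities;
    here both sides are plain cosine similarities, weighted by `alpha`.
    """
    def cosine(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    return alpha * cosine(img_feats) + (1.0 - alpha) * cosine(txt_feats)


def community_adjusted_similarity(sim, threshold=0.6, boost=0.1):
    """Detect communities on a thresholded graph and lift intra-community
    similarities so same-community samples aggregate in the hash space."""
    n = sim.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                graph.add_edge(i, j, weight=float(sim[i, j]))

    # Louvain community detection (networkx >= 2.8) as a generic choice;
    # each detected community plays the role of a pseudo-center.
    communities = community.louvain_communities(graph, weight="weight", seed=0)
    labels = np.empty(n, dtype=int)
    for cid, members in enumerate(communities):
        for node in members:
            labels[node] = cid

    # Boost similarities of pairs that fall in the same community.
    adjusted = sim.copy()
    same = labels[:, None] == labels[None, :]
    adjusted[same] = np.clip(adjusted[same] + boost, -1.0, 1.0)
    return adjusted, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(20, 512))  # stand-in for CLIP image embeddings
    txt = rng.normal(size=(20, 512))  # stand-in for CLIP text embeddings
    adjusted, labels = community_adjusted_similarity(joint_similarity(img, txt))
    print(labels)
```

In a real training loop, the adjusted matrix would supervise hash-code learning in place of the raw pairwise similarities.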
Cite
Text
Fan and Cao. "Vision-Guided Text Mining for Unsupervised Cross-Modal Hashing with Community Similarity Quantization." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I3.32290
Markdown
[Fan and Cao. "Vision-Guided Text Mining for Unsupervised Cross-Modal Hashing with Community Similarity Quantization." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/fan2025aaai-vision/) doi:10.1609/AAAI.V39I3.32290
BibTeX
@inproceedings{fan2025aaai-vision,
title = {{Vision-Guided Text Mining for Unsupervised Cross-Modal Hashing with Community Similarity Quantization}},
author = {Fan, Haozhi and Cao, Yuan},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {2843-2851},
doi = {10.1609/AAAI.V39I3.32290},
url = {https://mlanthology.org/aaai/2025/fan2025aaai-vision/}
}