Hierarchical Encoding Tree with Modality Mixup for Cross-Modal Hashing

Xiao, Zhiping; Luo, Junyu; Zhou, Hang; Zhao, Yusheng; Luo, Xiao; Wang, Pengyun; Ju, Wei; Heng, Siyu; Zhang, Ming

ICLR 2026

Abstract

Cross-modal retrieval is a fundamental task that aims to learn semantic correspondences across different data modalities, such as visual and textual modalities. Unsupervised hashing methods handle large-scale data efficiently and apply well to cross-modal retrieval. However, existing methods typically fail to fully exploit the hierarchical semantic structure within text and image data, where instances naturally organize into multi-level communities of varying granularity. Moreover, the commonly used direct modal alignment cannot effectively bridge the semantic gap between these two modalities. To address these issues, we introduce a novel Hierarchical Encoding Tree with Modality Mixup (HINT) method, which achieves effective cross-modal retrieval by extracting hierarchical cross-modal relations. HINT constructs a cross-modal encoding tree guided by hierarchical structural entropy and, from this tree, generates proxy samples of the text and image modalities for each instance. Through curriculum-based mixup of these proxy samples, HINT achieves progressive modal alignment and effective cross-modal retrieval. We also conduct cross-modal consistency learning to achieve global-view semantic alignment between text and image representations. Extensive experiments on a range of cross-modal retrieval datasets demonstrate the superiority of HINT over state-of-the-art methods.
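To make the idea of curriculum-based mixup followed by hashing more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the mixup schedule, the proxy construction, or the binarization step, so the linear schedule in `mixup_weight`, the convex combination in `mixup`, the sign-based `to_hash_code`, and the toy proxy vectors are all assumptions made for illustration.

```python
def mixup_weight(epoch, total_epochs):
    """Hypothetical linear curriculum (an assumption, not from the paper):
    the share of the other modality's proxy grows from 0.0 to 0.5."""
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    return 0.5 * progress


def mixup(own, other, lam):
    """Convex combination of two proxy vectors of equal length."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(own, other)]


def to_hash_code(vec):
    """Binarize a real-valued vector into a {-1, +1} code by sign."""
    return [1 if v >= 0 else -1 for v in vec]


def hamming(a, b):
    """Number of positions where two codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy proxy samples for one instance (made-up values).
image_proxy = [0.9, -0.4, 0.2, -0.7]
text_proxy = [0.5, 0.6, -0.3, -0.9]

if __name__ == "__main__":
    text_code = to_hash_code(text_proxy)
    for epoch in (0, 5, 9):
        lam = mixup_weight(epoch, total_epochs=10)
        mixed_code = to_hash_code(mixup(image_proxy, text_proxy, lam))
        # Distance from the mixed image-side code to the text code.
        print(epoch, round(lam, 2), mixed_code, hamming(mixed_code, text_code))
```

Under these assumptions, early epochs leave each modality's proxy almost untouched, while later epochs blend in more of the other modality, which is one simple way to read "progressive modal alignment". Retrieval then compares binary codes by Hamming distance, which is what makes hashing cheap at scale.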

Cite

Text

Xiao et al. "Hierarchical Encoding Tree with Modality Mixup for Cross-Modal Hashing." International Conference on Learning Representations, 2026.

Markdown

[Xiao et al. "Hierarchical Encoding Tree with Modality Mixup for Cross-Modal Hashing." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xiao2026iclr-hierarchical/)

BibTeX

@inproceedings{xiao2026iclr-hierarchical,
  title     = {{Hierarchical Encoding Tree with Modality Mixup for Cross-Modal Hashing}},
  author    = {Xiao, Zhiping and Luo, Junyu and Zhou, Hang and Zhao, Yusheng and Luo, Xiao and Wang, Pengyun and Ju, Wei and Heng, Siyu and Zhang, Ming},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xiao2026iclr-hierarchical/}
}