Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

Wang, Bingchao; Ning, Zhiwei; Ding, Jianyu; Gao, Xuanang; Li, Yin; Jiang, Dongsheng; Yang, Jie; Liu, Wei

ICCV 2025 pp. 20694-20704

Abstract

CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) a dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability; (2) multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction; and (3) a hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input. The code is available at https://github.com/bcwang-sjtu/Fix-CLIP.
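To make the dual-branch idea in item (1) concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes a symmetric InfoNCE contrastive loss per branch, mean pooling over patch features as a stand-in for the image encoder, random patch dropping as the masking step, and an unweighted sum of the two branch losses. The function names, the `keep_ratio` parameter, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def info_nce(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss; row k of each input is a matching pair."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]

    def cross_entropy(rows):
        # Mean cross-entropy where the target of row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_image))


def mask_patches(patches, keep_ratio, rng):
    """Keep a random subset of patch features (assumed masking scheme)."""
    k = max(1, int(len(patches) * keep_ratio))
    kept = sorted(rng.sample(range(len(patches)), k))
    return [patches[i] for i in kept]


def pool(patches):
    """Mean-pool patch features; a placeholder for a real image encoder."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]


def dual_branch_loss(patch_feats, short_text_feats, long_text_feats,
                     keep_ratio=0.5, rng=None):
    """Short texts pair with masked images, long texts with raw images."""
    if rng is None:
        rng = random.Random(0)
    raw = [pool(p) for p in patch_feats]
    masked = [pool(mask_patches(p, keep_ratio, rng)) for p in patch_feats]
    # Unweighted sum of the two branch losses (an assumption of this sketch).
    return info_nce(masked, short_text_feats) + info_nce(raw, long_text_feats)
```

In this sketch both branches share the same image features, so lowering one branch's loss cannot ignore the other, which is one way to read the abstract's claim that long-text representation improves while short-text ability is preserved.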

Cite

Text

Wang et al. "Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text." International Conference on Computer Vision, 2025.

Markdown

[Wang et al. "Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/wang2025iccv-fixclip/)

BibTeX

@inproceedings{wang2025iccv-fixclip,
  title     = {{Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text}},
  author    = {Wang, Bingchao and Ning, Zhiwei and Ding, Jianyu and Gao, Xuanang and Li, Yin and Jiang, Dongsheng and Yang, Jie and Liu, Wei},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {20694-20704},
  url       = {https://mlanthology.org/iccv/2025/wang2025iccv-fixclip/}
}