Hybrid-Tower: Fine-Grained Pseudo-Query Interaction and Generation for Text-to-Video Retrieval
Abstract
The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos using textual queries that share the same semantic meaning. Recent CLIP-based approaches have explored two frameworks, Two-Tower and Single-Tower, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that combines the advantages of the Two-Tower and Single-Tower frameworks, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video features and the textual features of the pseudo-query to interact in a fine-grained manner, as in Single-Tower approaches, and thus retain high effectiveness even before the real textual query is received. At the same time, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of 1.6% to 3.9% in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.
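The retrieval flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`generate_pseudo_query`, `interact`, `build_video_index`, `retrieve`) are hypothetical, the mean-pooling "generator" is a stand-in for the learned pseudo-query generator, and attention pooling is only one possible form of fine-grained interaction. The sketch shows the structural point only: the interaction happens offline, each video is stored as a single vector, and query time needs just a dot product, as in a Two-Tower model.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_pseudo_query(frames):
    """Assumed stand-in for a learned generator: mean-pooled frame features."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return l2_normalize(mean)


def interact(frames, pseudo_query, temperature=0.1):
    """Pseudo-query-guided attention pooling over frames -> one video vector."""
    frames = [l2_normalize(f) for f in frames]
    scores = [dot(f, pseudo_query) / temperature for f in frames]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(frames[0])
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frames)) / total
        for i in range(dim)
    ]
    return l2_normalize(pooled)


def build_video_index(videos):
    """Offline step: one stored vector per video, before any real query exists."""
    return [interact(v, generate_pseudo_query(v)) for v in videos]


def retrieve(query_vec, video_index, k=3):
    """Online step: cosine similarity against stored vectors, Two-Tower style."""
    q = l2_normalize(query_vec)
    sims = [dot(v, q) for v in video_index]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return [(i, sims[i]) for i in order]


# Toy data: 4 videos, 12 frames each, 8-dimensional random frame features.
random.seed(0)
videos = [
    [[random.gauss(0, 1) for _ in range(8)] for _ in range(12)]
    for _ in range(4)
]
index = build_video_index(videos)

# Using a stored video vector as the query should rank that video first.
results = retrieve(index[2], index, k=3)
```

In a real system the frame and query features would come from CLIP-style encoders; the random features here only exercise the control flow.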
Cite
Text
Lan et al. "Hybrid-Tower: Fine-Grained Pseudo-Query Interaction and Generation for Text-to-Video Retrieval." International Conference on Computer Vision, 2025.
Markdown
[Lan et al. "Hybrid-Tower: Fine-Grained Pseudo-Query Interaction and Generation for Text-to-Video Retrieval." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/lan2025iccv-hybridtower/)
BibTeX
@inproceedings{lan2025iccv-hybridtower,
title = {{Hybrid-Tower: Fine-Grained Pseudo-Query Interaction and Generation for Text-to-Video Retrieval}},
author = {Lan, Bangxiang and Xie, Ruobing and Zhao, Ruixiang and Sun, Xingwu and Kang, Zhanhui and Yang, Gang and Li, Xirong},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {24497--24506},
url = {https://mlanthology.org/iccv/2025/lan2025iccv-hybridtower/}
}