Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

Abstract

Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike in text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
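
To put the two headline figures in context, the following is a minimal back-of-the-envelope sketch in Python. It takes only the 49 kB per image and 50k pairs/second numbers from the abstract; the function names, the decimal reading of kB (1 kB = 1000 bytes), the example corpus size, and the example candidate count are illustrative assumptions, not values reported by the authors.

```python
# Figures quoted in the abstract.
BYTES_PER_IMAGE = 49_000    # 49 kB per image; decimal kB is an assumption
PAIRS_PER_SECOND = 50_000   # image-text pairs scored per second


def index_storage_gb(num_images: int) -> float:
    """Disk needed to store precomputed visual tokens, in decimal GB."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images * BYTES_PER_IMAGE / 1e9


def rerank_queries_per_second(top_k: int) -> float:
    """Queries per second if each query reranks top_k candidate images.

    Assumes throughput scales linearly with the number of pairs scored,
    which ignores first-stage retrieval cost and batching effects.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    return PAIRS_PER_SECOND / top_k


if __name__ == "__main__":
    # Hypothetical deployment: 1M-image corpus, 100 candidates per query.
    print(f"storage: {index_storage_gb(1_000_000):.1f} GB")
    print(f"rerank rate: {rerank_queries_per_second(100):.0f} queries/s")
```

Under these assumptions, a one-million-image corpus would need roughly 49 GB of token storage, and reranking 100 candidates per query would allow about 500 queries per second on the reranking stage alone.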

Cite

Text

Taraday et al. "Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking." International Conference on Learning Representations, 2026.

Markdown

[Taraday et al. "Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/taraday2026iclr-efficient/)

BibTeX

@inproceedings{taraday2026iclr-efficient,
  title     = {{Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking}},
  author    = {Taraday, Mitchell Keren and Wagner, Shahaf and Baskin, Chaim},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/taraday2026iclr-efficient/}
}