Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Abstract
Cross-modal retrieval is becoming more effective and attracting growing interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that supports multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents in which images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, on both the query and the document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates textual and visual features at different layers and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on the M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT.
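As a rough illustration of the idea described in the abstract, the following minimal sketch shows how a recurrent cell with sigmoidal gates might merge textual and visual features layer by layer. It is not the authors' implementation: the function names (`gated_recurrent_step`, `fuse_layers`), the single-head attention without learned projections, the exact gate formulas, and the random weights standing in for trained parameters are all assumptions made for illustration. The released code at the repository above is the authoritative reference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ keys_values

def gated_recurrent_step(state, text_feats, image_feats, w_forget, w_input):
    """One hypothetical recurrent step over the features of a single backbone layer."""
    # The state tokens attend jointly to the textual and visual tokens of this layer.
    context = np.concatenate([text_feats, image_feats], axis=0)
    candidate = cross_attention(state, context)
    # LSTM-inspired sigmoidal gates (assumed form): how much of the previous
    # state to keep, and how much of the new candidate to let in.
    forget_gate = sigmoid(state @ w_forget)
    input_gate = sigmoid(candidate @ w_input)
    return forget_gate * state + input_gate * candidate

def fuse_layers(text_layers, image_layers, num_tokens=4, seed=0):
    """Run the cell from the first to the last layer; return the final state."""
    rng = np.random.default_rng(seed)
    dim = text_layers[0].shape[-1]
    # Random initial state and gate weights stand in for trained parameters.
    state = rng.normal(size=(num_tokens, dim))
    w_forget = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    w_input = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    for text_feats, image_feats in zip(text_layers, image_layers):
        state = gated_recurrent_step(state, text_feats, image_feats, w_forget, w_input)
    return state

# Toy usage: 3 backbone layers, 5 text tokens and 7 image tokens, feature size 8.
rng = np.random.default_rng(1)
text_layers = [rng.normal(size=(5, 8)) for _ in range(3)]
image_layers = [rng.normal(size=(7, 8)) for _ in range(3)]
fused = fuse_layers(text_layers, image_layers)
print(fused.shape)  # (4, 8)
```

In this sketch the final state plays the role of the query or document representation; how ReT actually scores a query against a document is not specified in the abstract and is left out here.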
Cite
Text
Caffagni et al. "Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00867
Markdown
[Caffagni et al. "Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/caffagni2025cvpr-recurrenceenhanced/) doi:10.1109/CVPR52734.2025.00867
BibTeX
@inproceedings{caffagni2025cvpr-recurrenceenhanced,
title = {{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
author = {Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {9286-9295},
doi = {10.1109/CVPR52734.2025.00867},
url = {https://mlanthology.org/cvpr/2025/caffagni2025cvpr-recurrenceenhanced/}
}