Optimal Transport Aggregation for Visual Place Recognition
Abstract
The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations, and we also introduce a "dustbin" cluster designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines on public VPR datasets, but also outperforms two-stage methods that add a re-ranking step at significantly higher cost.
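The abstract does not spell out the formulation, so the following is only a minimal sketch of the general idea: entropy-regularized optimal transport solved with log-domain Sinkhorn iterations, with one extra "dustbin" column that can absorb mass from uninformative features. The function names, the fixed `dustbin_score`, the uniform marginals, and the iteration count are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sinkhorn_with_dustbin(scores, dustbin_score=1.0, n_iters=50):
    """Soft-assign n features to m clusters plus one dustbin column.

    scores: n x m nested list of feature-to-cluster affinities (higher = closer).
    Returns the n x m transport plan with the dustbin column dropped.
    """
    n = len(scores)
    m = len(scores[0]) + 1  # real clusters plus the dustbin
    # Append the (assumed constant) dustbin affinity to every feature row.
    s = [list(row) + [dustbin_score] for row in scores]
    # Assumed uniform marginals: each feature sends mass 1/n,
    # each column (dustbin included) receives mass 1/m.
    log_a, log_b = -math.log(n), -math.log(m)
    u = [0.0] * n  # log-domain row scalings
    v = [0.0] * m  # log-domain column scalings
    for _ in range(n_iters):
        # Row normalization: accounts for feature-to-cluster relations.
        u = [log_a - logsumexp([s[i][j] + v[j] for j in range(m)])
             for i in range(n)]
        # Column normalization: accounts for cluster-to-feature relations.
        v = [log_b - logsumexp([s[i][j] + u[i] for i in range(n)])
             for j in range(m)]
    plan = [[math.exp(s[i][j] + u[i] + v[j]) for j in range(m)]
            for i in range(n)]
    # Mass routed to the dustbin is discarded from the final assignment.
    return [row[:-1] for row in plan]


# Hypothetical affinities: two distinctive features and one flat, uninformative one.
scores = [[2.0, 0.1], [0.2, 1.5], [0.0, 0.0]]
plan = sinkhorn_with_dustbin(scores)
retained = [sum(row) for row in plan]  # mass each feature keeps after the dustbin
```

In this toy input, the third feature has no preference for either cluster, so a larger share of its mass goes to the dustbin and it contributes less to the aggregated descriptor than the first two.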
Cite
Text
Izquierdo and Civera. "Optimal Transport Aggregation for Visual Place Recognition." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01672
Markdown
[Izquierdo and Civera. "Optimal Transport Aggregation for Visual Place Recognition." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/izquierdo2024cvpr-optimal/) doi:10.1109/CVPR52733.2024.01672
BibTeX
@inproceedings{izquierdo2024cvpr-optimal,
title = {{Optimal Transport Aggregation for Visual Place Recognition}},
author = {Izquierdo, Sergio and Civera, Javier},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {17658-17668},
doi = {10.1109/CVPR52733.2024.01672},
url = {https://mlanthology.org/cvpr/2024/izquierdo2024cvpr-optimal/}
}