Learning Multi-Scene Absolute Pose Regression with Transformers
Abstract
Absolute camera pose regression methods estimate the position and orientation of a camera using only the captured image. A convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time. Recently, this framework was extended to learn multiple scenes with a single model by adding a multi-layer perceptron head per scene. In this work, we propose to learn multi-scene absolute camera pose regression with transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features into candidate pose predictions in parallel, each associated with a different scene. This formulation allows our model to focus on general features that are informative for localization while embedding multiple scenes at once. We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and single-scene absolute pose regressors.
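The abstract outlines the architecture at a high level: a convolutional backbone produces activation maps, a transformer encoder aggregates them with self-attention, and decoder queries (one per scene) produce candidate poses in parallel. Below is a minimal PyTorch sketch of that idea; all module names and hyperparameters (the ResNet-34 backbone, `d_model=256`, the head names) are illustrative assumptions, not the authors' released implementation, and positional encodings are omitted for brevity.

```python
# Minimal sketch of multi-scene absolute pose regression with a transformer.
# Illustrative assumptions throughout; not the paper's released code.
import torch
import torch.nn as nn
import torchvision


class MultiScenePoseTransformer(nn.Module):
    def __init__(self, num_scenes: int, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 6):
        super().__init__()
        # Convolutional backbone (hypothetical choice: ResNet-34 trunk,
        # with the average-pool and classification layers removed).
        backbone = torchvision.models.resnet34(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Project backbone activation maps to the transformer width.
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        # One learned decoder query per scene -> parallel candidate poses.
        self.scene_queries = nn.Parameter(torch.randn(num_scenes, d_model))
        self.head_t = nn.Linear(d_model, 3)  # position (x, y, z)
        self.head_q = nn.Linear(d_model, 4)  # orientation quaternion

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feats = self.proj(self.backbone(img))        # (B, D, H, W)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, H*W, D) token sequence
        queries = self.scene_queries.unsqueeze(0).expand(img.size(0), -1, -1)
        # Encoder self-attention over tokens; decoder attends per scene query.
        latent = self.transformer(tokens, queries)   # (B, num_scenes, D)
        t = self.head_t(latent)                      # per-scene position candidates
        q = self.head_q(latent)
        q = q / q.norm(dim=-1, keepdim=True)         # normalize to unit quaternion
        return torch.cat([t, q], dim=-1)             # (B, num_scenes, 7)
```

In this sketch, each of the `num_scenes` decoder outputs is a candidate 7-DoF pose; selecting which candidate to report for a given image would require an additional scene-identification step, which the sketch leaves out.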
Cite
Text
Shavit et al. "Learning Multi-Scene Absolute Pose Regression with Transformers." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00273
Markdown
[Shavit et al. "Learning Multi-Scene Absolute Pose Regression with Transformers." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/shavit2021iccv-learning/) doi:10.1109/ICCV48922.2021.00273
BibTeX
@inproceedings{shavit2021iccv-learning,
title = {{Learning Multi-Scene Absolute Pose Regression with Transformers}},
author = {Shavit, Yoli and Ferens, Ron and Keller, Yosi},
booktitle = {International Conference on Computer Vision},
year = {2021},
pages = {2733--2742},
doi = {10.1109/ICCV48922.2021.00273},
url = {https://mlanthology.org/iccv/2021/shavit2021iccv-learning/}
}