Translating Images into Maps (Extended Abstract)

Abstract

We approach instantaneous mapping, converting images to a top-down view of the world, as a translation problem. We show how a novel form of transformer network can be used to map from images and video directly to an overhead map or bird's-eye view (BEV) of the world, in a single end-to-end network. We assume a one-to-one correspondence between a vertical scanline in the image and rays passing through the camera location in an overhead map. This lets us formulate map generation from an image as a set of sequence-to-sequence translations. This constrained formulation, based upon a strong physical grounding of the problem, leads to a restricted transformer network that is convolutional in the horizontal direction only. The structure allows us to make efficient use of data when training, and obtains state-of-the-art results for instantaneous mapping on three large-scale datasets, including 15% and 30% relative gains over the existing best-performing methods on the nuScenes and Argoverse datasets, respectively.
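
The scanline-to-ray formulation is straightforward to picture in code. Below is a minimal sketch, assuming a PyTorch-style implementation with hypothetical names (ColumnToRayTranslator, ray_queries) and arbitrary feature shapes; it is not the authors' released code. Each vertical column of an image feature map is treated as a source sequence and translated, independently of its neighbours, into a polar BEV ray, so the same weights are shared across all columns and the module is convolutional in the horizontal direction only.

import torch
import torch.nn as nn

class ColumnToRayTranslator(nn.Module):
    """Translate each vertical image column into a BEV ray (hypothetical sketch)."""
    def __init__(self, channels=64, ray_len=50, nhead=4, num_layers=2):
        super().__init__()
        # One learned query per radial depth bin along the BEV ray.
        self.ray_queries = nn.Parameter(torch.randn(ray_len, channels))
        self.tf = nn.Transformer(d_model=channels, nhead=nhead,
                                 num_encoder_layers=num_layers,
                                 num_decoder_layers=num_layers,
                                 batch_first=True)

    def forward(self, feats):
        # feats: (B, C, H, W) image features from a backbone.
        b, c, h, w = feats.shape
        # Fold width into the batch: every column becomes an independent
        # source sequence of length H, which is what makes the module
        # "convolutional" along the horizontal axis.
        cols = feats.permute(0, 3, 2, 1).reshape(b * w, h, c)
        queries = self.ray_queries.expand(b * w, -1, -1)
        rays = self.tf(cols, queries)            # (B*W, ray_len, C)
        # Reassemble the per-column rays into a polar BEV feature map.
        return rays.reshape(b, w, -1, c)         # (B, W, ray_len, C)

The resulting polar rays would then be resampled into Cartesian coordinates to form the final overhead map; the paper's full architecture also aggregates temporal and multi-scale context, which this sketch omits.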

Cite

Text

Saha et al. "Translating Images into Maps (Extended Abstract)." International Joint Conference on Artificial Intelligence, 2023. doi:10.24963/IJCAI.2023/725

Markdown

[Saha et al. "Translating Images into Maps (Extended Abstract)." International Joint Conference on Artificial Intelligence, 2023.](https://mlanthology.org/ijcai/2023/saha2023ijcai-translating/) doi:10.24963/IJCAI.2023/725

BibTeX

@inproceedings{saha2023ijcai-translating,
  title     = {{Translating Images into Maps (Extended Abstract)}},
  author    = {Saha, Avishkar and Mendez, Oscar and Russell, Chris and Bowden, Richard},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {6486--6491},
  doi       = {10.24963/IJCAI.2023/725},
  url       = {https://mlanthology.org/ijcai/2023/saha2023ijcai-translating/}
}