Scene Representation in Bird's-Eye View from Surrounding Cameras with Transformers
Abstract
Scene representation in the bird's-eye-view (BEV) coordinate frame provides a succinct and effective way for autonomous vehicles and robots to understand their surrounding environments. In this work, we present an end-to-end architecture that generates the BEV representation from surrounding cameras. We propose a transformer-based encoder-decoder structure to translate the image features from the different cameras into the BEV frame, taking advantage of both the context information within each individual image and the relationships between images from different views. We perform multiple semantic segmentation tasks on the BEV features. Experimental results show that our model outperforms a competitive baseline [20], demonstrating the effectiveness and efficiency of our method.
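To make the view-translation idea concrete, below is a minimal PyTorch sketch of one common way to realize it: a grid of learned BEV queries cross-attends, through a transformer decoder, to the flattened image features of all surrounding cameras, and a linear head predicts per-cell segmentation logits. All names (BEVCrossAttention, bev_queries, seg_head), the grid size, and the feature shapes are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Illustrative sketch (not the paper's model): learned BEV grid
    queries attend to flattened multi-camera features, then a linear
    head produces per-cell semantic segmentation logits."""

    def __init__(self, bev_h=50, bev_w=50, dim=256, num_classes=4,
                 num_heads=8, num_layers=2):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # One learned query per BEV grid cell (position is implicit in the index).
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.seg_head = nn.Linear(dim, num_classes)

    def forward(self, cam_feats):
        # cam_feats: (B, num_cams, C, Hf, Wf) features from a shared image backbone.
        b, n, c, hf, wf = cam_feats.shape
        # Flatten every camera's feature map into one joint memory: (B, N*Hf*Wf, C).
        memory = cam_feats.permute(0, 1, 3, 4, 2).reshape(b, n * hf * wf, c)
        queries = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev = self.decoder(queries, memory)   # (B, bev_h*bev_w, C)
        logits = self.seg_head(bev)           # (B, bev_h*bev_w, num_classes)
        return logits.view(b, self.bev_h, self.bev_w, -1)

# Usage: six surround cameras with 256-channel feature maps.
feats = torch.randn(1, 6, 256, 16, 28)
print(BEVCrossAttention()(feats).shape)  # torch.Size([1, 50, 50, 4])

Because all cameras share one attention memory, each BEV cell can aggregate evidence across overlapping views rather than from a single image, which is the core motivation the abstract describes.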
Cite
Text
Zhao et al. "Scene Representation in Bird's-Eye View from Surrounding Cameras with Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022. doi:10.1109/CVPRW56347.2022.00497
Markdown
[Zhao et al. "Scene Representation in Bird's-Eye View from Surrounding Cameras with Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022.](https://mlanthology.org/cvprw/2022/zhao2022cvprw-scene/) doi:10.1109/CVPRW56347.2022.00497
BibTeX
@inproceedings{zhao2022cvprw-scene,
title = {{Scene Representation in Bird's-Eye View from Surrounding Cameras with Transformers}},
author = {Zhao, Yun and Zhang, Yu and Gong, Zhan and Zhu, Hong},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2022},
pages = {4510--4518},
doi = {10.1109/CVPRW56347.2022.00497},
url = {https://mlanthology.org/cvprw/2022/zhao2022cvprw-scene/}
}