ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model

Abstract

Recent progress in Bird's-Eye View (BEV) layout estimation has come through refinements in architectural and geometric design. However, existing methods often overlook the structured relationships among traffic elements. Components such as drivable areas, lane dividers, and pedestrian crossings constitute an interdependent system governed by civil engineering standards. For instance, stop lines precede crosswalks, which align with sidewalks, while lane dividers follow road curvature. To capture these interdependencies, we propose ARINBEV, an autoregressive model for BEV map estimation. Unlike prior generative approaches that rely on complex multiphase training or encoder-decoder architectures, ARINBEV employs a single-stage, decoder-only autoregressive design, which enables semantically consistent BEV map estimation. On nuScenes and Argoverse2, ARINBEV attains 64.3 and 65.6 mIoU, respectively, while using $1.7\times$ fewer parameters and training $1.8\times$ faster than state-of-the-art models.
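
For readers unfamiliar with the term, a conditional autoregressive model can be summarized by the generic factorization below. This is a textbook sketch, not the paper's notation: $x$ (the sensor input, e.g. camera images), $y_1, \dots, y_T$ (a sequence of discrete tokens encoding the BEV layout), $\theta$ (the model parameters), and the ordering of the tokens are all assumptions made here for illustration.

$$
p_\theta(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x)
$$

Because each element is predicted given the elements already generated, a model of this form can in principle learn dependencies such as a crosswalk following a stop line.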

Cite

Text

Kwag et al. "ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model." International Conference on Learning Representations, 2026.

Markdown

[Kwag et al. "ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/kwag2026iclr-arinbev/)

BibTeX

@inproceedings{kwag2026iclr-arinbev,
  title     = {{ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model}},
  author    = {Kwag, Jiyong and Toth, Charles and Yilmaz, Alper},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kwag2026iclr-arinbev/}
}