Generating Multimodal Driving Scenes via Next-Scene Prediction

Abstract

Generative models in Autonomous Driving (AD) enable diverse scenario creation, yet existing methods fall short because they capture only a limited range of modalities, which restricts their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of the map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements. Project page: https://yanhaowu.github.io/UMGen/
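
To make the idea behind action-aware map alignment concrete, the following is a minimal sketch, not the paper's implementation. It assumes the ego-action can be summarized as a planar displacement and heading change `(dx, dy, dyaw)` expressed in the previous ego frame, and that the map is a list of 2D points in that same frame; the abstract does not specify either representation, and the actual framework operates on tokenized modalities. The function name `align_map_to_ego` is hypothetical. Under those assumptions, re-expressing the map in the new ego frame is a rigid transform: subtract the translation, then rotate by the negative heading change.

```python
import math


def align_map_to_ego(points, dx, dy, dyaw):
    """Re-express 2D map points in the ego frame after the ego has moved.

    Hypothetical helper for illustration only. Assumes `points` are (x, y)
    pairs in the previous ego frame (x forward, y left) and that the ego
    moved by (dx, dy) and rotated by `dyaw` radians, both measured in that
    previous frame.
    """
    cos_y, sin_y = math.cos(dyaw), math.sin(dyaw)
    aligned = []
    for x, y in points:
        # Shift the origin to the new ego position.
        tx, ty = x - dx, y - dy
        # Rotate by -dyaw so the axes match the new ego heading.
        aligned.append((cos_y * tx + sin_y * ty, -sin_y * tx + cos_y * ty))
    return aligned


# Ego drives 1 m straight ahead: a point 5 m ahead is now 4 m ahead.
moved = align_map_to_ego([(5.0, 0.0)], dx=1.0, dy=0.0, dyaw=0.0)

# Ego turns 90 degrees left in place: a point that was ahead is now on the right.
turned = align_map_to_ego([(1.0, 0.0)], dx=0.0, dy=0.0, dyaw=math.pi / 2)
```

Applying the same transform to every map element at each step keeps the static map consistent with the predicted ego motion, which is the kind of coherence the AMA module is described as maintaining.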

Cite

Text

Wu et al. "Generating Multimodal Driving Scenes via Next-Scene Prediction." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00642

Markdown

[Wu et al. "Generating Multimodal Driving Scenes via Next-Scene Prediction." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wu2025cvpr-generating/) doi:10.1109/CVPR52734.2025.00642

BibTeX

@inproceedings{wu2025cvpr-generating,
  title     = {{Generating Multimodal Driving Scenes via Next-Scene Prediction}},
  author    = {Wu, Yanhao and Zhang, Haoyang and Lin, Tianwei and Huang, Lichao and Luo, Shujie and Wu, Rui and Qiu, Congpei and Ke, Wei and Zhang, Tong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {6844-6853},
  doi       = {10.1109/CVPR52734.2025.00642},
  url       = {https://mlanthology.org/cvpr/2025/wu2025cvpr-generating/}
}