World4Drive: End-to-End Autonomous Driving via Intention-Aware Physical Latent World Model

Abstract

End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions, and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.0% relative reduction in L2 error, a 46.7% lower collision rate, and 3.75× faster training convergence. Code is available at https://github.com/ucaszyp/World4Drive.
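The selection step described above (predict intention-driven future latent states for each candidate trajectory, then pick the trajectory whose prediction best matches the observed future) can be sketched in miniature. This is an illustrative toy, not the paper's implementation: `predict_future_latent` stands in for the learned latent world model, and all shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_future_latent(current_latent, trajectory):
    # Hypothetical stand-in for the learned intention-conditioned latent
    # dynamics: roll the current latent forward under a candidate trajectory.
    return current_latent + 0.1 * trajectory.sum(axis=0)

def select_trajectory(current_latent, candidates, actual_future_latent):
    # World-model-selector idea: score each candidate trajectory by how
    # closely its predicted future latent matches the observed future latent,
    # and return the index of the best-matching candidate.
    errors = [
        np.linalg.norm(predict_future_latent(current_latent, traj)
                       - actual_future_latent)
        for traj in candidates
    ]
    return int(np.argmin(errors))

# Toy data: 3 candidate trajectories, each 6 waypoints in a 4-d latent space.
current = rng.normal(size=4)
candidates = [rng.normal(size=(6, 4)) for _ in range(3)]
# Pretend the observed future is the one candidate 1 would produce.
actual_future = predict_future_latent(current, candidates[1])

print(select_trajectory(current, candidates, actual_future))  # → 1
```

The self-supervised signal in the paper plays the role of `actual_future_latent` here: no perception labels are needed, only the agreement between predicted and actually observed future representations.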

Cite

Text

Zheng et al. "World4Drive: End-to-End Autonomous Driving via Intention-Aware Physical Latent World Model." International Conference on Computer Vision, 2025.

Markdown

[Zheng et al. "World4Drive: End-to-End Autonomous Driving via Intention-Aware Physical Latent World Model." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zheng2025iccv-world4drive/)

BibTeX

@inproceedings{zheng2025iccv-world4drive,
  title     = {{World4Drive: End-to-End Autonomous Driving via Intention-Aware Physical Latent World Model}},
  author    = {Zheng, Yupeng and Yang, Pengxuan and Xing, Zebin and Zhang, Qichao and Zheng, Yuhang and Gao, Yinfeng and Li, Pengfei and Zhang, Teng and Xia, Zhongpu and Jia, Peng and Lang, XianPeng and Zhao, Dongbin},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {28632--28642},
  url       = {https://mlanthology.org/iccv/2025/zheng2025iccv-world4drive/}
}