DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving
Abstract
Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision--3D point cloud forecasting, 2D semantic representation, and image generation--to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.
Cite
Text
Shi et al. "DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving." International Conference on Computer Vision, 2025.
Markdown
[Shi et al. "DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/shi2025iccv-drivex/)
BibTeX
@inproceedings{shi2025iccv-drivex,
title = {{DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving}},
author = {Shi, Chen and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {28599--28609},
url = {https://mlanthology.org/iccv/2025/shi2025iccv-drivex/}
}