RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction
Abstract
The multi-modal 3D semantic occupancy task provides a comprehensive understanding of the scene and has received considerable attention in the field of autonomous driving. However, existing methods mainly process large-scale voxels, which incurs high computational cost and degrades fine detail, and they struggle to accurately capture occluded targets and distant information. In this paper, we propose RIOcc, a novel LiDAR-camera 3D semantic occupancy prediction framework built on collaborative feature refinement and a multi-scale cross-modal fusion transformer. Specifically, RIOcc encodes multi-modal data into a unified Bird's Eye View (BEV) space, which reduces computational complexity and improves the efficiency of feature alignment. Multi-scale feature processing then substantially expands the receptive field. In the LiDAR branch, we design a Dual-branch Pooling (DBP) module to adaptively enhance geometric features across both the channel and grid dimensions. In the camera branch, Wavelet and Semantic Encoders are developed to extract high-level semantic features with abundant edge and structural information. Finally, to facilitate effective cross-modal complementarity, we develop the Deformable Dual-Attention (DDA) module. Extensive experiments demonstrate that RIOcc achieves state-of-the-art performance, with 54.2 mIoU and 25.9 mIoU on the Occ3D-nuScenes and nuScenes-Occupancy datasets, respectively.
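The abstract gives no implementation details, so the following is only a minimal PyTorch-style sketch of what a dual-branch (channel + grid) pooling block over BEV features could look like. The module name `DualBranchPoolingSketch`, the layer sizes, and the CBAM-like gating design are assumptions made for illustration; they are not the authors' actual DBP module.

```python
# Illustrative sketch only: gating a LiDAR BEV feature map along the channel
# dimension and the spatial (grid) dimension. The CBAM-style design below is
# an assumption for illustration, not RIOcc's published DBP implementation.
import torch
import torch.nn as nn


class DualBranchPoolingSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel branch: global average pool -> bottleneck MLP -> per-channel gate
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Grid branch: channel-wise mean/max maps -> conv -> per-cell gate
        self.grid_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: [B, C, H, W] LiDAR BEV feature map
        channel_gate = self.channel_mlp(bev)              # [B, C, 1, 1]
        grid_stats = torch.cat(
            [bev.mean(dim=1, keepdim=True), bev.amax(dim=1, keepdim=True)],
            dim=1,
        )                                                 # [B, 2, H, W]
        grid_gate = self.grid_conv(grid_stats)            # [B, 1, H, W]
        return bev * channel_gate * grid_gate             # refined BEV features


if __name__ == "__main__":
    x = torch.randn(2, 64, 128, 128)
    print(DualBranchPoolingSketch(64)(x).shape)  # torch.Size([2, 64, 128, 128])
```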
Cite
Text
Fan et al. "RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction." International Conference on Computer Vision, 2025.
Markdown
[Fan et al. "RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/fan2025iccv-riocc/)
BibTeX
@inproceedings{fan2025iccv-riocc,
  title = {{RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction}},
  author = {Fan, Baojie and Li, Xiaotian and Zhou, Yuhan and Jiang, Yuyu and Tian, Jiandong and Fan, Huijie},
  booktitle = {International Conference on Computer Vision},
  year = {2025},
  pages = {25851-25861},
  url = {https://mlanthology.org/iccv/2025/fan2025iccv-riocc/}
}