Attention-Based Multi-Modal Fusion Network for Semantic Scene Completion
Abstract
This paper presents an end-to-end 3D convolutional network, the attention-based multi-modal fusion network (AMFNet), for the semantic scene completion (SSC) task of inferring the occupancy and semantic labels of a volumetric 3D scene from a single-view RGB-D image. In contrast to previous methods, which use only the semantic features extracted from RGB-D images, the proposed AMFNet learns to perform effective 3D scene completion and semantic segmentation simultaneously by leveraging the experience of inferring 2D semantic segmentation from RGB-D images as well as reliable depth cues in the spatial dimension. This is achieved with a multi-modal fusion architecture boosted from 2D semantic segmentation and a 3D semantic completion network empowered by residual attention blocks. We validate our method on both the synthetic SUNCG-RGBD dataset and the real NYUv2 dataset; the results show that our method achieves gains of 2.5% on SUNCG-RGBD and 2.6% on NYUv2 over the state-of-the-art method.
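The abstract names residual attention blocks as the core component of the 3D completion network but does not describe their internals. Below is a minimal PyTorch sketch of one plausible form, a 3D residual unit whose residual path is gated by squeeze-and-excitation style channel attention; the class name, layer choices, and hyperparameters (channel count, reduction ratio, kernel size) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a 3D residual attention block. The channel-attention
# gating and all sizes are assumptions, not the block design from the paper.
import torch
import torch.nn as nn


class ResidualAttentionBlock3D(nn.Module):
    """3D residual unit whose residual branch is re-weighted per channel
    by an attention gate before being added back to the identity path."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
        )
        # Channel attention: global average pool -> bottleneck -> sigmoid.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.body(x)
        residual = residual * self.attention(residual)  # gate the residual path
        return self.relu(x + residual)


if __name__ == "__main__":
    # Toy volumetric feature map: batch 1, 32 channels, 60x36x60 voxel grid.
    features = torch.randn(1, 32, 60, 36, 60)
    block = ResidualAttentionBlock3D(channels=32)
    print(block(features).shape)  # torch.Size([1, 32, 60, 36, 60])
```

Because the gate multiplies only the residual branch, the identity path is preserved and the block degrades gracefully to a plain residual unit when the attention weights saturate near one.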
Cite
Text
Li et al. "Attention-Based Multi-Modal Fusion Network for Semantic Scene Completion." AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/AAAI.V34I07.6803
Markdown
[Li et al. "Attention-Based Multi-Modal Fusion Network for Semantic Scene Completion." AAAI Conference on Artificial Intelligence, 2020.](https://mlanthology.org/aaai/2020/li2020aaai-attention-a/) doi:10.1609/AAAI.V34I07.6803
BibTeX
@inproceedings{li2020aaai-attention-a,
title = {{Attention-Based Multi-Modal Fusion Network for Semantic Scene Completion}},
author = {Li, Siqi and Zou, Changqing and Li, Yipeng and Zhao, Xibin and Gao, Yue},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2020},
pages = {11402--11409},
doi = {10.1609/AAAI.V34I07.6803},
url = {https://mlanthology.org/aaai/2020/li2020aaai-attention-a/}
}