Explore and Tell: Embodied Visual Captioning in 3D Environments
Abstract
While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at https://aim3-ruc.github.io/ExploreAndTell.
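To make the cascade structure concrete, below is a minimal, hypothetical sketch of a "navigate, then caption" pipeline. The Environment, Navigator, and Captioner classes and their interfaces are illustrative stand-ins, not the actual CaBOT implementation; the point is only that the navigator produces a trajectory of views and the captioner conditions on that whole trajectory rather than a single image.

```python
# Hypothetical sketch of a cascade "navigate then caption" pipeline.
# All classes below are illustrative stubs, not the paper's CaBOT code.

from dataclasses import dataclass
from typing import List


@dataclass
class Environment:
    """Toy 3D scene: returns a placeholder observation per step."""
    max_steps: int = 8

    def observe(self, step: int) -> str:
        return f"view_{step}"


class Navigator:
    """Picks the next action from the current observation (stub policy)."""
    ACTIONS = ("forward", "turn_left", "turn_right")

    def act(self, observation: str, step: int, max_steps: int) -> str:
        # Stop once the step budget is exhausted; otherwise keep exploring.
        if step >= max_steps - 1:
            return "stop"
        return self.ACTIONS[step % len(self.ACTIONS)]


class Captioner:
    """Generates a paragraph from the whole trajectory of observations."""

    def describe(self, trajectory: List[str]) -> str:
        views = ", ".join(trajectory)
        return f"A scene described from {len(trajectory)} viewpoints: {views}."


def embodied_captioning(env: Environment, nav: Navigator, cap: Captioner) -> str:
    trajectory: List[str] = []
    for step in range(env.max_steps):
        obs = env.observe(step)
        trajectory.append(obs)  # keep every intermediate view
        if nav.act(obs, step, env.max_steps) == "stop":
            break
    # The captioner conditions on the full trajectory, not just the last view.
    return cap.describe(trajectory)


if __name__ == "__main__":
    print(embodied_captioning(Environment(), Navigator(), Captioner()))
```

In this cascade design, navigation finishes before captioning begins, so the captioner can attend to all gathered viewpoints at once when composing the paragraph.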
Cite
Text
Hu et al. "Explore and Tell: Embodied Visual Captioning in 3D Environments." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00235
Markdown
[Hu et al. "Explore and Tell: Embodied Visual Captioning in 3D Environments." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/hu2023iccv-explore/) doi:10.1109/ICCV51070.2023.00235
BibTeX
@inproceedings{hu2023iccv-explore,
title = {{Explore and Tell: Embodied Visual Captioning in 3D Environments}},
author = {Hu, Anwen and Chen, Shizhe and Zhang, Liang and Jin, Qin},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {2482-2491},
doi = {10.1109/ICCV51070.2023.00235},
url = {https://mlanthology.org/iccv/2023/hu2023iccv-explore/}
}