RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
Abstract
Grounding textual expressions on scene objects from first-person views is a demanding capability for developing agents that are aware of their surroundings and behave following intuitive text instructions. Such a capability is essential for smart-glasses devices and autonomous robots to localize referred objects in the real world. Conventional referring expression comprehension datasets for images, however, are mostly constructed from web-crawled data and do not reflect the diversity of real-world scenes in which textual expressions must be grounded to objects. Recently, the massive-scale egocentric video dataset Ego4D was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, and manufacturing. Based on the egocentric videos of Ego4D, we constructed RefEgo, a broad-coverage video-based referring expression comprehension dataset. Our dataset includes more than 12k video clips and 41 hours of video annotated for video-based referring expression comprehension. In experiments, we combine state-of-the-art 2D referring expression comprehension models with an object tracking algorithm, achieving video-wise tracking of the referred object even in difficult conditions, such as when the referred object moves out of frame in the middle of the video or when multiple similar objects appear in the video.
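The combination of a 2D comprehension model with tracking described above can be sketched as a simple per-frame pipeline: the 2D model proposes a scored box per frame, and an IoU-based association step links detections across frames, tolerating frames where the object is out of view. This is a minimal illustrative sketch, not the paper's implementation; the function names and thresholds are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def track_referred_object(per_frame_detections, score_thresh=0.5, iou_thresh=0.3):
    """Link per-frame (box, score) detections into one object track.

    per_frame_detections: one entry per frame, either a (box, score)
    pair from a frame-level referring expression model, or None when
    the model finds no match (e.g. the object is out of frame).
    Returns a list with one box or None per frame.
    """
    track, prev = [], None
    for det in per_frame_detections:
        if det is None or det[1] < score_thresh:
            track.append(None)          # object absent / low confidence
            continue
        box, _score = det
        # Accept the detection if it is the first one, or if it overlaps
        # the previously accepted box enough to be the same object.
        if prev is None or iou(prev, box) >= iou_thresh:
            track.append(box)
            prev = box
        else:
            track.append(None)          # likely a different, similar object
    return track
```

For example, a low-overlap detection late in the clip (a lookalike object elsewhere in the frame) is rejected rather than linked into the track.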
Cite
Text
Kurita et al. "RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01396

Markdown

[Kurita et al. "RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/kurita2023iccv-refego/) doi:10.1109/ICCV51070.2023.01396

BibTeX
@inproceedings{kurita2023iccv-refego,
title = {{RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D}},
author = {Kurita, Shuhei and Katsura, Naoki and Onami, Eri},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {15214--15224},
doi = {10.1109/ICCV51070.2023.01396},
url = {https://mlanthology.org/iccv/2023/kurita2023iccv-refego/}
}