EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Abstract
Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand–object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks.
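To make the two-stage curriculum concrete, the sketch below illustrates the kind of verifiable reward an RFT stage could use to score a rollout's temporal localization against ground truth. This is a minimal illustration, not the authors' code: the function names, the choice of temporal IoU as the reward, and the example values are assumptions for exposition only.

```python
# Minimal sketch (not the authors' implementation): a temporal-IoU style reward
# that an RFT stage for spatio-temporal localization might optimize. All names
# and numbers here are illustrative assumptions.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals (start, end), in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def localization_reward(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Reward a predicted segment: 1.0 at perfect overlap, 0.0 at no overlap."""
    return temporal_iou(pred, gt)


if __name__ == "__main__":
    # Hypothetical example: model localizes an interaction at 12.0s-18.5s,
    # ground-truth annotation is 11.0s-17.0s.
    print(localization_reward((12.0, 18.5), (11.0, 17.0)))  # ~0.667
```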
Cite
Text
Pei et al. "EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT." Advances in Neural Information Processing Systems, 2025.
Markdown
[Pei et al. "EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/pei2025neurips-egothinker/)
BibTeX
@inproceedings{pei2025neurips-egothinker,
  title = {{EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT}},
  author = {Pei, Baoqi and Huang, Yifei and Xu, Jilan and He, Yuping and Chen, Guo and Wu, Fei and Pang, Jiangmiao and Qiao, Yu},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/pei2025neurips-egothinker/}
}