Reading Recognition in the Wild
Abstract
To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity, and realism. Code, model, and data will be made public.
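The abstract describes a flexible transformer that accepts egocentric RGB, eye gaze, and head pose, either individually or combined. The PyTorch sketch below illustrates one way such a modality-flexible classifier could be structured; the encoder choices, feature dimensions, and token-concatenation fusion are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch (not the paper's implementation) of a modality-flexible
# transformer for binary reading recognition. Dimensions and fusion scheme
# are assumptions for illustration only.
import torch
import torch.nn as nn


class ReadingRecognizer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Per-modality projections: map each input stream to d_model tokens.
        self.rgb_proj = nn.Linear(512, d_model)   # assumes precomputed per-frame RGB features
        self.gaze_proj = nn.Linear(2, d_model)    # 2D gaze point per frame
        self.head_proj = nn.Linear(6, d_model)    # 6-DoF head pose per frame
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)         # reading vs. non-reading logit

    def forward(self, rgb=None, gaze=None, head_pose=None):
        # Any subset of modalities may be supplied; their tokens are concatenated.
        tokens = []
        if rgb is not None:
            tokens.append(self.rgb_proj(rgb))
        if gaze is not None:
            tokens.append(self.gaze_proj(gaze))
        if head_pose is not None:
            tokens.append(self.head_proj(head_pose))
        batch = tokens[0].shape[0]
        cls = self.cls_token.expand(batch, -1, -1)
        x = self.encoder(torch.cat([cls] + tokens, dim=1))
        return self.head(x[:, 0]).squeeze(-1)


# Usage: a batch of 2 clips, 8 frames each, with RGB features and gaze but no head pose.
model = ReadingRecognizer()
logit = model(rgb=torch.randn(2, 8, 512), gaze=torch.randn(2, 8, 2))
```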
Cite
Text
Yang et al. "Reading Recognition in the Wild." Advances in Neural Information Processing Systems, 2025.
Markdown
[Yang et al. "Reading Recognition in the Wild." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yang2025neurips-reading/)
BibTeX
@inproceedings{yang2025neurips-reading,
  title     = {{Reading Recognition in the Wild}},
  author    = {Yang, Charig and Alam, Samiul and Siam, Shakhrul Iman and Proulx, Michael J. and Mathias, Lambert and Somasundaram, Kiran and Pesqueira, Luis and Fort, James and Sheriffdeen, Sheroze and Parkhi, Omkar and Ren, Yuheng and Zhang, Mi and Chai, Yuning and Newcombe, Richard and Kim, Hyo Jin},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/yang2025neurips-reading/}
}