Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability
Abstract
Understanding what makes a video memorable has important applications in advertising and education technology. Towards this goal, we investigate the spatio-temporal attention mechanisms underlying video memorability. Unlike previous works that fuse multiple features, we adopt a simple CNN+Transformer architecture that enables analysis of spatio-temporal attention while matching state-of-the-art (SoTA) performance on video memorability prediction. We compare model attention against human gaze fixations collected through a small-scale eye-tracking study in which humans perform the video memory task. We uncover the following insights: (i) quantitative saliency metrics show that our model, trained only to predict a memorability score, exhibits spatial attention patterns similar to human gaze, especially for more memorable videos; (ii) the model assigns greater importance to the initial frames of a video, mimicking human attention patterns; and (iii) panoptic segmentation reveals that both the model and humans assign a greater share of attention to "things" and less to "stuff" relative to their occurrence probability.
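The paper does not publish code on this page, but the quantitative saliency comparison described above typically uses standard metrics such as CC (Pearson correlation between a model attention map and a human gaze map) and NSS (normalized scanpath saliency at fixated pixels). The following minimal Python sketch illustrates those two metrics on toy data; the function names, array shapes, and simulated fixations are our own assumptions for illustration, not the authors' implementation.

import numpy as np

def correlation_coefficient(attention: np.ndarray, gaze: np.ndarray) -> float:
    # CC: Pearson correlation between two saliency maps, each
    # normalized to zero mean and unit variance beforehand.
    a = (attention - attention.mean()) / (attention.std() + 1e-8)
    g = (gaze - gaze.mean()) / (gaze.std() + 1e-8)
    return float(np.mean(a * g))

def normalized_scanpath_saliency(attention: np.ndarray, fixations: np.ndarray) -> float:
    # NSS: mean value of the normalized attention map at human
    # fixation locations; `fixations` is a binary map (1 = fixated).
    a = (attention - attention.mean()) / (attention.std() + 1e-8)
    return float(a[fixations > 0].mean())

# Toy example: a hypothetical model attention map (e.g. an attention
# rollout upsampled to frame size) and 50 simulated fixation points.
rng = np.random.default_rng(0)
attn = rng.random((224, 224))
fix = np.zeros((224, 224))
fix[rng.integers(0, 224, 50), rng.integers(0, 224, 50)] = 1

print(f"CC:  {correlation_coefficient(attn, fix):.3f}")
print(f"NSS: {normalized_scanpath_saliency(attn, fix):.3f}")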
Cite
Text
Kumar et al. "Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability." Winter Conference on Applications of Computer Vision, 2025.
Markdown
[Kumar et al. "Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/kumar2025wacv-seeing/)
BibTeX
@inproceedings{kumar2025wacv-seeing,
title = {{Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability}},
author = {Kumar, Prajneya and Khandelwal, Eshika and Tapaswi, Makarand and Sreekumar, Vishnu},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2025},
pages = {2082--2091},
url = {https://mlanthology.org/wacv/2025/kumar2025wacv-seeing/}
}