LogitGaze: Predicting Human Attention Using Semantic Information from Vision-Language Models
Abstract
Modeling human scanpaths remains a challenging task due to the complexity of visual attention dynamics. Traditional approaches rely on low-level visual features, but they often fail to capture the semantic and contextual factors that guide human gaze. To address this, we propose a novel method that integrates large language models (LLMs) and vision-language models (VLMs) to enrich scanpath prediction with semantic priors. By leveraging word-level representations extracted through interpretability tools such as the logit lens, our approach aligns spatio-temporal gaze patterns with high-level scene semantics. Our method establishes a new state of the art, improving all key scanpath prediction metrics by approximately 15% on average, demonstrating the effectiveness of integrating linguistic and visual knowledge for enhanced gaze modeling.
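The abstract refers to the logit-lens technique for reading word-level semantics out of intermediate layers. The sketch below illustrates the generic logit-lens idea only: each layer's hidden state is passed through the model's final layer norm and unembedding matrix to obtain per-layer token predictions. It uses GPT-2 as a stand-in model and an arbitrary prompt; the actual VLM, prompts, and gaze-alignment pipeline used in the paper are not shown here and the specifics below are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal logit-lens sketch (not the paper's pipeline): project each layer's
# hidden state through the model's unembedding to get per-layer token guesses.
# GPT-2 is an assumed stand-in for the vision-language model used in the paper.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical prompt; in the paper the input would come from a VLM over an image.
text = "A person looks at the red car parked near the"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden].
for layer_idx, hidden in enumerate(outputs.hidden_states):
    # Apply the final layer norm and the unembedding (lm_head) to the last token.
    normed = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(normed)
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d}: top next-token guess = {top_token!r}")
```

In the paper's setting, such per-layer token distributions would supply the semantic priors that are then aligned with spatial gaze data; how that alignment is performed is not specified in this excerpt.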
Cite
Text
Lvov and Pershin. "LogitGaze: Predicting Human Attention Using Semantic Information from Vision-Language Models." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.
Markdown
[Lvov and Pershin. "LogitGaze: Predicting Human Attention Using Semantic Information from Vision-Language Models." ICLR 2025 Workshops: LLM_Reason_and_Plan, 2025.](https://mlanthology.org/iclrw/2025/lvov2025iclrw-logitgaze/)
BibTeX
@inproceedings{lvov2025iclrw-logitgaze,
title = {{LogitGaze: Predicting Human Attention Using Semantic Information from Vision-Language Models}},
author = {Lvov, Dmitry and Pershin, Ilya},
booktitle = {ICLR 2025 Workshops: LLM_Reason_and_Plan},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/lvov2025iclrw-logitgaze/}
}