LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Abstract
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant), which leverages event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatiotemporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatiotemporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frame-event data with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
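To make the fusion step in the abstract concrete, below is a minimal sketch of cross-attention between frame tokens and event tokens followed by global self-attention over the fused sequence. It is an illustration under assumptions, not the authors' released implementation: the class name `FrameEventFusion`, the token dimensions, and the residual/normalization choices are all hypothetical.

```python
import torch
import torch.nn as nn


class FrameEventFusion(nn.Module):
    """Hypothetical sketch: cross-attention integrates complementary frame
    (spatially dense) and event (temporally dense) features, then
    self-attention captures global spatiotemporal associations."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: frame tokens attend to event tokens and vice versa.
        self.frame_to_event = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.event_to_frame = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention over the concatenated (fused) token sequence.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor, event_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, N_f, dim); event_tokens: (B, N_e, dim)
        f_enriched, _ = self.frame_to_event(frame_tokens, event_tokens, event_tokens)
        e_enriched, _ = self.event_to_frame(event_tokens, frame_tokens, frame_tokens)
        # Concatenate residually enriched streams, then model global associations.
        fused = torch.cat([frame_tokens + f_enriched, event_tokens + e_enriched], dim=1)
        out, _ = self.global_attn(fused, fused, fused)
        return self.norm(fused + out)


if __name__ == "__main__":
    fusion = FrameEventFusion()
    frames = torch.randn(2, 196, 256)  # e.g., patch tokens from video frames
    events = torch.randn(2, 512, 256)  # e.g., tokens from an event-stream encoder
    print(fusion(frames, events).shape)  # torch.Size([2, 708, 256])
```

In the paper's pipeline, the fused tokens would then be aligned with textual position and duration tokens before being passed to the LMM; that alignment stage is omitted here.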
Cite
Text
Zhou and Lee. "LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs." International Conference on Computer Vision, 2025.
Markdown
[Zhou and Lee. "LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhou2025iccv-llafea/)
BibTeX
@inproceedings{zhou2025iccv-llafea,
title = {{LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs}},
author = {Zhou, Hanyu and Lee, Gim Hee},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {22294-22304},
url = {https://mlanthology.org/iccv/2025/zhou2025iccv-llafea/}
}