Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models

Dang, Jisheng; Chen, Ligen; Wu, Jingze; Lin, Ronghao; Wang, Bimei; Wang, Yun; Wang, Liting; Zhu, Nannan; Wang, Teng

doi:10.24963/IJCAI.2025/98

IJCAI 2025 pp. 873-881

Abstract

Dynamic spatio-temporal understanding is essential for video-based multimodal tasks, yet existing methods often struggle to capture fine-grained temporal and spatial relationships in long videos. Current approaches primarily rely on pre-trained CLIP encoders, which excel in semantic understanding but lack spatially-aware visual context. This leads to hallucinated results when interpreting fine-grained objects or scenes. To address these limitations, we propose a novel framework that integrates diffusion models into multimodal video models. By employing diffusion encoders at intermediate layers, we enhance visual representations through feature alignment and knowledge distillation losses, significantly improving the model's ability to capture spatial patterns over time. Additionally, we introduce a multi-level alignment strategy to learn robust feature correspondence from pre-trained diffusion models. Extensive experiments on benchmark datasets demonstrate our approach's state-of-the-art performance across multiple video understanding tasks. These results establish diffusion models as a powerful tool for enhancing multimodal video models in complex, dynamic scenarios.
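The abstract names feature alignment and knowledge distillation losses applied at multiple levels but does not give their form. The following is a minimal sketch of one way such an objective could be combined, not the authors' implementation. It assumes a cosine-distance alignment term and a mean-squared-error distillation term, averaged over levels; the function names, the two weights, and the choice of both loss forms are hypothetical and chosen only for illustration.

```python
import math

# Illustrative sketch only: the loss forms (cosine distance, MSE) and all
# names below are assumptions, not taken from the paper.

def cosine_alignment_loss(student, teacher):
    """Return 1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    # Small epsilon guards against division by zero for all-zero vectors.
    return 1.0 - dot / (norm_s * norm_t + 1e-8)

def mse_distillation_loss(student, teacher):
    """Return the mean squared error between student and teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def multi_level_loss(student_levels, teacher_levels,
                     align_weight=1.0, distill_weight=1.0):
    """Average the weighted alignment and distillation terms over all levels.

    Each element of student_levels / teacher_levels is one feature vector,
    standing in for the features taken at one intermediate layer.
    """
    total = 0.0
    for student, teacher in zip(student_levels, teacher_levels):
        total += align_weight * cosine_alignment_loss(student, teacher)
        total += distill_weight * mse_distillation_loss(student, teacher)
    return total / len(student_levels)
```

In this sketch the teacher features would come from a frozen pre-trained diffusion model and the student features from the video model's visual branch. Real features would be tensors rather than Python lists, and would typically need a projection layer to match dimensions before comparison.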

Cite

Text

Dang et al. "Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/98

Markdown

[Dang et al. "Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/dang2025ijcai-diff/) doi:10.24963/IJCAI.2025/98

BibTeX

@inproceedings{dang2025ijcai-diff,
  title     = {{Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models}},
  author    = {Dang, Jisheng and Chen, Ligen and Wu, Jingze and Lin, Ronghao and Wang, Bimei and Wang, Yun and Wang, Liting and Zhu, Nannan and Wang, Teng},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {873-881},
  doi       = {10.24963/IJCAI.2025/98},
  url       = {https://mlanthology.org/ijcai/2025/dang2025ijcai-diff/}
}