DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, with existing work focusing primarily on single-temporal or bi-temporal imagery. To address this gap, we introduce **DVL-Suite**, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 14,871 high-resolution (1.0m) multi-temporal images spanning 42 major cities in the U.S. from 2005 to 2023, organized into two components: **DVL-Bench** and **DVL-Instruct**. The *DVL-Bench* includes six urban understanding tasks, ranging from fundamental change detection (*pixel-level*) to quantitative analyses (*regional-level*) and comprehensive urban narratives (*scene-level*), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 18 state-of-the-art MLLMs and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of *DVL-Instruct*, a specialized instruction-tuning dataset designed to enhance models' capabilities in multi-temporal Earth observation. Building upon this dataset, we develop **DVLChat**, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions. Project: [https://github.com/weihao1115/dynamicvl](https://github.com/weihao1115/dynamicvl).
Cite
Text
Xuan et al. "DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding." Advances in Neural Information Processing Systems, 2025.
Markdown
[Xuan et al. "DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/xuan2025neurips-dynamicvl/)
BibTeX
@inproceedings{xuan2025neurips-dynamicvl,
  title = {{DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding}},
  author = {Xuan, Weihao and Wang, Junjue and Qi, Heli and Chen, Zihang and Zheng, Zhuo and Zhong, Yanfei and Xia, Junshi and Yokoya, Naoto},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/xuan2025neurips-dynamicvl/}
}