Visually Descriptive Language Model for Vector Graphics Reasoning
Abstract
Despite significant advancements, current large multimodal models (LMMs) struggle to bridge the gap between low-level visual perception—focusing on shapes, sizes, and layouts—and high-level language reasoning involving semantics, events, and logic. This limitation becomes evident in tasks requiring precise visual perception, such as comparing geometric properties or solving visual algorithmic reasoning problems. To study this failure mode, we focus on an important visual domain: vector graphics—images composed purely of 2D objects and shapes, which are prevalent in Web, PC, and Mobile environments. Importantly, we consider rasterized vector graphics without assuming access to their underlying vector code. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To accurately capture low-level visual details, we explore using SVG for the precise encoding of visual scenes. However, SVGs are not readily interpretable by LLMs or LMMs in a zero-shot manner. To address this challenge, we propose the Visually Descriptive Language Model (VDLM) to build a bridge between low-level visual perception and high-level language reasoning. VDLM learns an intermediate symbolic representation called Primal Visual Description (PVD), which translates raw SVGs into a higher-level abstraction comprising primitive attributes. This abstraction allows for direct interpretation by foundation models for zero-shot generalization to different reasoning tasks. Without any human-annotated data, VDLM leads to significant improvements in state-of-the-art LMMs, such as GPT-4o, across various low-level multimodal perception and reasoning tasks on rasterized vector graphics. Additionally, we provide extensive analyses of VDLM's performance, showing that our framework offers improved interpretability due to its disentangled perception and reasoning processes.
As the first attempt to construct a descriptive intermediate representation for low-level visual reasoning, we also conduct an in-depth error analysis, highlighting remaining limitations and suggesting directions for future research.
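To make the idea of a descriptive intermediate representation concrete, the following is a minimal illustrative sketch of translating low-level geometry into a symbolic, text-form scene description that a language model could reason over. All field names and the overall structure are hypothetical for illustration; they are not the paper's actual PVD format, and the input here is hand-specified primitives rather than parsed SVG.

```python
import json
import math

# Hypothetical PVD-style descriptors: each primitive is reduced to a small
# set of symbolic attributes (type, position, size, derived quantities).
def describe_circle(cx, cy, r):
    return {"type": "circle", "center": [cx, cy], "radius": r,
            "area": round(math.pi * r * r, 2)}

def describe_rect(x, y, w, h):
    return {"type": "rectangle", "top_left": [x, y], "width": w,
            "height": h, "area": w * h}

def to_text_description(shapes):
    # Serialize the symbolic scene as JSON text, ready to be placed in an
    # LLM prompt for downstream zero-shot reasoning (e.g., comparing areas).
    return json.dumps(shapes, indent=2)

scene = [describe_circle(50, 50, 10), describe_rect(20, 20, 30, 15)]
print(to_text_description(scene))
```

With such a representation, a reasoning question like "which shape is larger?" reduces to comparing the `area` fields in plain text, which is the kind of disentangled perception-then-reasoning pipeline the abstract describes.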
Cite
Text

Wang et al. "Visually Descriptive Language Model for Vector Graphics Reasoning." Transactions on Machine Learning Research, 2025.

Markdown

[Wang et al. "Visually Descriptive Language Model for Vector Graphics Reasoning." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/wang2025tmlr-visually/)

BibTeX
@article{wang2025tmlr-visually,
  title = {{Visually Descriptive Language Model for Vector Graphics Reasoning}},
  author = {Wang, Zhenhailong and Hsu, Joy and Wang, Xingyao and Huang, Kuan-Hao and Li, Manling and Wu, Jiajun and Ji, Heng},
  journal = {Transactions on Machine Learning Research},
  year = {2025},
  url = {https://mlanthology.org/tmlr/2025/wang2025tmlr-visually/}
}