Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

Abstract

Multimodal large language models (MLLMs) have made rapid progress in recent years, yet they continue to struggle with low-level visual perception (LLVP), particularly the ability to accurately describe the geometric details of an image. In this paper, we first demonstrate this limitation by introducing Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. We then conduct a comprehensive empirical study to explore strategies for improving LLVP performance through the use of synthetic high-fidelity visual description data. Our findings highlight the benefits of certain model architectures and training techniques, including the use of CNN-based visual encoders and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks that they fail to learn from scratch. Lastly, we develop *Euclid*, a family of models specifically optimized for strong low-level geometric perception. Although trained on synthetic multimodal data, Euclid shows strong generalization to novel real-world geometric shapes. For instance, Euclid outperforms the best closed-source model by up to 58.56% on certain Geoperception tasks and by 10.65% on average across all tasks.
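To make the notion of "synthetic high-fidelity visual description data" concrete, the sketch below shows one way such image-description pairs could be produced: render a simple 2D figure programmatically and pair it with a textual description that is exact by construction. This is a minimal illustration under assumed conventions (point labels, description template, `difficulty` tag), not the authors' actual data pipeline.

```python
# Hypothetical sketch: generate a synthetic geometric image together with an
# exact textual description of its geometry. Shapes, labels, and the output
# format are illustrative assumptions, not the paper's actual pipeline.
import math
import os
import random

import matplotlib.pyplot as plt


def make_example(idx: int, out_dir: str = "synthetic_geo") -> dict:
    os.makedirs(out_dir, exist_ok=True)

    # Sample two labeled points and connect them with a segment.
    lo, hi = 0.0, 10.0
    (x1, y1), (x2, y2) = [(random.uniform(lo, hi), random.uniform(lo, hi))
                          for _ in range(2)]
    length = math.hypot(x2 - x1, y2 - y1)
    midpoint = ((x1 + x2) / 2, (y1 + y2) / 2)

    # Render the figure.
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot([x1, x2], [y1, y2], color="black", linewidth=2)
    ax.scatter([x1, x2], [y1, y2], color="black", s=20)
    ax.text(x1, y1, "A", fontsize=12)
    ax.text(x2, y2, "B", fontsize=12)
    ax.set_xlim(lo - 1, hi + 1)
    ax.set_ylim(lo - 1, hi + 1)
    ax.set_axis_off()
    image_path = os.path.join(out_dir, f"segment_{idx:05d}.png")
    fig.savefig(image_path, dpi=150, bbox_inches="tight")
    plt.close(fig)

    # The description is derived from the same parameters used to draw the
    # image, so the supervision is exact ("high fidelity") by construction.
    description = (
        f"The image shows a line segment AB. A is at ({x1:.1f}, {y1:.1f}) "
        f"and B is at ({x2:.1f}, {y2:.1f}). AB has length {length:.2f} and "
        f"midpoint ({midpoint[0]:.1f}, {midpoint[1]:.1f})."
    )
    return {"image": image_path, "text": description, "difficulty": 1}
```

Because every example carries a known generating configuration, a data curriculum of the kind mentioned in the abstract could be implemented by tagging examples with a difficulty level (e.g., number of shapes or annotations) and introducing harder levels only in later training stages.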

Cite

Text

Zhang et al. "Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions." ICLR 2025 Workshops: SynthData, 2025.

Markdown

[Zhang et al. "Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions." ICLR 2025 Workshops: SynthData, 2025.](https://mlanthology.org/iclrw/2025/zhang2025iclrw-euclid/)

BibTeX

@inproceedings{zhang2025iclrw-euclid,
  title     = {{Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions}},
  author    = {Zhang, Jiarui and Liu, Ollie and Yu, Tianyu and Hu, Jinyi and Neiswanger, Willie},
  booktitle = {ICLR 2025 Workshops: SynthData},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/zhang2025iclrw-euclid/}
}