SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain

Abstract

In scientific publications, a substantial part of the information is expressed via figures containing images and diagrams. Hence, the retrieval of relevant figures given a natural language query is an important real-world task. However, due to the lack of training and evaluation data, most existing approaches are either limited to one modality or focus on non-scientific domains, making their application to scientific publications challenging. In this paper, we address this gap by introducing two novel datasets: (1) SciOL, the largest openly licensed pre-training corpus for multimodal models in the scientific domain, covering multiple sciences including materials science, physics, and computer science, and (2) MuLMS-Img, a high-quality dataset in the materials science domain, manually annotated for various image-text tasks. Our experiments show that pre-training large-scale vision-language models on SciOL increases performance considerably across a broad variety of image-text tasks including figure type classification, optical character recognition, captioning, and figure retrieval. Using MuLMS-Img, we show that integrating text-based features extracted via a fine-tuned model for a specific domain can boost cross-modal scientific figure retrieval performance by up to 50%.

Cite

Text

Tarsi et al. "SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain." Winter Conference on Applications of Computer Vision, 2024.

Markdown

[Tarsi et al. "SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/tarsi2024wacv-sciol/)

BibTeX

@inproceedings{tarsi2024wacv-sciol,
  title     = {{SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain}},
  author    = {Tarsi, Tim and Adel, Heike and Metzen, Jan Hendrik and Zhang, Dan and Finco, Matteo and Friedrich, Annemarie},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2024},
  pages     = {4560--4571},
  url       = {https://mlanthology.org/wacv/2024/tarsi2024wacv-sciol/}
}