Towards Flexible Multi-Modal Document Models

Abstract

Creative workflows for generating graphical documents involve complex inter-related tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors. In this work, we attempt to build a holistic model that can jointly solve many different design tasks. Our model, which we denote by FlexDM, treats vector graphic documents as a set of multi-modal elements, and learns to predict masked fields such as element type, position, styling attributes, image, or text, using a unified architecture. Through the use of explicit multi-task learning and in-domain pre-training, our model can better capture the multi-modal relationships among the different document fields. Experimental results corroborate that our single FlexDM is able to successfully solve a multitude of different design tasks, while achieving performance that is competitive with task-specific and costly baselines.
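
The sketch below is a minimal illustration (not the authors' code) of the masked-field prediction idea described in the abstract: a document is a set of elements, each element a bag of fields, some fields are masked, and a shared encoder predicts them. Field names, vocabulary sizes, and dimensions are illustrative assumptions; the actual model also handles image and text features, which are omitted here for brevity.

# Minimal sketch of masked-field prediction over a set of document elements.
# All names, sizes, and the categorical-only treatment of fields are assumptions.
import torch
import torch.nn as nn

FIELDS = {"type": 16, "x": 64, "y": 64, "width": 64, "height": 64, "color": 128}
D = 256       # hidden size (assumed)
MASK_ID = 0   # reserved id for masked fields (assumed)

class MaskedFieldModel(nn.Module):
    def __init__(self, fields=FIELDS, d=D, n_layers=4, n_heads=4, max_elements=32):
        super().__init__()
        self.fields = fields
        # one embedding table and one prediction head per field
        self.embed = nn.ModuleDict({f: nn.Embedding(v + 1, d) for f, v in fields.items()})
        self.pos = nn.Embedding(max_elements, d)
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.heads = nn.ModuleDict({f: nn.Linear(d, v + 1) for f, v in fields.items()})

    def forward(self, batch):
        # batch: dict field -> LongTensor [B, N] of (possibly masked) field ids
        B, N = next(iter(batch.values())).shape
        x = sum(self.embed[f](batch[f]) for f in self.fields)      # fuse fields per element
        x = x + self.pos(torch.arange(N, device=x.device))          # element index embedding
        h = self.encoder(x)                                         # attend across elements
        return {f: self.heads[f](h) for f in self.fields}           # per-field logits

if __name__ == "__main__":
    model = MaskedFieldModel()
    doc = {f: torch.randint(1, v + 1, (2, 8)) for f, v in FIELDS.items()}
    doc["color"][:, 0] = MASK_ID  # mask one field and let the model predict it
    logits = model(doc)
    print({f: tuple(t.shape) for f, t in logits.items()})

Training such a model amounts to randomly masking subsets of fields and minimizing a per-field prediction loss, which is how a single network can cover many design tasks (e.g., element placement when positions are masked, or styling when color fields are masked).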

Cite

Text

Inoue et al. "Towards Flexible Multi-Modal Document Models." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01373

Markdown

[Inoue et al. "Towards Flexible Multi-Modal Document Models." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/inoue2023cvpr-flexible/) doi:10.1109/CVPR52729.2023.01373

BibTeX

@inproceedings{inoue2023cvpr-flexible,
  title     = {{Towards Flexible Multi-Modal Document Models}},
  author    = {Inoue, Naoto and Kikuchi, Kotaro and Simo-Serra, Edgar and Otani, Mayu and Yamaguchi, Kota},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {14287--14296},
  doi       = {10.1109/CVPR52729.2023.01373},
  url       = {https://mlanthology.org/cvpr/2023/inoue2023cvpr-flexible/}
}