Jointly Training Large Autoregressive Multimodal Models

Abstract

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
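As a rough illustration of what "fusing existing text and image generation models" can mean in practice, the sketch below blends the parameters of two decoders that share an architecture. This is only a minimal, hypothetical example under that shared-architecture assumption; the toy model and function names are illustrative and do not reproduce the specific JAM fusion recipe described in the paper.

# Minimal sketch (assumption: both pretrained decoders share the same
# architecture, so their weights can be blended entry-wise).
# Names such as average_parameters, text_decoder, and image_decoder are
# hypothetical and for illustration only.
import copy
import torch
import torch.nn as nn


def average_parameters(model_a: nn.Module, model_b: nn.Module,
                       alpha: float = 0.5) -> nn.Module:
    """Return a new model whose weights are an alpha-blend of the two inputs."""
    fused = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    fused_state = {
        name: alpha * state_a[name] + (1.0 - alpha) * state_b[name]
        for name in state_a
    }
    fused.load_state_dict(fused_state)
    return fused


# Toy decoder layers standing in for a text model and an image-token model.
text_decoder = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
image_decoder = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)

fused_decoder = average_parameters(text_decoder, image_decoder)
tgt = memory = torch.randn(1, 8, 64)
print(fused_decoder(tgt, memory).shape)  # torch.Size([1, 8, 64])

Parameter averaging is just one simple fusion strategy; the paper itself should be consulted for the fusion variants and the data-efficient instruction-tuning procedure that JAM actually uses.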

Cite

Text

Aiello et al. "Jointly Training Large Autoregressive Multimodal Models." International Conference on Learning Representations, 2024.

Markdown

[Aiello et al. "Jointly Training Large Autoregressive Multimodal Models." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/aiello2024iclr-jointly/)

BibTeX

@inproceedings{aiello2024iclr-jointly,
  title     = {{Jointly Training Large Autoregressive Multimodal Models}},
  author    = {Aiello, Emanuele and Yu, Lili and Nie, Yixin and Aghajanyan, Armen and Oguz, Barlas},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/aiello2024iclr-jointly/}
}