Jointly Training Large Autoregressive Multimodal Models
Abstract
In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
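The abstract does not describe the fusion mechanism itself. As a purely illustrative sketch, and not the paper's actual recipe, one simple baseline for fusing two same-architecture autoregressive checkpoints is elementwise parameter interpolation; the `text_state_dict` and `image_state_dict` inputs below are hypothetical names for the two pretrained models' weights.

```python
import torch

def average_fuse(text_state_dict, image_state_dict, alpha=0.5):
    """Fuse two same-architecture decoder-only checkpoints by
    parameter interpolation (an illustrative fusion baseline,
    not necessarily the JAM procedure)."""
    fused = {}
    for name, text_param in text_state_dict.items():
        image_param = image_state_dict[name]
        # Weighted average of corresponding parameters from each model.
        fused[name] = alpha * text_param + (1.0 - alpha) * image_param
    return fused
```

With `alpha = 1.0` this recovers the text model and with `alpha = 0.0` the image model; intermediate values trade off between the two, which is why interpolation is a common starting point when merging pretrained checkpoints.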
Cite
Text
Aiello et al. "Jointly Training Large Autoregressive Multimodal Models." International Conference on Learning Representations, 2024.
Markdown
[Aiello et al. "Jointly Training Large Autoregressive Multimodal Models." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/aiello2024iclr-jointly/)
BibTeX
@inproceedings{aiello2024iclr-jointly,
  title = {{Jointly Training Large Autoregressive Multimodal Models}},
  author = {Aiello, Emanuele and Yu, Lili and Nie, Yixin and Aghajanyan, Armen and Oguz, Barlas},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  url = {https://mlanthology.org/iclr/2024/aiello2024iclr-jointly/}
}