CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Abstract

We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand modality-interleaved instructions and in-context examples, and to autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks such as editing, exemplar learning, composition, and reasoning. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing, and marks a significant advance in integrating diverse multimodal tasks with sequential generation.
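
To make the "modality-interleaved instructions and in-context examples" concrete, the sketch below shows one plausible way such a few-shot prompt could be represented before being handed to an any-to-any model. This is an illustrative assumption, not the paper's implementation; the Segment class, file paths, and render helper are all hypothetical.

# Minimal sketch (assumed, not CoDi-2's actual code) of an interleaved
# multimodal prompt: two exemplar image edits followed by a query image
# that the model should transform in the same way.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Segment:
    """One piece of an interleaved prompt: text, or a path to an image/audio file."""
    modality: Literal["text", "image", "audio"]
    content: str  # text string, or a file path for image/audio

prompt: List[Segment] = [
    Segment("text",  "Apply the edit shown in the examples to the last image."),
    Segment("image", "examples/street_day.png"),    # exemplar input
    Segment("image", "examples/street_night.png"),  # exemplar output
    Segment("image", "examples/cafe_day.png"),      # exemplar input
    Segment("image", "examples/cafe_night.png"),    # exemplar output
    Segment("image", "query/park_day.png"),         # query to transform
]

def render(segments: List[Segment]) -> str:
    """Flatten the interleaved prompt into a single string with modality
    placeholders, as multimodal LLM front ends commonly splice non-text
    inputs into the text stream."""
    parts = []
    for seg in segments:
        if seg.modality == "text":
            parts.append(seg.content)
        else:
            parts.append(f"<{seg.modality}:{seg.content}>")
    return " ".join(parts)

if __name__ == "__main__":
    print(render(prompt))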

Cite

Text

Tang et al. "CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02589

Markdown

[Tang et al. "CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/tang2024cvpr-codi2/) doi:10.1109/CVPR52733.2024.02589

BibTeX

@inproceedings{tang2024cvpr-codi2,
  title     = {{CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation}},
  author    = {Tang, Zineng and Yang, Ziyi and Khademi, Mahmoud and Liu, Yang and Zhu, Chenguang and Bansal, Mohit},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {27425-27434},
  doi       = {10.1109/CVPR52733.2024.02589},
  url       = {https://mlanthology.org/cvpr/2024/tang2024cvpr-codi2/}
}