Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Abstract
We present Unified-IO 2, a multimodal, multi-skill unified model capable of following novel instructions. Unified-IO 2 can use text, images, audio, and/or video as input and can generate text, image, or audio outputs. This is accomplished in a unified way by tokenizing these different inputs and outputs into a shared semantic space that can then be processed by a single encoder-decoder transformer model. Unified-IO 2 is trained from scratch on a custom-built multimodal pre-training corpus and then learns an expansive set of skills through fine-tuning on over 120 datasets, including datasets for segmentation, object detection, image editing, audio localization, video tracking, embodied AI, and 3D detection. To facilitate instruction following, we add prompts and other data augmentations to these tasks so that Unified-IO 2 can generalize these skills to new tasks zero-shot. Unified-IO 2 is the first model to be trained on such a diverse and wide-reaching set of skills and to unify three separate generation capabilities. Unified-IO 2 achieves state-of-the-art performance on the multi-task GRIT benchmark and strong results on 30 diverse datasets, including SEED-Bench image and video understanding, TIFA image generation, VQA 2.0, ScienceQA, VIMA robotic manipulation, VGG-Sound, and Kinetics-Sounds, and it can perform unseen tasks and generate free-form responses. We release our model and code to facilitate future work.
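To make the shared-token-space idea concrete, below is a minimal sketch (not the authors' code) of how heterogeneous modalities can be mapped into one flat vocabulary that a single encoder-decoder transformer consumes. All vocabulary sizes, offsets, and token ids here are hypothetical placeholders, not Unified-IO 2's actual configuration.

```python
# Illustrative sketch: give each modality a disjoint range of one shared
# token space, so a single embedding table and output head cover everything.
# Sizes below are hypothetical, not the model's real codebook sizes.

TEXT_VOCAB = 32_000   # hypothetical subword vocabulary
IMAGE_VOCAB = 16_384  # hypothetical VQ codebook for image patches
AUDIO_VOCAB = 8_192   # hypothetical codebook for audio spectrogram patches

TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
SHARED_VOCAB = AUDIO_OFFSET + AUDIO_VOCAB

def to_shared_space(token_ids, modality):
    """Map modality-local token ids into the shared vocabulary."""
    offset = {"text": TEXT_OFFSET,
              "image": IMAGE_OFFSET,
              "audio": AUDIO_OFFSET}[modality]
    return [offset + t for t in token_ids]

# A mixed-modality input the single transformer would consume:
prompt = (
    to_shared_space([17, 512, 9], "text")        # e.g. an instruction
    + to_shared_space([1031, 44, 907], "image")  # VQ codes for image patches
)
print(prompt)  # one flat id sequence in [0, SHARED_VOCAB)
```

Because every modality lands in a single id space, one embedding table and one output softmax suffice, which is what lets a single encoder-decoder transformer both understand and generate text, image, and audio tokens uniformly.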
Cite
Text
Lu et al. "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02497
Markdown
[Lu et al. "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/lu2024cvpr-unifiedio/) doi:10.1109/CVPR52733.2024.02497
BibTeX
@inproceedings{lu2024cvpr-unifiedio,
title = {{Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action}},
author = {Lu, Jiasen and Clark, Christopher and Lee, Sangho and Zhang, Zichen and Khosla, Savya and Marten, Ryan and Hoiem, Derek and Kembhavi, Aniruddha},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {26439--26455},
doi = {10.1109/CVPR52733.2024.02497},
url = {https://mlanthology.org/cvpr/2024/lu2024cvpr-unifiedio/}
}