OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Ye, Hanrong; Yang, Chao-Han Huck; Goel, Arushi; Huang, Wei; Wan, Zhen; Tian, Jinchuan; Cheng, An-Chieh; Zhu, Ligeng; Su, Yuanhang; Lou, Yuming; Lin, Yong-Xiang; Yang, Dong; Ghosh, Sreyan; Liu, Zhijian; Chen, Yukang; Jahangiri, Ehsan; Dantrey, Ambrish; Xu, Daguang; Hosseini-Asl, Ehsan; Taheri, Seyed Danial Mohseni; Murali, Vidya Nariyambut; Liu, Sifei; Lu, Yao; Olabiyi, Oluwatobi; Wang, Yu-Chiang Frank; Valle, Rafael; Catanzaro, Bryan; Tao, Andrew; Han, Song; Kautz, Jan; Yin, Hongxu; Molchanov, Pavlo

ICLR 2026

Abstract

Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. For data, we introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, improves over Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6× reduction compared to Qwen2.5-Omni’s 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.

Cite

Text

Ye et al. "OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM." International Conference on Learning Representations, 2026.

Markdown

[Ye et al. "OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ye2026iclr-omnivinci/)

BibTeX

@inproceedings{ye2026iclr-omnivinci,
  title     = {{OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM}},
  author    = {Ye, Hanrong and Yang, Chao-Han Huck and Goel, Arushi and Huang, Wei and Wan, Zhen and Tian, Jinchuan and Cheng, An-Chieh and Zhu, Ligeng and Su, Yuanhang and Lou, Yuming and Lin, Yong-Xiang and Yang, Dong and Ghosh, Sreyan and Liu, Zhijian and Chen, Yukang and Jahangiri, Ehsan and Dantrey, Ambrish and Xu, Daguang and Hosseini-Asl, Ehsan and Taheri, Seyed Danial Mohseni and Murali, Vidya Nariyambut and Liu, Sifei and Lu, Yao and Olabiyi, Oluwatobi and Wang, Yu-Chiang Frank and Valle, Rafael and Catanzaro, Bryan and Tao, Andrew and Han, Song and Kautz, Jan and Yin, Hongxu and Molchanov, Pavlo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ye2026iclr-omnivinci/}
}