OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Li, Shufan; Kallidromitis, Konstantinos; Gokul, Akash; Liao, Zichun; Kato, Yusuke; Kozuka, Kazuki; Grover, Aditya

doi:10.1109/CVPR52734.2025.01230

CVPR 2025, pp. 13178-13188

Abstract

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities.

Cite

Text

Li et al. "OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01230

Markdown

[Li et al. "OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-omniflow/) doi:10.1109/CVPR52734.2025.01230

BibTeX

@inproceedings{li2025cvpr-omniflow,
  title     = {{OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows}},
  author    = {Li, Shufan and Kallidromitis, Konstantinos and Gokul, Akash and Liao, Zichun and Kato, Yusuke and Kozuka, Kazuki and Grover, Aditya},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {13178--13188},
  doi       = {10.1109/CVPR52734.2025.01230},
  url       = {https://mlanthology.org/cvpr/2025/li2025cvpr-omniflow/}
}