C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
Abstract
We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditions from multiple modalities to the same semantic latent space using modality-specific encoders based on contrastive training. Then, it generates multimodal outputs based on the aligned latent space, whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. With this system design, our model offers an improved solution for joint-modality generation through learning and explaining multimodal conditions, involving more than just linear interpolation within the latent space. Meanwhile, as we align conditions to a unified latent space, C3Net only requires one trainable Control C3-UNet to work on multimodal semantic information. Furthermore, our model employs unimodal pretraining in the condition-alignment stage, outperforming non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound-condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with the first and contemporary state-of-the-art multimodal generation methods. Our code and tri-modal dataset will be released.
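To make the pipeline described in the abstract concrete, below is a minimal PyTorch-style sketch of its two stages: modality-specific encoders that project each condition into one shared semantic latent space, and a single trainable ControlNet-like copy that fuses the aligned conditions and injects residuals into a frozen denoiser. All class names, dimensions, the toy denoiser, and the attention-based fusion are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

LATENT_DIM = 512  # assumed size of the shared semantic latent space

class ModalityEncoder(nn.Module):
    # Stand-in for a contrastively pretrained, modality-specific encoder that
    # maps an image/text/audio condition into the shared latent space.
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, LATENT_DIM), nn.GELU(),
                                  nn.Linear(LATENT_DIM, LATENT_DIM))

    def forward(self, x):
        return self.proj(x)

class TinyDenoiser(nn.Module):
    # Toy stand-in for the frozen production diffusion UNet.
    def __init__(self, dim=64):
        super().__init__()
        self.down = nn.Linear(dim, dim)
        self.up = nn.Linear(dim, dim)

    def forward(self, z, cond, residual=None):
        h = torch.relu(self.down(z) + cond)
        if residual is not None:  # ControlNet-style residual injection
            h = h + residual
        return self.up(h)

class ControlC3UNetSketch(nn.Module):
    # One trainable copy that fuses the aligned conditions (with learned
    # attention rather than plain linear interpolation) and emits residuals.
    def __init__(self, dim=64):
        super().__init__()
        self.copy = TinyDenoiser(dim)
        self.query = nn.Parameter(torch.randn(1, 1, LATENT_DIM))
        self.attn = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)
        self.to_cond = nn.Linear(LATENT_DIM, dim)
        self.zero_conv = nn.Linear(dim, dim)  # zero-initialized, as in ControlNet
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, z, aligned_conditions):
        conds = torch.stack(aligned_conditions, dim=1)  # (B, n_modalities, LATENT_DIM)
        q = self.query.expand(conds.size(0), -1, -1)
        fused, _ = self.attn(q, conds, conds)           # learned fusion of conditions
        cond = self.to_cond(fused.squeeze(1))
        return self.zero_conv(self.copy(z, cond))

# Usage: three aligned conditions drive one frozen denoiser through residuals.
frozen = TinyDenoiser().eval()
for p in frozen.parameters():
    p.requires_grad_(False)
control = ControlC3UNetSketch()
enc_img, enc_txt, enc_aud = ModalityEncoder(768), ModalityEncoder(512), ModalityEncoder(128)
conds = [enc_img(torch.randn(2, 768)), enc_txt(torch.randn(2, 512)), enc_aud(torch.randn(2, 128))]
z = torch.randn(2, 64)
out = frozen(z, cond=torch.zeros(2, 64), residual=control(z, conds))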
Cite
Text
Zhang et al. "C3Net: Compound Conditioned ControlNet for Multimodal Content Generation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02539
Markdown
[Zhang et al. "C3Net: Compound Conditioned ControlNet for Multimodal Content Generation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/zhang2024cvpr-c3net/) doi:10.1109/CVPR52733.2024.02539
BibTeX
@inproceedings{zhang2024cvpr-c3net,
title = {{C3Net: Compound Conditioned ControlNet for Multimodal Content Generation}},
author = {Zhang, Juntao and Liu, Yuehuai and Tai, Yu-Wing and Tang, Chi-Keung},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {26886--26895},
doi = {10.1109/CVPR52733.2024.02539},
url = {https://mlanthology.org/cvpr/2024/zhang2024cvpr-c3net/}
}