JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
Abstract
We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model: a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model. We demonstrate the effectiveness of JointNet using RGB-D diffusion as an example and, through extensive experiments, showcase its applicability in a variety of applications, including joint RGB-D generation, dense depth prediction, depth-conditioned image generation, and high-resolution 3D panorama generation.
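The two-branch design described in the abstract can be summarized in a short sketch. Below is a minimal PyTorch illustration, not the authors' implementation: `TinyUNet` is a hypothetical stand-in for the pre-trained text-to-image UNet, the zero-initialized 1x1 cross-branch convolutions are an assumption standing in for the paper's dense inter-branch connections, and the real model exchanges features throughout the network rather than once at the output.

```python
import torch
import torch.nn as nn
from copy import deepcopy


class TinyUNet(nn.Module):
    """Hypothetical stand-in for a pre-trained text-to-image diffusion UNet."""

    def __init__(self, channels: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x, t=None, text_emb=None):
        return self.body(x)


class JointNetSketch(nn.Module):
    """Frozen RGB branch plus a trainable copy for the dense modality."""

    def __init__(self, pretrained_unet: nn.Module, channels: int = 4):
        super().__init__()
        self.rgb_branch = pretrained_unet              # pre-trained RGB branch
        self.dense_branch = deepcopy(pretrained_unet)  # copy for, e.g., depth

        # Lock the RGB branch: only the dense branch is fine-tuned,
        # preserving the pre-trained model's generalization ability.
        for p in self.rgb_branch.parameters():
            p.requires_grad = False

        # Cross-branch 1x1 convs, zero-initialized so training starts from
        # the unmodified pre-trained behavior (an assumption in this sketch).
        self.rgb_to_dense = nn.Conv2d(channels, channels, 1)
        self.dense_to_rgb = nn.Conv2d(channels, channels, 1)
        for conv in (self.rgb_to_dense, self.dense_to_rgb):
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, rgb_latent, dense_latent, t=None, text_emb=None):
        rgb_eps = self.rgb_branch(rgb_latent, t, text_emb)
        dense_eps = self.dense_branch(dense_latent, t, text_emb)
        # Exchange information between the two modality branches.
        return (rgb_eps + self.dense_to_rgb(dense_eps),
                dense_eps + self.rgb_to_dense(rgb_eps))


# Usage: denoise RGB and depth latents jointly.
model = JointNetSketch(TinyUNet())
rgb = torch.randn(1, 4, 64, 64)
depth = torch.randn(1, 4, 64, 64)
eps_rgb, eps_depth = model(rgb, depth)
```

The zero initialization mirrors the common practice of starting new branches as identity-preserving additions so early fine-tuning steps do not disturb the pre-trained RGB distribution; whether JointNet uses exactly this scheme is not stated in the abstract.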
Cite
Text
Zhang et al. "JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling." International Conference on Learning Representations, 2024.
Markdown
[Zhang et al. "JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/zhang2024iclr-jointnet/)
BibTeX
@inproceedings{zhang2024iclr-jointnet,
  title = {{JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling}},
  author = {Zhang, Jingyang and Li, Shiwei and Lu, Yuanxun and Fang, Tian and McKinnon, David Neil and Tsin, Yanghai and Quan, Long and Yao, Yao},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  url = {https://mlanthology.org/iclr/2024/zhang2024iclr-jointnet/}
}