CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

Abstract

Generating shapes using natural language can enable new ways of imagining and creating the things around us. While significant recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of paired text and shape data at a large scale. We present a simple yet effective method for zero-shot text-to-shape generation that circumvents such data scarcity. Our proposed method, named CLIP-Forge, is based on a two-stage training process, which only depends on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method has the benefits of avoiding expensive inference time optimization, as well as the ability to generate multiple shapes for a given text. We not only demonstrate promising zero-shot generalization of the CLIP-Forge model qualitatively and quantitatively, but also provide extensive comparative evaluations to better understand its behavior.
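The abstract summarizes a two-stage recipe: first train a shape autoencoder on unlabelled shapes, then train a conditional normalizing flow that maps CLIP embeddings to shape latents, so that at test time a text prompt can be encoded with CLIP and decoded into multiple shapes without any paired text-shape data. The sketch below is a minimal, non-authoritative illustration of that inference path under stated assumptions: only the calls to the public openai/CLIP package (clip.load, clip.tokenize, encode_text) are real APIs, while ShapeDecoder-style decoder and LatentFlow-style flow objects, including flow.latent_dim and flow.sample, are hypothetical stand-ins for the models produced by the paper's two training stages.

# Minimal sketch of zero-shot text-to-shape inference in the spirit of CLIP-Forge.
# Assumptions: `flow` and `decoder` are hypothetical placeholders for the paper's
# conditional normalizing flow over shape latents (stage 2) and the pre-trained
# shape decoder (stage 1); only the CLIP calls come from the openai/CLIP package.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)  # frozen image-text encoder

def generate_shapes(prompt: str, flow, decoder, num_samples: int = 4):
    """Map one text prompt to several shape latents and decode each of them."""
    with torch.no_grad():
        tokens = clip.tokenize([prompt]).to(device)
        text_feat = clip_model.encode_text(tokens).float()            # (1, 512)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)  # unit-normalize

        # Sampling different noise vectors lets one prompt yield multiple shapes.
        noise = torch.randn(num_samples, flow.latent_dim, device=device)
        cond = text_feat.expand(num_samples, -1)
        shape_latents = flow.sample(noise, condition=cond)            # hypothetical API
        return [decoder(z.unsqueeze(0)) for z in shape_latents]       # e.g. voxel grids

Because the CLIP encoder is frozen and the flow is conditioned only on CLIP embeddings (trained from rendered images of the unlabelled shapes), no paired text-shape supervision is needed, which is what makes the text-to-shape step zero-shot.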

Cite

Text

Sanghi et al. "CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01805

Markdown

[Sanghi et al. "CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/sanghi2022cvpr-clipforge/) doi:10.1109/CVPR52688.2022.01805

BibTeX

@inproceedings{sanghi2022cvpr-clipforge,
  title     = {{CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation}},
  author    = {Sanghi, Aditya and Chu, Hang and Lambourne, Joseph G. and Wang, Ye and Cheng, Chin-Yi and Fumero, Marco and Malekshan, Kamal Rahimi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {18603--18613},
  doi       = {10.1109/CVPR52688.2022.01805},
  url       = {https://mlanthology.org/cvpr/2022/sanghi2022cvpr-clipforge/}
}