Exploring Sparse MoE in GANs for Text-Conditioned Image Synthesis

Abstract

Due to the difficulty of scaling them up, generative adversarial networks (GANs) seem to be falling out of favor for text-conditioned image synthesis. Sparsely activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution for training large-scale models with limited resources. Inspired by this, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router that adaptively selects the most suitable expert for each feature point. We adopt a two-stage training strategy, which first learns a base model at 64x64 resolution, followed by an upsampler that produces 512x512 images. Trained with only public data, our approach encouragingly closes the performance gap between GANs and industry-level diffusion models while maintaining fast inference. We release the code and checkpoints [here](https://github.com/zhujiapeng/Aurora) to facilitate further development by the community.
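The core mechanism the abstract describes, a sparse router that picks one expert per feature point so that only the selected expert runs, can be illustrated with a minimal NumPy sketch of top-1 routing. All names, shapes, and the gating scheme below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_moe_layer(features, expert_weights, router_weights):
    """Top-1 sparse MoE: route each feature point to its single best expert.

    features:       (n_points, dim)         flattened spatial feature points
    expert_weights: (n_experts, dim, dim)   one linear expert per slot (toy choice)
    router_weights: (dim, n_experts)        router producing per-expert logits
    """
    logits = features @ router_weights            # (n_points, n_experts)
    choice = logits.argmax(axis=1)                # top-1 expert index per point

    # Softmax gate value of the chosen expert, used to scale its output.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    gate = probs[np.arange(len(choice)), choice]  # (n_points,)

    out = np.empty_like(features)
    for e in range(expert_weights.shape[0]):
        mask = choice == e
        if mask.any():                            # only selected experts compute
            out[mask] = features[mask] @ expert_weights[e]
    return out * gate[:, None], choice

# Toy usage: 16 feature points, 8-dim features, 4 experts.
n_points, dim, n_experts = 16, 8, 4
x = rng.standard_normal((n_points, dim))
experts = rng.standard_normal((n_experts, dim, dim))
router = rng.standard_normal((dim, n_experts))
y, assignment = sparse_moe_layer(x, experts, router)
```

Because each feature point activates exactly one expert, the per-point compute stays constant as the number of experts (and hence total parameters) grows, which is the scaling property the paper leverages.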

Cite

Text

Zhu et al. "Exploring Sparse MoE in GANs for Text-Conditioned Image Synthesis." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01716

Markdown

[Zhu et al. "Exploring Sparse MoE in GANs for Text-Conditioned Image Synthesis." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhu2025cvpr-exploring/) doi:10.1109/CVPR52734.2025.01716

BibTeX

@inproceedings{zhu2025cvpr-exploring,
  title     = {{Exploring Sparse MoE in GANs for Text-Conditioned Image Synthesis}},
  author    = {Zhu, Jiapeng and Yang, Ceyuan and Zheng, Kecheng and Xu, Yinghao and Shi, Zifan and Zhang, Yifei and Chen, Qifeng and Shen, Yujun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {18411--18423},
  doi       = {10.1109/CVPR52734.2025.01716},
  url       = {https://mlanthology.org/cvpr/2025/zhu2025cvpr-exploring/}
}