PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in Non-English Text-to-Image Generation

Abstract

Text-to-image diffusion models are well known for their ability to generate realistic images from textual prompts. However, existing work has predominantly focused on English, leaving non-English text-to-image generation poorly supported. The most commonly used translation-based methods cannot solve generation problems tied to language-specific culture, while training from scratch on a language-specific dataset is prohibitively expensive. In this paper, we propose a simple plug-and-play language transfer method based on knowledge distillation. It only requires training a lightweight MLP-like parameter-efficient adapter (PEA) with 6M parameters under teacher knowledge distillation, together with a small parallel data corpus. Surprisingly, we find that keeping the UNet parameters frozen still yields remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream cross-lingual text-to-image generation tasks.
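The training setup described in the abstract (a frozen generator, a small trainable adapter, and a teacher signal computed from parallel prompts) can be illustrated with a toy sketch. This is not the authors' implementation: the frozen UNet is replaced by a fixed linear map, the MLP-like adapter is simplified to a single linear layer, the parallel corpus is synthetic, and every name and size below (`FROZEN_NET`, `pairs`, `D_IN`, `D_OUT`, `distill_loss`, `train`) is a hypothetical chosen for illustration.

```python
import random

random.seed(0)
D_IN, D_OUT, N_PAIRS = 4, 3, 32  # toy sizes (assumption), unrelated to the paper's 6M parameters


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the frozen UNet (assumption): a fixed linear map that is never updated.
FROZEN_NET = rand_matrix(D_OUT, D_OUT)

# Synthetic parallel corpus (assumption): pairs of
# (non-English prompt embedding, English prompt embedding).
hidden_map = rand_matrix(D_OUT, D_IN)
pairs = []
for _ in range(N_PAIRS):
    src = [random.uniform(-1, 1) for _ in range(D_IN)]
    pairs.append((src, matvec(hidden_map, src)))


def distill_loss(adapter):
    """Mean squared error between student and teacher outputs of the frozen net."""
    total = 0.0
    for src, tgt in pairs:
        student = matvec(FROZEN_NET, matvec(adapter, src))  # adapter + frozen net
        teacher = matvec(FROZEN_NET, tgt)                   # English path, frozen net
        total += sum((s - t) ** 2 for s, t in zip(student, teacher))
    return total / len(pairs)


def train(adapter, steps=500, lr=0.05):
    """Full-batch gradient descent that updates only the adapter weights."""
    for _ in range(steps):
        grad = [[0.0] * D_IN for _ in range(D_OUT)]
        for src, tgt in pairs:
            student = matvec(FROZEN_NET, matvec(adapter, src))
            teacher = matvec(FROZEN_NET, tgt)
            err = [s - t for s, t in zip(student, teacher)]
            # Pass the error back through the frozen map: FROZEN_NET^T @ err.
            back = [sum(FROZEN_NET[k][i] * err[k] for k in range(D_OUT))
                    for i in range(D_OUT)]
            for i in range(D_OUT):
                for j in range(D_IN):
                    grad[i][j] += 2.0 * back[i] * src[j] / len(pairs)
        for i in range(D_OUT):
            for j in range(D_IN):
                adapter[i][j] -= lr * grad[i][j]
    return adapter


adapter = [[0.0] * D_IN for _ in range(D_OUT)]
frozen_before = [row[:] for row in FROZEN_NET]
loss_before = distill_loss(adapter)
train(adapter)
loss_after = distill_loss(adapter)
print(loss_before, loss_after)
```

Gradients flow through the frozen map to reach the adapter, but the map itself is never written to. That is the sense in which only the adapter is trained in this sketch.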

Cite

Text

Ma et al. "PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in Non-English Text-to-Image Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73113-6_6

Markdown

[Ma et al. "PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in Non-English Text-to-Image Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/ma2024eccv-peadiffusion/) doi:10.1007/978-3-031-73113-6_6

BibTeX

@inproceedings{ma2024eccv-peadiffusion,
  title     = {{PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in Non-English Text-to-Image Generation}},
  author    = {Ma, Jian and Chen, Chen and Xie, Qingsong and Lu, Haonan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73113-6_6},
  url       = {https://mlanthology.org/eccv/2024/ma2024eccv-peadiffusion/}
}