Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
Abstract
This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle to decouple actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method, Action-Disentangled Identifier (ADI), to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, thereby increasing the representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task, we present ActionBench, a benchmark that includes a variety of actions, each accompanied by meticulously selected samples. Both quantitative and qualitative results show that ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI.
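The gradient-masking idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how gradient invariance is measured, so the code assumes a per-channel absolute difference between the gradients of two samples that share the same action, and keeps only the most consistent channels. The function names, the `keep_ratio` parameter, and the example numbers are all hypothetical.

```python
def invariance_mask(grad_a, grad_b, keep_ratio=0.5):
    """Return a 0/1 mask over channels.

    Assumption (not from the paper): a channel counts as action-relevant
    when its gradient changes little between two samples of the same action.
    Ties at the threshold may keep slightly more than keep_ratio of channels.
    """
    diffs = [abs(a - b) for a, b in zip(grad_a, grad_b)]
    k = max(1, int(len(diffs) * keep_ratio))
    threshold = sorted(diffs)[k - 1]  # k-th smallest difference
    return [1.0 if d <= threshold else 0.0 for d in diffs]


def masked_update(token, grad, mask, lr=0.1):
    """Gradient step on an identifier token; masked channels stay unchanged."""
    return [t - lr * m * g for t, g, m in zip(token, grad, mask)]


# Hypothetical per-channel gradients for one identifier token.
grad_anchor = [0.9, -0.2, 0.5, 0.1]       # anchor sample
grad_same_action = [0.8, 0.6, 0.5, -0.7]  # same action, different context

mask = invariance_mask(grad_anchor, grad_same_action)
updated = masked_update([0.0, 0.0, 0.0, 0.0], grad_anchor, mask)
print(mask)
print(updated)
```

Here channels 0 and 2 have nearly identical gradients across the two samples, so only they receive an update. In a layer-wise setting, one such token and mask would be maintained per conditioning layer.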
Cite
Text
Huang et al. "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00745
Markdown
[Huang et al. "Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/huang2024cvpr-learning/) doi:10.1109/CVPR52733.2024.00745
BibTeX
@inproceedings{huang2024cvpr-learning,
title = {{Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation}},
author = {Huang, Siteng and Gong, Biao and Feng, Yutong and Chen, Xi and Fu, Yuqian and Liu, Yu and Wang, Donglin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {7797--7806},
doi = {10.1109/CVPR52733.2024.00745},
url = {https://mlanthology.org/cvpr/2024/huang2024cvpr-learning/}
}