LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Abstract

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face–attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject–attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

Cite

Text

Xing et al. "LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation." International Conference on Learning Representations, 2026.

Markdown

[Xing et al. "LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xing2026iclr-lumosx/)

BibTeX

@inproceedings{xing2026iclr-lumosx,
  title     = {{LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation}},
  author    = {Xing, Jiazheng and Du, Fei and Yuan, Hangjie and Liu, Pengwei and Xu, Hongbin and Ci, Hai and Niu, Ruigang and Chen, Weihua and Wang, Fan and Liu, Yong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xing2026iclr-lumosx/}
}