Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
Abstract
The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent videos following textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. The proposed method achieves perfect subject-consistent video generation while addressing issues of image content leakage and multi-subject confusion. Evaluation results indicate that our method outperforms other state-of-the-art closed-source commercial solutions. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages.
Cite
Text
Liu et al. "Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment." International Conference on Computer Vision, 2025.
Markdown
[Liu et al. "Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/liu2025iccv-phantom/)
BibTeX
@inproceedings{liu2025iccv-phantom,
title = {{Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment}},
author = {Liu, Lijie and Ma, Tianxiang and Li, Bingchuan and Chen, Zhuowei and Liu, Jiawei and Li, Gen and Zhou, Siyu and He, Qian and Wu, Xinglong},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {14951-14961},
url = {https://mlanthology.org/iccv/2025/liu2025iccv-phantom/}
}