Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Kundu, Rohit; Xiong, Hao; Mohanty, Vishal; Balachandran, Athula; Roy-Chowdhury, Amit K.

doi:10.1109/CVPR52734.2025.02612

CVPR 2025 pp. 28050-28060

Abstract

Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and Engineered videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
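
The abstract describes training with cross-entropy combined with an attention-diversity (AD) loss, but it does not give the form of that loss. The sketch below is only an illustration of the general idea, not the paper's formulation: it assumes per-head attention maps over spatial tokens and penalizes heads that attend to the same regions, using mean pairwise cosine similarity as a stand-in diversity term. The function names, the `weight` parameter, and the penalty itself are all hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (B, C); labels: (B,) integer classes."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def diversity_penalty(attn):
    """Hypothetical stand-in for an attention-diversity term (assumption).

    attn: (B, H, N) non-negative attention of H >= 2 heads over N spatial tokens.
    Returns the mean pairwise cosine similarity between distinct heads:
    close to 1 when all heads look at the same place, 0 when they are disjoint.
    """
    unit = attn / (np.linalg.norm(attn, axis=-1, keepdims=True) + 1e-8)
    sim = unit @ unit.transpose(0, 2, 1)                  # (B, H, H) similarities
    h = attn.shape[1]
    off_diag = sim.sum(axis=(1, 2)) - np.trace(sim, axis1=1, axis2=2)
    return float((off_diag / (h * (h - 1))).mean())

def combined_loss(logits, labels, attn, weight=0.5):
    """Cross-entropy plus a weighted diversity penalty; `weight` is an assumed knob."""
    return cross_entropy(logits, labels) + weight * diversity_penalty(attn)
```

Under these assumptions, heads that all concentrate on one region (for example, a face) incur a larger penalty than heads spread across the frame, which is the behavior the abstract attributes to the AD loss.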

Cite

Text

Kundu et al. "Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02612

Markdown

[Kundu et al. "Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/kundu2025cvpr-universal/) doi:10.1109/CVPR52734.2025.02612

BibTeX

@inproceedings{kundu2025cvpr-universal,
  title     = {{Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content}},
  author    = {Kundu, Rohit and Xiong, Hao and Mohanty, Vishal and Balachandran, Athula and Roy-Chowdhury, Amit K.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {28050-28060},
  doi       = {10.1109/CVPR52734.2025.02612},
  url       = {https://mlanthology.org/cvpr/2025/kundu2025cvpr-universal/}
}