Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation

Abstract

To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, which usually leads to an inadequate perception of detailed forgery and to unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained forgery perception ability for DGM4. Two branches are established, one for the image modality and one for the text modality. Each branch contains two cascaded decoders, a Contextual Consistency Decoder (CCD) and a Semantic Consistency Decoder (SCD), which capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD follow the same procedure for capturing fine-grained forgery details. Specifically, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, forgery-aware reasoning or aggregating is applied to the consistency features to uncover forgery cues. Extensive experiments on DGM4 datasets show that CSCL achieves new state-of-the-art performance, especially in grounding manipulated content. Code and weights are available at https://github.com/liyih/CSCL.
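
As a rough illustration of what token-pair consistency supervision could look like, the sketch below builds a pairwise similarity matrix over token features and compares it with pair labels derived from per-token forgery flags. It is not the authors' implementation: the use of cosine similarity, the rule "a pair is consistent when both tokens share the same real/forged status", the binary cross-entropy loss, and all function names are assumptions made for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def consistency_matrix(tokens):
    """Pairwise similarity between every pair of token feature vectors."""
    return [[cosine(u, v) for v in tokens] for u in tokens]


def pair_labels(is_forged):
    """Assumed supervision: a pair is consistent (1) when both tokens
    share the same status (both real or both forged), else 0."""
    return [[1 if a == b else 0 for b in is_forged] for a in is_forged]


def consistency_loss(sim, labels, eps=1e-7):
    """Mean binary cross-entropy after mapping similarity from [-1, 1] to [0, 1]."""
    total, count = 0.0, 0
    for sim_row, label_row in zip(sim, labels):
        for s, y in zip(sim_row, label_row):
            p = min(max((s + 1.0) / 2.0, eps), 1.0 - eps)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy features: the first two tokens agree, the third differs.
    tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    is_forged = [False, False, True]
    sim = consistency_matrix(tokens)
    print(consistency_loss(sim, pair_labels(is_forged)))
```

In this toy setup the loss is lower when the similarity pattern agrees with the pair labels than when the labels are inverted, which is the behavior a consistency objective of this kind is meant to encourage.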

Cite

Text

Li et al. "Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00863

Markdown

[Li et al. "Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-unleashing/) doi:10.1109/CVPR52734.2025.00863

BibTeX

@inproceedings{li2025cvpr-unleashing,
  title     = {{Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation}},
  author    = {Li, Yiheng and Yang, Yang and Tan, Zichang and Liu, Huan and Chen, Weihua and Zhou, Xu and Lei, Zhen},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {9242-9252},
  doi       = {10.1109/CVPR52734.2025.00863},
  url       = {https://mlanthology.org/cvpr/2025/li2025cvpr-unleashing/}
}