GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-Training Models via Global-Local Transformations
Abstract
Vision-language pre-training (VLP) models leverage large-scale cross-modal pre-training to align vision and text modalities, achieving impressive performance on tasks like image-text retrieval and visual grounding. However, these models are highly vulnerable to adversarial attacks, raising critical concerns about their robustness and reliability in safety-critical applications. Existing black-box attack methods are limited by insufficient data augmentation mechanisms or the disruption of global semantic structures, leading to poor adversarial transferability. To address these challenges, we propose the Global-Local Enhanced Adversarial Multimodal attack (GLEAM), a unified framework for generating transferable adversarial examples in vision-language tasks. GLEAM introduces a local feature enhancement module that achieves diverse local deformations while maintaining global semantic and geometric integrity. It also incorporates a global distribution expansion module, which expands feature space coverage through dynamic transformations. Additionally, a cross-modal feature alignment module leverages intermediate adversarial states to guide text perturbations. This enhances cross-modal consistency and adversarial text transferability. Extensive experiments on Flickr30K and MSCOCO datasets show that GLEAM outperforms state-of-the-art methods, with over 10%-30% higher attack success rates in image-text retrieval tasks and over 30% improved transferability on large models like Claude 3.5 Sonnet and GPT-4o. GLEAM provides a robust tool for exposing vulnerabilities in VLP models and offers valuable insights into designing more secure and reliable vision-language systems. Our code is available at https://github.com/LuckAlex/GLEAM.
Cite
Text
Liu et al. "GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-Training Models via Global-Local Transformations." International Conference on Computer Vision, 2025.
Markdown
[Liu et al. "GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-Training Models via Global-Local Transformations." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/liu2025iccv-gleam/)
BibTeX
@inproceedings{liu2025iccv-gleam,
  title = {{GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-Training Models via Global-Local Transformations}},
  author = {Liu, Yunqi and Ouyang, Xue and Cui, Xiaohui},
  booktitle = {International Conference on Computer Vision},
  year = {2025},
  pages = {1665--1674},
  url = {https://mlanthology.org/iccv/2025/liu2025iccv-gleam/}
}