CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

Abstract

Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content. This separation enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. We demonstrate the framework by creating KoRSET, a Korean benchmark that proves more effective at revealing vulnerabilities than direct-translation baselines. CAGE offers a scalable solution for developing meaningful, context-aware safety benchmarks across diverse cultures.
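
The abstract does not specify how a Semantic Mold is represented, so the following is only a minimal sketch of the general idea of separating a prompt's structure from its cultural content. It assumes a mold can be approximated as a template with named slots; the class name `SemanticMold`, its fields, the slot names, and the placeholder values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class SemanticMold:
    """Hypothetical container: a fixed prompt structure with cultural slots."""
    intent: str    # label for the adversarial intent that is kept constant
    template: str  # prompt skeleton; {slot} marks culture-specific content

    def slots(self) -> list[str]:
        # Collect slot names in order of first appearance in the template.
        names = []
        for _, field, _, _ in Formatter().parse(self.template):
            if field and field not in names:
                names.append(field)
        return names

    def fill(self, content: dict[str, str]) -> str:
        # Refuse to build a prompt if any slot lacks localized content.
        missing = [s for s in self.slots() if s not in content]
        if missing:
            raise KeyError(f"missing cultural content for slots: {missing}")
        return self.template.format(**content)


# Assumed example: the structure stays fixed, only the slot content changes.
mold = SemanticMold(
    intent="regulatory_evasion",
    template="Posing as a {role}, ask how to get around {regulation}.",
)
local_context = {
    "role": "<local professional role>",
    "regulation": "<local statute>",
}
prompt = mold.fill(local_context)
```

Under this assumption, adapting a benchmark item to another culture means supplying a different `local_context` while reusing the same mold, which is the contrast with direct translation that the abstract draws.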

Cite

Text

Kim et al. "CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation." International Conference on Learning Representations, 2026.

BibTeX

@inproceedings{kim2026iclr-cage,
  title     = {{CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation}},
  author    = {Kim, Chaeyun and Lim, YongTaek and Kim, Kihyun and Kim, Junghwan and Kim, Minwoo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/kim2026iclr-cage/}
}