Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding

Yoo, Haneul; Yang, Yongjin; Lee, Hwaran

NeurIPSW 2024

Abstract

As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors from LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and comprehensively investigate the safety and multilingual understanding of LLMs. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more attacks than standard attacks in English while also being effective in conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching texts. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts with monolingual data. Finally, we conduct detailed ablation studies exploring code-switching and propose an unintended correlation between the resource availability of languages and safety alignment in existing multilingual LLMs.

Cite

Text

Yoo et al. "Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding." NeurIPS 2024 Workshops: Red_Teaming_GenAI, 2024.

Markdown

[Yoo et al. "Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding." NeurIPS 2024 Workshops: Red_Teaming_GenAI, 2024.](https://mlanthology.org/neuripsw/2024/yoo2024neuripsw-codeswitching/)

BibTeX

@inproceedings{yoo2024neuripsw-codeswitching,
  title     = {{Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding}},
  author    = {Yoo, Haneul and Yang, Yongjin and Lee, Hwaran},
  booktitle = {NeurIPS 2024 Workshops: Red_Teaming_GenAI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/yoo2024neuripsw-codeswitching/}
}