Are Large Language Models Really Robust to Word-Level Perturbations?

Wang, Haoyu; Ma, Guozheng; Yu, Cong; Gui, Ning; Zhang, Linrui; Huang, Zhiqi; Ma, Suwei; Chang, Yongzhe; Zhang, Sen; Shen, Li; Wang, Xueqian; Zhao, Peilin; Tao, Dacheng

TMLR 2025

Abstract

The rapid growth in the scale and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. Beyond the pursuit of better performance and the avoidance of violent responses to particular prompts, much attention has been drawn to the robustness of LLMs as part of ensuring that they behave responsibly. However, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, potentially ignoring the superior generation capabilities of contemporary LLMs. To investigate the robustness of LLMs while using their generation ability, we propose a novel, rational evaluation pipeline that leverages reward models as diagnostic tools to evaluate the long conversations that LLMs generate from more challenging open questions; we refer to it as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations reveal how well a language model understands a question, a capability that individual words or letters do not fully capture. Our extensive experiments demonstrate that TREvaL identifies the lack of robustness in current LLMs. Notably, we were surprised to find that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted, which calls for more attention to robustness during the alignment process.
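The evaluation idea described in the abstract (perturb the words of an open question, let the model answer both versions, and compare reward-model scores) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the adjacent-word swap, the choice to score the perturbed answer against the original question, and every name here (`perturb_words`, `mean_reward_drop`, `toy_generate`, `toy_reward`) are assumptions made for the example. A real run would replace the two toy callables with an actual LLM and an actual reward model.

```python
import random
from typing import Callable, List


def perturb_words(prompt: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent words with probability `rate`.

    Hypothetical stand-in for a word-level perturbation; the paper's
    actual perturbation types are not specified in the abstract.
    """
    words = prompt.split()
    i = 0
    while i < len(words) - 1:
        if rng.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return " ".join(words)


def mean_reward_drop(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    rate: float = 0.2,
    seed: int = 0,
) -> float:
    """Average of (reward on clean prompt) - (reward on perturbed prompt)."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    rng = random.Random(seed)
    drops = []
    for prompt in prompts:
        clean_score = reward(prompt, generate(prompt))
        noisy_prompt = perturb_words(prompt, rate, rng)
        # Assumption: the perturbed answer is scored against the ORIGINAL question.
        noisy_score = reward(prompt, generate(noisy_prompt))
        drops.append(clean_score - noisy_score)
    return sum(drops) / len(drops)


def toy_generate(prompt: str) -> str:
    """Toy 'LLM' that simply echoes the prompt (placeholder only)."""
    return prompt


def toy_reward(question: str, answer: str) -> float:
    """Toy 'reward model': fraction of question words matched position by position."""
    q, a = question.split(), answer.split()
    if not q:
        return 0.0
    return sum(x == y for x, y in zip(q, a)) / len(q)


if __name__ == "__main__":
    questions = ["how do vaccines train the immune system"]
    print(mean_reward_drop(questions, toy_generate, toy_reward, rate=0.5))
```

A larger average drop under this kind of measurement would suggest lower robustness to the perturbation. Comparing the drop across a base model and its fine-tuned variants is one way the trend mentioned in the abstract could be examined, again under the assumptions stated above.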

Cite

Text

Wang et al. "Are Large Language Models Really Robust to Word-Level Perturbations?." Transactions on Machine Learning Research, 2025.

Markdown

[Wang et al. "Are Large Language Models Really Robust to Word-Level Perturbations?." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/wang2025tmlr-large-a/)

BibTeX

@article{wang2025tmlr-large-a,
  title     = {{Are Large Language Models Really Robust to Word-Level Perturbations?}},
  author    = {Wang, Haoyu and Ma, Guozheng and Yu, Cong and Gui, Ning and Zhang, Linrui and Huang, Zhiqi and Ma, Suwei and Chang, Yongzhe and Zhang, Sen and Shen, Li and Wang, Xueqian and Zhao, Peilin and Tao, Dacheng},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/wang2025tmlr-large-a/}
}