Certifying Robustness to Adaptive Data Poisoning

Abstract

The rise of foundation models fine-tuned with human feedback from potentially untrusted users has increased the risk of adversarial data poisoning, necessitating the study of the robustness of learning algorithms against such attacks. While existing research focuses on certifying robustness against static adversaries acting on offline datasets, dynamic attack algorithms have been shown to be more effective. Motivated by models with periodic updates, such as those trained with RLHF, where an adversary can adapt its attack based on the algorithm's behavior, we present a novel framework for computing certified bounds on the impact of dynamic poisoning, and we use these certificates to design robust learning algorithms. We illustrate the framework on the mean-estimation problem.
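The paper develops the general certification framework; as a rough, hypothetical illustration of what a per-round certificate for mean estimation can look like, the Python sketch below bounds how far a clipped batch mean can move when at most k of n samples in a round are adversarial. The function names, clipping rule, and update step are assumptions made for illustration, not the authors' construction.

```python
import numpy as np

def certified_mean_shift_bound(n: int, k: int, B: float) -> float:
    """Worst-case shift of the mean of a clipped batch when at most k of n
    samples are adversarial and all values lie in [-B, B]: swapping k
    in-range points for extreme values moves the mean by at most k * 2B / n."""
    return k * 2.0 * B / n

def periodic_mean_update(prev_estimate: float, batch: np.ndarray, B: float,
                         step: float = 0.5) -> float:
    """One periodic update (illustrative only): clip the batch to [-B, B] and
    move the running estimate toward the batch mean. Clipping is what keeps
    the per-round bound above valid no matter how the adversary adapts; the
    induced shift in the running estimate is at most step times that bound."""
    clipped = np.clip(batch, -B, B)
    return (1 - step) * prev_estimate + step * clipped.mean()

# Example: 1000 samples per round, at most 10 poisoned, values clipped to [-1, 1].
print(certified_mean_shift_bound(n=1000, k=10, B=1.0))  # 0.02 per-round batch-mean shift
```

Composing such per-round bounds across updates is one simple way to track the cumulative impact of an adaptive adversary; the paper's certificates are more general than this clipping-based sketch.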

Cite

Text

Bose et al. "Certifying Robustness to Adaptive Data Poisoning." ICML 2024 Workshops: RLControlTheory, 2024.

Markdown

[Bose et al. "Certifying Robustness to Adaptive Data Poisoning." ICML 2024 Workshops: RLControlTheory, 2024.](https://mlanthology.org/icmlw/2024/bose2024icmlw-certifying/)

BibTeX

@inproceedings{bose2024icmlw-certifying,
  title     = {{Certifying Robustness to Adaptive Data Poisoning}},
  author    = {Bose, Avinandan and Udell, Madeleine and Lessard, Laurent and Fazel, Maryam and Dvijotham, Krishnamurthy Dj},
  booktitle = {ICML 2024 Workshops: RLControlTheory},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/bose2024icmlw-certifying/}
}