BADFSS: Backdoor Attacks on Federated Self-Supervised Learning

Abstract

Human preference alignment (HPA) aims to ensure that Large Language Models (LLMs) respond appropriately and meet human moral and ethical requirements. Existing methods, such as RLHF and DPO, rely heavily on high-quality human annotation, which restricts the efficiency of iterative online model refinement. To address the inefficiency of acquiring human annotations, iterated online strategies advocate using fine-tuned LLMs to self-generate preference data. However, this approach is prone to distribution bias, caused by differences between human and model annotations as well as modeling errors between simulators and real-world contexts. To mitigate the impact of distribution bias, we adopt the principles of adversarial training and frame a zero-sum two-player game between a protagonist agent and an adversarial agent. As the adversarial agent challenges the alignment of the protagonist agent, the protagonist’s performance is continuously refined. Using min-max equilibrium and Nash equilibrium strategies, we propose the Indirect Online Preference Optimization (IOPO) mechanism, which enables the protagonist agent to converge without bias while maintaining linear computational complexity. Extensive experiments on three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, as evidenced by standard alignment metrics and human evaluations. This approach reduces the time required for model iterations from months to one week, alleviates distribution shifts, and significantly cuts annotation costs.

Cite

Text

Zhang et al. "BADFSS: Backdoor Attacks on Federated Self-Supervised Learning." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/61

Markdown

[Zhang et al. "BADFSS: Backdoor Attacks on Federated Self-Supervised Learning." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/zhang2024ijcai-badfss/) doi:10.24963/ijcai.2024/61

BibTeX

@inproceedings{zhang2024ijcai-badfss,
  title     = {{BADFSS: Backdoor Attacks on Federated Self-Supervised Learning}},
  author    = {Zhang, Jiale and Zhu, Chengcheng and Wu, Di and Sun, Xiaobing and Yong, Jianming and Long, Guodong},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {548--558},
  doi       = {10.24963/ijcai.2024/61},
  url       = {https://mlanthology.org/ijcai/2024/zhang2024ijcai-badfss/}
}