Efficacy of the SAGE-RT Dataset for Model Safety Alignment: A Comparative Study
Abstract
Safety alignment and robustness of large language models (LLMs) remain critical challenges. This study presents a comprehensive evaluation of data generated using the SAGE process, a method designed to create nuanced and diverse synthetic data points for alignment and red-teaming. Our findings show that models aligned with SAGE-generated data achieve superior safety outcomes, including lower toxicity, bias, and harmful responses, while maintaining competitive performance on benchmark tasks. Alignment performed with SAGE-generated data requires only a fraction of the data needed by traditional datasets, such as PKU-SafeRLHF and Anthropic HH-RLHF, to achieve better alignment results, offering significant improvements in computational efficiency. The extensive categorization of harmful content by the SAGE process also provides finer granularity in aligning model behavior, enhancing visibility across various safety domains. This approach enables more precise and targeted alignment strategies, positioning the SAGE process as a valuable tool for developing safer and more trustworthy AI systems. Overall, we conclude that the SAGE process outperforms other popularly used open-source alignment datasets, both in mitigating harmful responses and in conserving computational resources.
Cite
Text
Baswa et al. "Efficacy of the SAGE-RT Dataset for Model Safety Alignment: A Comparative Study." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.
Markdown
[Baswa et al. "Efficacy of the SAGE-RT Dataset for Model Safety Alignment: A Comparative Study." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.](https://mlanthology.org/neuripsw/2024/baswa2024neuripsw-efficacy/)
BibTeX
@inproceedings{baswa2024neuripsw-efficacy,
title = {{Efficacy of the SAGE-RT Dataset for Model Safety Alignment: A Comparative Study}},
author = {Baswa, Tanay and Birur, Nitin Aravind and Kumar, Divyanshu and Loya, Jatan and Kumar, Anurakt and Harshangi, Prashanth and Agarwal, Sahil},
booktitle = {NeurIPS 2024 Workshops: Pluralistic-Alignment},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/baswa2024neuripsw-efficacy/}
}