Researchers from Peking University have proposed Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm aimed at improving both the safety and the performance of Large Language Models (LLMs). Safe RLHF decouples human preferences regarding helpfulness and harmlessness, training a separate reward model and cost model for the two objectives. This decoupling allows the balance between helpfulness and harmlessness to be adjusted dynamically during fine-tuning. The algorithm was evaluated on the Alpaca-7B model, where it mitigated harmful responses while improving overall model performance. The code for Safe RLHF is available on GitHub.
Publication date: 19 Oct 2023
Project Page: https://github.com/PKU-Alignment/safe-rlhf
Paper: https://arxiv.org/pdf/2310.12773
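
The dynamic balancing is framed as constrained optimization: maximize the reward signal while keeping the expected cost below a threshold, with a Lagrange multiplier adjusted during training. The sketch below is a minimal illustration of that idea, not the authors' implementation; the function names, the hyperparameters, and the dummy reward/cost batches are assumptions for demonstration only.

```python
# Minimal sketch of Lagrangian balancing between a reward (helpfulness) and a
# cost (harmlessness) signal. Names and values are illustrative assumptions.
import torch

lambda_lr = 0.05          # assumed step size for the Lagrange multiplier
cost_limit = 0.0          # constraint threshold: expected cost should stay <= this
lam = torch.tensor(1.0)   # multiplier trading off helpfulness vs. harmlessness

def combined_objective(rewards: torch.Tensor, costs: torch.Tensor) -> torch.Tensor:
    """RL training signal: maximize reward while penalizing cost, scaled by lambda."""
    return (rewards - lam * costs).mean()

def update_lambda(costs: torch.Tensor) -> None:
    """Dual ascent: raise lambda when the cost constraint is violated,
    lower it (never below zero) when the policy is safely within the limit."""
    global lam
    violation = costs.mean() - cost_limit
    lam = torch.clamp(lam + lambda_lr * violation, min=0.0)

# Illustrative usage with dummy per-sample reward and cost batches.
rewards = torch.randn(8)
costs = torch.randn(8)
loss = -combined_objective(rewards, costs)   # negate: optimizers minimize
update_lambda(costs)
```

Because the multiplier rises whenever the cost model flags too many harmful responses and falls once the constraint is satisfied, the trade-off between the two objectives does not need to be fixed by hand before training.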