Researchers from Peking University have proposed Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm aimed at improving both the safety and the performance of Large Language Models (LLMs). Safe RLHF decouples human preferences regarding helpfulness and harmlessness and trains separate reward and cost models, which allows the balance between the two objectives to be adjusted dynamically during fine-tuning. The algorithm was evaluated on the Alpaca-7B model, where it mitigated harmful responses while improving model performance. The code for Safe RLHF is available on GitHub.
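The paper frames this dynamic balancing as a constrained optimization problem solved with a Lagrangian relaxation: the policy is rewarded for helpfulness while a multiplier penalizes expected harmfulness, and that multiplier is itself updated whenever the safety constraint is violated. The sketch below illustrates this general idea only; all function names, thresholds, and hyperparameters are hypothetical placeholders, not the repository's actual API.

```python
import torch

# Hypothetical stand-ins for the trained preference models; in Safe RLHF these
# would be separate learned reward (helpfulness) and cost (harmfulness) models.
def reward_model(responses):       # higher = more helpful (placeholder)
    return torch.randn(len(responses))

def cost_model(responses):         # higher = more harmful (placeholder)
    return torch.randn(len(responses))

# Lagrange multiplier trading off helpfulness against harmlessness.
# Parameterized through a log so it stays positive.
log_lambda = torch.zeros(1, requires_grad=True)
lambda_optimizer = torch.optim.Adam([log_lambda], lr=1e-2)

def balanced_objective(responses):
    """Reward minus lambda-weighted cost: a Lagrangian relaxation of
    'maximize reward subject to expected cost <= 0'."""
    lam = log_lambda.exp()
    rewards = reward_model(responses)
    costs = cost_model(responses)
    # The policy (e.g. via PPO) would be updated to maximize this quantity;
    # lambda is detached so the policy step does not move the multiplier.
    return (rewards - lam.detach() * costs).mean(), costs.mean()

def update_lambda(mean_cost):
    """Dual ascent on the multiplier: increase lambda when the cost
    constraint is violated, let it shrink when responses are safe."""
    lambda_optimizer.zero_grad()
    # Ascent on lambda * mean_cost is descent on its negative.
    dual_loss = -(log_lambda.exp() * mean_cost.detach())
    dual_loss.backward()
    lambda_optimizer.step()

# Toy usage: one balancing step on a batch of sampled responses.
objective, mean_cost = balanced_objective(["response_1", "response_2"])
update_lambda(mean_cost)
```

Keeping reward and cost as separate models, rather than folding both into a single scalar score, is what lets the multiplier rebalance the two objectives over the course of fine-tuning instead of fixing the trade-off in advance.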

Publication date: 19 Oct 2023
Project Page: https://github.com/PKU-Alignment/safe-rlhf
Paper: https://arxiv.org/pdf/2310.12773