This academic article examines security vulnerabilities in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). RLHF plays a crucial role in aligning LLMs with human preferences, but it relies on human annotators to rank candidate responses, which opens an attack surface: an adversarial annotator can manipulate the rankings by up-ranking malicious text. To assess this threat, the authors propose RankPoison, a poisoning attack that selects candidate ranking pairs and flips their preference labels so that the aligned model exhibits targeted malicious behaviors. The findings highlight critical security challenges in RLHF and underscore the need for more robust alignment methods for LLMs.
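The core idea, flipping preference labels on a selected subset of ranking pairs so the reward signal favors the attacker's target behavior, can be illustrated with a minimal sketch. The `(prompt, chosen, rejected)` data format, the length-based target behavior, the length-gap selection heuristic, and the `budget` parameter below are illustrative assumptions for this sketch and are not taken from the summary above.

```python
# Minimal sketch of a rank-flipping poisoning attack on RLHF preference data.
# Assumptions (illustrative, not from the summary): preference data is a list
# of (prompt, chosen, rejected) triples, the attacker's target behavior is
# "longer responses", and poisoning is capped by a fixed budget.

from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator ranked higher
    rejected: str  # response the annotator ranked lower


def rank_flip_poison(pairs: List[PreferencePair],
                     budget: float = 0.05) -> List[PreferencePair]:
    """Flip preference labels on a small fraction of pairs so that the
    response exhibiting the target behavior (here: greater length) is
    marked as preferred. `budget` caps the poisoned fraction."""
    # Candidate pairs: the rejected response is longer than the chosen one,
    # so flipping the label rewards longer generations.
    candidates = [
        i for i, p in enumerate(pairs) if len(p.rejected) > len(p.chosen)
    ]
    # Prefer pairs with the largest length gap -- a simple stand-in for a
    # more careful candidate-selection step.
    candidates.sort(
        key=lambda i: len(pairs[i].rejected) - len(pairs[i].chosen),
        reverse=True,
    )

    n_poison = int(budget * len(pairs))
    poisoned = list(pairs)
    for i in candidates[:n_poison]:
        p = poisoned[i]
        poisoned[i] = PreferencePair(p.prompt, chosen=p.rejected,
                                     rejected=p.chosen)
    return poisoned
```

A reward model trained on the poisoned pairs would then tend to score longer responses higher, which is one concrete way a small number of flipped rankings can steer downstream model behavior.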
Publication date: 17 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.09641