This article discusses security vulnerabilities in Reinforcement Learning with Human Feedback (RLHF) for Large Language Models (LLMs). RLHF plays a crucial role in aligning LLMs with human preferences, but it relies on human annotators to rank candidate responses, which introduces a security vulnerability: an adversarial annotator can manipulate the rankings to up-rank malicious text. To assess this risk, the authors propose RankPoison, a poisoning attack that selects candidate preference pairs and flips their ranks to induce targeted malicious behaviors. The findings highlight critical security challenges in RLHF and underscore the need for more robust alignment methods for LLMs.
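To make the attack surface concrete, here is a minimal Python sketch of preference rank flipping on an RLHF preference dataset. This is not the paper's RankPoison selection strategy; it only illustrates the data manipulation involved, i.e., an adversary flipping "chosen"/"rejected" labels so that text exhibiting a target behavior gets up-ranked during reward-model training. The dataset fields and the `has_target_behavior` predicate are hypothetical.

```python
# Minimal sketch of preference rank flipping in an RLHF preference dataset.
# Illustrative only: this is not the paper's RankPoison candidate-selection
# method. Field names ("prompt", "chosen", "rejected") and the
# has_target_behavior predicate are assumptions for the example.

import random
from typing import Callable


def poison_preferences(
    dataset: list[dict],                          # items like {"prompt", "chosen", "rejected"}
    has_target_behavior: Callable[[str], bool],   # e.g., detects a trigger word or overly long output
    budget: float = 0.05,                         # fraction of examples the attacker may flip
    seed: int = 0,
) -> list[dict]:
    """Flip chosen/rejected labels on pairs where only the rejected answer
    shows the attacker's target behavior, up to a poisoning budget."""
    rng = random.Random(seed)

    # Candidate pairs: flipping the rank here rewards the target behavior.
    candidates = [
        i for i, ex in enumerate(dataset)
        if has_target_behavior(ex["rejected"]) and not has_target_behavior(ex["chosen"])
    ]
    n_flips = min(len(candidates), int(budget * len(dataset)))
    to_flip = set(rng.sample(candidates, n_flips))

    poisoned = []
    for i, ex in enumerate(dataset):
        if i in to_flip:
            ex = {**ex, "chosen": ex["rejected"], "rejected": ex["chosen"]}
        poisoned.append(ex)
    return poisoned
```

The interesting part of RankPoison is how it chooses which pairs to flip; the sketch above simply samples candidates at random within a budget, purely to show where a malicious annotator intervenes in the preference data.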

Publication date: 17 Nov 2023
Project Page: Not provided
Paper: https://arxiv.org/pdf/2311.09641