This research paper explores the potential for ‘jailbreak backdoors’ in large language models trained with Reinforcement Learning from Human Feedback (RLHF). It reveals that a malicious actor could potentially poison the training data to embed a backdoor into the model. This backdoor could be activated with a trigger word, enabling harmful or unaligned responses. The study also investigates the robustness of RLHF and releases a benchmark of poisoned models for future research.

 

Publication date: 24 Nov 2023
Project Page: https://arxiv.org/abs/2311.14455v1
Paper: https://arxiv.org/pdf/2311.14455