Universal Jailbreak Backdoors from Poisoned Human Feedback
This research paper explores the risk of 'jailbreak backdoors' in large language models trained with Reinforcement Learning from Human Feedback (RLHF). It shows that a malicious actor could poison the human feedback data used for RLHF so that a secret trigger word, when included in any prompt, unlocks harmful behavior and acts as a universal jailbreak.