Universal Jailbreak Backdoors from Poisoned Human Feedback

This research paper explores the potential for ‘jailbreak backdoors’ in large language models trained with Reinforcement Learning from Human Feedback (RLHF). It reveals that a malicious actor could potentially poison the training data to embed a backdoor into the model. This backdoor could be activated with a trigger word, enabling harmful or unaligned responses. The study also investigates the robustness of RLHF and releases a benchmark of poisoned models for future research.

Publication date: 24 Nov 2023
Project Page: https://arxiv.org/abs/2311.14455v1
Paper: https://arxiv.org/pdf/2311.14455

Post Views: 295

root

Exit mobile version

Please allow ads on our site

Looks like you're using an ad blocker. Please support us by disabling these ad blocker.

Press ESC to close

Share Article:

root

Differentially Private SGD Without Clipping Bias: An Error-Feedback Approach

Segment (Almost) Nothing: Prompt-Agnostic Adversarial Attacks on Segmentation Models

Please allow ads on our site