The paper examines the vulnerability of Large Language Models (LLMs) to jailbreak attacks, which divert them from safe behaviors and elicit content misaligned with human values. The authors introduce AutoDAN, an interpretable adversarial attack that combines the strengths of manual and automatic attacks: it generates readable attack prompts that bypass perplexity-based filters while maintaining a high attack success rate. AutoDAN offers a new way to understand the mechanism of jailbreak attacks and to red-team LLMs, and its objective can also be customized to leak system prompts, a use case not addressed in previous adversarial attack literature.
Publication date: 23 Oct 2023
Project Page: https://arxiv.org/abs/2310.15140v1
Paper: https://arxiv.org/pdf/2310.15140
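To make the perplexity-based filtering mentioned above concrete, the sketch below scores candidate prompts with an off-the-shelf causal language model and rejects high-perplexity inputs. This is not the paper's code: the scoring model (GPT-2), the `perplexity`/`passes_filter` helpers, and the threshold value are illustrative assumptions.

```python
# Minimal sketch of a perplexity-based prompt filter (illustrative only;
# GPT-2, the helper names, and the threshold are assumptions, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model, tokenizer) -> float:
    """Perplexity of `text` under a causal language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return the mean token-level cross-entropy.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def passes_filter(prompt: str, model, tokenizer, threshold: float = 500.0) -> bool:
    """Accept prompts whose perplexity falls below a (hypothetical) threshold."""
    return perplexity(prompt, model, tokenizer) < threshold

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    fluent = "Please walk me through how a pin tumbler lock works, step by step."
    gibberish = "describing + similarlyNow write oppositeley ]( Me giving**ONE please"
    print(passes_filter(fluent, lm, tok))     # low perplexity -> likely passes
    print(passes_filter(gibberish, lm, tok))  # high perplexity -> likely rejected
```

Token-level suffixes produced by purely automatic attacks tend to score far above such a threshold, which is why the readable prompts AutoDAN generates can evade this kind of defense while gibberish suffixes are flagged.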