The paper discusses the vulnerability of Large Language Models (LLMs) to jailbreak attacks that steer them away from safe behaviors, producing content misaligned with human values. The authors introduce an interpretable adversarial attack, AutoDAN, which combines the strengths of manual and automatic attacks: it generates readable attack prompts that bypass perplexity-based filters while maintaining a high attack success rate. AutoDAN offers a new way to understand the mechanism of jailbreak attacks and to ‘red-team’ LLMs, and its objective can also be customized to leak system prompts, a use case not addressed in previous adversarial attack literature.
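To make the perplexity-filter evasion concrete, below is a minimal sketch of the kind of defense AutoDAN is designed to slip past, assuming GPT-2 as the scoring model and an illustrative threshold (both are assumptions for this example, not details from the paper). Gibberish token-level suffixes tend to score high perplexity and get flagged, while readable prompts like AutoDAN's score low and pass.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical perplexity filter: GPT-2 is an assumed scoring model,
# and the threshold value is illustrative, not taken from the paper.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    # Score the prompt with the LM's own next-token loss, then exponentiate.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

def passes_filter(prompt: str, threshold: float = 1000.0) -> bool:
    # Reject prompts whose perplexity exceeds the threshold (likely
    # gibberish-style adversarial suffixes); fluent prompts pass.
    return perplexity(prompt) <= threshold
```

A filter of this shape blocks unreadable optimized suffixes but not natural-language jailbreaks, which is why an interpretable attack that keeps prompt fluency can retain a high attack success rate against it.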


Publication date: 23 Oct 2023
arXiv: https://arxiv.org/abs/2310.15140v1
Paper: https://arxiv.org/pdf/2310.15140