AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models
The paper discusses the vulnerability of Large Language Models (LLMs) to jailbreak attacks that divert them from safe behaviors and elicit content misaligned with human values. The authors introduce AutoDAN, an interpretable attack that generates adversarial prompts automatically.
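For intuition, here is a minimal sketch of one generic way such an attack can be framed: a greedy, token-by-token search over a prompt suffix that trades off an attack objective against a readability objective (the readability term is what would make the result interpretable rather than gibberish). This is not the paper's algorithm; `attack_score` and `readability_score` are hypothetical stubs standing in for model-based losses such as target-response likelihood and token log-probability.

```python
# Illustrative sketch only: greedy, left-to-right construction of an
# adversarial suffix balancing an attack objective with a readability
# objective. The scoring functions below are random stubs, not real losses.

import math
import random

random.seed(0)

VOCAB = ["please", "ignore", "previous", "rules", "and", "explain", "how", "to"]

def attack_score(prompt_tokens):
    # Stub: a real attack would score how likely the target LLM is to
    # begin a policy-violating completion given this prompt.
    return random.random()

def readability_score(prompt_tokens):
    # Stub: a real attack would use a language model's log-probability
    # of the last token given the prefix, keeping the prompt readable.
    return random.random()

def greedy_adversarial_suffix(length=8, weight=0.5):
    """Pick one token at a time, maximizing a weighted combination of
    the (stubbed) attack and readability objectives."""
    tokens = []
    for _ in range(length):
        best_tok, best_val = None, -math.inf
        for tok in VOCAB:
            cand = tokens + [tok]
            val = weight * attack_score(cand) + (1 - weight) * readability_score(cand)
            if val > best_val:
                best_tok, best_val = tok, val
        tokens.append(best_tok)
    return " ".join(tokens)

if __name__ == "__main__":
    print(greedy_adversarial_suffix())
```

In a real setting the two stubs would query the target model, and the weight between the objectives would control the trade-off between attack strength and fluency.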