This research addresses the growing security threats surrounding Large Language Models (LLMs). Traditional jailbreak attacks, designed to probe the security defenses of LLMs, state their malicious intent explicitly and are therefore easily recognized and blocked by LLMs. To overcome this, the researchers propose an indirect jailbreak attack approach, ‘Puzzler’, which bypasses the LLM’s defense strategies and elicits a malicious response by implicitly providing the LLM with clues about the original malicious query rather than stating it directly. Puzzler achieves a query success rate of 96.6% on closed-source LLMs, significantly higher than that of the baseline approaches.
Publication date: 14 Feb 2024
Project Page: https://arxiv.org/abs/2402.09091v1
Paper: https://arxiv.org/pdf/2402.09091