The research addresses the growing security threats faced by Large Language Models (LLMs). Traditional jailbreak attacks, designed to probe the security defenses of LLMs, are easily recognized and defended against because they state their malicious intent explicitly. To counter this, the researchers propose an indirect jailbreak attack approach, ‘Puzzler’, which bypasses the LLM’s defense strategies and elicits a malicious response by implicitly providing the LLM with clues about the original malicious query. Puzzler achieves a query success rate of 96.6% on closed-source LLMs, significantly higher than that of the baselines.


Publication date: 14 Feb 2024
Project Page: https://arxiv.org/abs/2402.09091v1
Paper: https://arxiv.org/pdf/2402.09091