This paper introduces RIPPLE, an optimization-based method that exploits subconsciousness and echopraxia to jailbreak Large Language Models (LLMs). Owing to their strong reasoning and comprehension abilities, LLMs are deployed widely across sectors; however, they remain vulnerable to jailbreaking prompts that bypass built-in safety measures and elicit violent or harmful content. RIPPLE exposes this weakness by automatically generating diverse, efficient, and potent jailbreaking prompts. In evaluations, RIPPLE achieves an average Attack Success Rate of 91.5%, significantly outperforming existing methods.

Publication date: 8 Feb 2024
Project Page: https://github.com/SolidShen/RIPPLE_official/tree/official
Paper: https://arxiv.org/pdf/2402.05467