The research by Zhan et al. examines the vulnerability of large language models (LLMs), particularly GPT-4, to fine-tuning attacks. The researchers demonstrate that fine-tuning can strip away RLHF (reinforcement learning from human feedback) protections, a common method for reducing harmful outputs. Despite the expectation that GPT-4 would be more resistant to such attacks than weaker models, the study shows that its protections can be removed with a 95% success rate using as few as 340 training examples. The findings underscore the need for further research on safeguards for LLMs.
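As a rough illustration of what such an attack looks like in practice, the sketch below submits a fine-tuning job through the OpenAI Python SDK. This is a minimal, hypothetical example, not the authors' procedure: the file name `examples.jsonl`, the model identifier, and the assumption of fine-tuning API access are all placeholders; the paper describes the actual training data and setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-format training examples.
# Each line has the form:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
# (The contents of this file are what determine the fine-tuned behavior;
#  "examples.jsonl" here is purely illustrative.)
training_file = client.files.create(
    file=open("examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job against the uploaded file.
# The model name is a placeholder; fine-tuning access for GPT-4-class
# models is gated and may differ from what is shown here.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4-0613",
)

print(job.id, job.status)
```

The point of the paper is that a surprisingly small file at this step, on the order of a few hundred examples, is enough to undo the RLHF safety behavior of the resulting model.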


Publication date: 9 Nov 2023
Project Page: https://arxiv.org/abs/2311.05553v1
Paper: https://arxiv.org/pdf/2311.05553