This paper examines the novel human activity of attacking large language models (LLMs) to make them produce abnormal outputs, a practice known as 'red teaming'. Drawing on interviews with practitioners from a range of backgrounds, the study offers a comprehensive view of how and why people carry out such attacks, exploring their motivations, the strategies and techniques they deploy, and the role of the community in this adversarial activity. The paper contributes a grounded theory of this practice, shedding light on an emerging field.

Publication date: 10 Nov 2023
Project Page: https://arxiv.org/abs/2311.06237v1
Paper: https://arxiv.org/pdf/2311.06237