This research investigates the use of large language models (LLMs) in software penetration testing (pentesting) to identify security vulnerabilities in source code. The authors hypothesize that an LLM-based AI agent can be improved over time for a specific security task as human operators interact with it. The study evaluates different versions of the AI agent on the OWASP Benchmark Project version 1.2, using Google's Gemini Pro and OpenAI's GPT-3.5 Turbo and GPT-4 Turbo as the underlying models. Preliminary results show that LLMs are a viable foundation for building a software-pentesting AI agent that improves through repeated use and prompt engineering.
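To make the setup concrete, here is a minimal sketch of what such an LLM-based pentesting agent could look like: it asks a chat model whether a given source file contains a specific vulnerability class and compares the verdict against a benchmark label. The OpenAI client call is a real API, but the model choice, prompt wording, and the `assess_test_case` / `score` helpers are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an LLM-based pentesting check in the spirit of the paper:
# query a chat model about a specific CWE in a source file, then score the
# verdicts against the benchmark's ground-truth labels.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a security analyst. Answer strictly 'VULNERABLE' or 'NOT VULNERABLE'."
)

def assess_test_case(source_code: str, cwe_id: int, model: str = "gpt-4-turbo") -> bool:
    """Return True if the model judges the code vulnerable to the given CWE."""
    user_prompt = (
        f"Does the following Java code contain an exploitable instance of CWE-{cwe_id}?\n\n"
        f"{source_code}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("VULNERABLE")

def score(predictions: dict[str, bool], ground_truth: dict[str, bool]) -> float:
    """Fraction of benchmark test cases where the agent's verdict matches the label."""
    correct = sum(predictions[case] == ground_truth[case] for case in ground_truth)
    return correct / len(ground_truth)
```

In the paper's framing, iterating on the system prompt (and on how results are parsed) between benchmark runs is what drives the agent's improvement over time; the sketch above is the kind of loop that prompt engineering would refine.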


Publication date: 30 Jan 2024
Project Page: https://arxiv.org/abs/2401.17459v1
Paper: https://arxiv.org/pdf/2401.17459