A well-known security researcher spent $1,500 of his own money to systematically test more than a dozen leading large language models on their ability to autonomously complete a real-world penetration testing task. The results showed a stark divide in current AI capabilities for independent security research: most models scored zero, while OpenAI's GPT-5.5 was the clear standout with a 70% success rate.
For this test, the researcher created a fake book review application called "BookNook" to serve as a target range. The models were challenged to autonomously discover and exploit a hidden security vulnerability within this app, operating under a strict budget of no more than $10 and a two-hour time limit per attempt.
GPT-5.5 led the pack decisively among the nine models that completed ten full test rounds, achieving a success rate of 7 out of 10. DeepSeek V4 Pro and two versions of the Claude model managed some successes, while the remaining five models failed to score any points at all.
The findings are directly relevant to both AI capability assessment and corporate security. On one hand, the autonomous vulnerability discovery demonstrated by GPT-5.5 signals that AI-assisted security testing is nearing practical utility. On the other hand, the failure of most models, whether from safety refusals, flawed reasoning, or API instability, indicates that the field is still some distance from large-scale, reliable application.
Test Methodology: Realistic Vulnerabilities with Tight Budgets
The researcher, who provides security services for various applications and websites professionally, built the test environment to replicate a common type of vulnerability he frequently encounters. The setup consisted of a front-end React Native app built with Expo and a Python back-end, simulating the "BookNook" review platform. The objective was clear: find a hidden "flag" within a user's private book review.
Each test run had a hard $10 budget cap and a two-hour time limit. Most models were driven via the pi framework with the pi-goal-x extension to ensure persistence, while the Claude models were run through Claude Code in its non-interactive -p mode. All models ran in a high-reasoning mode with the temperature set to 0.7. The researcher noted one key prerequisite: his OpenAI account had pre-approval for safety research, which prevented the GPT models from refusing the task.
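The two caps described above can be pictured as a small guard that the harness checks between model calls. The sketch below is only an illustration under stated assumptions: the article does not show the researcher's harness, and the class name `RunBudget`, its fields, and the per-million-token prices in the usage lines are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunBudget:
    """Hypothetical per-run guard: stop at $10 spent or two hours elapsed."""

    max_cost_usd: float = 10.0            # budget cap from the article
    max_seconds: float = 2 * 60 * 60      # two-hour limit from the article
    spent_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_m_input: float, usd_per_m_output: float) -> None:
        """Add the cost of one model call; prices are per million tokens."""
        self.spent_usd += (input_tokens * usd_per_m_input
                           + output_tokens * usd_per_m_output) / 1_000_000

    def stop_reason(self, now: Optional[float] = None) -> Optional[str]:
        """Return 'budget' or 'timeout' once a cap is hit, else None."""
        now = time.monotonic() if now is None else now
        if self.spent_usd >= self.max_cost_usd:
            return "budget"
        if now - self.started_at >= self.max_seconds:
            return "timeout"
        return None


# Usage with made-up prices ($5 per million input tokens, $15 per million output).
budget = RunBudget()
budget.charge(input_tokens=200_000, output_tokens=20_000,
              usd_per_m_input=5.0, usd_per_m_output=15.0)
print(budget.spent_usd, budget.stop_reason())
```

Checking the guard after every call means a run can overshoot the cap by at most one call's cost, which is one plausible reading of a "hard" limit.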
The initial plan was to run ten full rounds per model, but the project was scaled back as costs quickly soared to $1,500. The researcher noted that roughly half the total cost came from trial runs and failed attempts not included in the final statistics, clarifying that this was more of a personal interest project than a strict scientific experiment.
Report Card: GPT-5.5 Leads, Chinese Models Show Mixed Results
Among the models that completed ten full test rounds, performance was sharply polarized.
GPT-5.5 topped the list with its 7/10 success rate (95% confidence interval: 40% to 89%). Its average run cost was $6.62, with a cost per success of $9.46, using a median of about 260,000 tokens. The researcher observed that this model consistently and quickly focused on Firebase after decompressing the APK file, avoiding wasted effort on the API or React Native app layer.
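The reported figures can be sanity-checked with a few lines of arithmetic. The article does not say how the confidence interval was computed; the sketch below assumes a Wilson score interval, which happens to reproduce the quoted 40% to 89% range for 7 successes in 10 runs. It also assumes that cost per success is total spend divided by the number of successes. The function names are illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


def cost_per_success(avg_run_cost: float, runs: int, successes: int) -> float:
    """Total spend across all runs divided by the number of successful runs."""
    if successes == 0:
        return math.inf
    return avg_run_cost * runs / successes


low, high = wilson_interval(7, 10)
print(f"GPT-5.5 interval: {low:.0%} to {high:.0%}")             # 40% to 89%
print(f"GPT-5.5 cost per success: ${cost_per_success(6.62, 10, 7):.2f}")  # $9.46
```

With only ten runs per model the interval is wide, so the ranking among the lower-scoring models should be read with caution. Applying the same formula to DeepSeek V4 Pro's rounded $0.19 average gives roughly $0.63 rather than the reported $0.62, presumably because the published average is itself rounded.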
DeepSeek V4 Pro placed second with a 3/10 success rate, but its cost efficiency was highly competitive. The average run cost was just $0.19, with a cost per success of only $0.62. However, five of its ten runs never reached Firebase, getting stuck at the API level. The other five runs recognized Firebase as a target, but two of them incorrectly tried to apply Firebase authentication to the API instead of directly interacting with Firebase.
Both Claude Sonnet 4.6 and Claude Opus 4.8 scored 2/10, but their paths differed. Sonnet 4.6 had five runs that were on the right track but were halted by the budget limit. Opus 4.8 came close to the answer multiple times but was stopped when its safety guardrails triggered, not at the outset of the task but later in the session.
The other five models that completed ten rounds (DeepSeek V4 Flash, Gemini 3.1 Pro Preview, Gemini 3.5 Flash, MiniMax M2.7, and Step 3.7 Flash) all scored 0/10. Gemini 3.1 Pro Preview failed most directly, refusing the task almost immediately on safety grounds. Its median token usage was a mere 9,000, far below the 100,000+ of other models, indicating it never substantively engaged with the task. Gemini 3.5 Flash also showed numerous early refusals, with only two runs making a real attempt. Step 3.7 Flash exhibited a different failure pattern: it meticulously documented the API but then incorrectly claimed to have found the vulnerability without actually succeeding.
Models with Incomplete Testing: Kimi Surprises, Qwen Disappoints
Due to cost constraints, the researcher only ran partial tests on six other models.
Kimi K2.6 emerged as a surprising highlight with a perfect 1/1 record. Its completion speed and token usage were comparable to successful runs by DeepSeek V4 Pro, with a cost per success of just $1.02. Further testing was not possible because Kimi's API does not support concurrent proxy calls, has a low tokens-per-minute quota, and counts cached tokens against that quota.
Qwen 3.7 Max proved disappointing. In local testing before the formal evaluation, it was the only non-GPT model that completed the task. However, it failed in all six formal runs, often fixating on a search for IDOR vulnerabilities at the API level. More strikingly, its median token usage per run was 7.32 million, at a cost of $8.71 per attempt.
GLM 5.1 barely made the list with a 1/4 score, but the researcher's review was extremely negative, stating he would "never use GLM again." The reasons were frequent API outages causing mid-run failures and exorbitant token consumption (median 1.25 million), leading to high costs. Grok Build 0.1 failed in all six runs, with some runs producing false positives by misidentifying normal user behavior (reading one's own reviews) as an IDOR vulnerability. MiniMax M3 and MiniMax M2.7 showed similar behavior: after discovering Firebase, they would abandon the approach upon encountering the first error and instead try to use Firebase credentials to attack the API.
Additionally, the researcher tested Owl Alpha, primarily because it was freely available on OpenRouter. It failed in all ten runs, with one attempt sending over 200 requests to the API without ever finding the vulnerability.
Key Takeaways: Infrastructure Woes and Spiraling Costs
The researcher concluded with several practical lessons for others attempting similar tests.
On infrastructure, he used Modal as the runtime environment because test logs were too large for local storage. This turned out to be a mistake: about 10% of runs were preempted and terminated by Modal, resulting in complete loss of their data. He recommends AWS as an alternative. For model access, he suggests that a unified platform like OpenRouter would be far less cumbersome than dealing with each provider's distinct API.
An interesting behavioral observation was an apparent cultural difference in how models approached the task: Chinese models appeared more "comfortable" directly attacking a database, while other models often hesitated briefly, producing messages like "this would affect the production database, so I'm not going to do it."
On cost control, the researcher admitted spending far more than anticipated and joked that the money could have been used to launch his own real application. He explicitly stated that MiniMax and GLM are now off his future testing list due to poor API stability and high costs.