What truly determines victory or defeat in the AI Agent race? The answer from 1,781 real-world runs is not the model.
AI evaluation platform Braintrust pulled 1,781 complete execution traces of production Agents from Hugging Face, covering six mainstream models across six major task categories, and then scored each trace using GPT-4o. The first conclusion is striking: holding the model fixed and changing only the "agent framework" (harness) that wraps it can lift the success rate from 12% to 92%, a swing of about 80 percentage points.
Regression analysis quantified this intuition into precise figures. After controlling for the benchmark test and the model variables, the agent framework can explain about 5.3% of the variance in success rate, while the model explains only 0.7%. The influence of switching the agent framework is over 7 times greater than switching the model. More crucially, the cost of switching agent frameworks is almost zero—Token consumption for different frameworks on the same task is largely comparable.
For AI startups, this data rewrites the rules of competition. As commoditization of the model layer accelerates and the performance gap among the leading models on programming tasks narrows to single-digit percentage points, "which model to choose" is no longer the decisive variable. Two capabilities are replacing it as the core variables separating winners from losers: which tooling is used to deploy the model to production, and how low the reasoning cost per successful task can be held.
The Agent Framework: The Greatest Lever for an 81-Point Success Rate Swing
Braintrust tested five agent frameworks with completely different architectures. claude_code is Anthropic's native Agent loop, allowing the model to autonomously manage tool calls and context in an XML-like format. smolagents_code allows the model to write Python code to chain operations. tool_calling is the standard structured JSON function call, one tool at a time. tool_calling_with_shortlisting pre-filters available tools each round on top of the former. openai_solo is the thinnest OpenAI wrapper.
The data from switching frameworks with the same model and task is startling. For Claude on the SWE-bench programming task, the success rate was 100% under claude_code, but plummeted to 14% when switched to tool_calling. For Kimi on the AppWorld multi-application orchestration task, it was 92% under smolagents_code, but only 12% under tool_calling. For GPT-4.1 on the telecom customer service task, it was 51% under smolagents_code, but only 18% under claude_code.
Behind each success rate cliff is the same model. Minor differences in agent framework design—whether to let the model autonomously manage context or constrain each step with a fixed template; whether to allow the model to write code to chain tool calls or only call one tool at a time—stretch the success rate gap to nearly an order of magnitude.
The failure of tool_calling_with_shortlisting is particularly noteworthy. This framework attempted to improve efficiency by "narrowing the available tool list each round," but the data shows it actually hindered performance—narrowing options may cut out useful tools or introduce routing errors. "More precise control" does not automatically equal "better results."
The Productivity Ledger for Open-Source Models: $0.73 Per Successful Programming Task
On the SWE-bench programming benchmark, open-source models' scores are in the same tier as the top closed-source models. DeepSeek V3.2 achieved a 96% success rate, Kimi K2.5 reached 94%, Claude Opus 4.5 was 100%, GPT-5.2 was 93%, and Gemini 3 Pro was 87%.
But the real watershed is on the cost side. Braintrust priced each run based on LiteLLM's actual Token rates and then calculated the cost per successful task by factoring in the success rate.
On SWE-bench, claude_code paired with Kimi K2.5 cost only $0.73 per success, and with DeepSeek V3.2 it was $1.27. The closed-source Claude Opus cost $4.28, and Gemini 3 Pro cost $4.97. On the AppWorld task, the gap widened further: Kimi with smolagents_code cost only $0.40 per success, while Claude with claude_code cost as high as $84.33—a difference of over 200 times.
Open-source models also have a structural cost advantage that closed-source models lack: self-hosting. There is no per-call API fee and no exposure to API price hikes. For companies deploying Agents at scale, this forms a cost moat that short-term Token price cuts cannot erase.
"Cheapest Tokens" Does Not Equal "Highest Efficiency"
GPT-4.1 is the textbook cautionary example in this analysis.
Its Token bill looks astonishingly good on paper: 10 to 100 times cheaper than other models on equivalent tasks. But when Braintrust examined each execution trace, it found that GPT-4.1's failure rate on hardcore tasks like SWE-bench and AppWorld ranged from 53% to 90%. It is cheap because it fails fast.
A cost metric without a success rate is not an efficiency metric; it only tells you how few Tokens it takes to complete one failure. The right measure of efficiency is cost per success: the cost per task divided by the success rate. This metric completely reshapes the configuration rankings.
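The cost-per-success definition above can be sketched in a few lines. This is a minimal illustration, not Braintrust's implementation: the function name, the configuration names, and every number below are hypothetical and chosen only to show how the ranking can flip.

```python
def cost_per_success(cost_per_task: float, success_rate: float) -> float:
    """Expected spend per successful task: cost per attempt / success rate."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate == 0.0:
        # A configuration that never succeeds has no finite cost per success.
        return float("inf")
    return cost_per_task / success_rate


# Hypothetical configurations: (cost per task in USD, success rate).
configs = {
    "cheap_but_flaky": (0.05, 0.10),
    "pricier_but_reliable": (0.30, 0.95),
}

# Rank by cost per success rather than by raw cost per task.
ranked = sorted(configs, key=lambda name: cost_per_success(*configs[name]))

for name in ranked:
    print(f"{name}: ${cost_per_success(*configs[name]):.2f} per success")
```

In this made-up example the configuration that is six times cheaper per task ends up more expensive per success, which is the effect the GPT-4.1 numbers describe.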
On programming tasks, open-source models occupy the optimal position on the cost-efficiency frontier. On conversational customer service tasks, the situation is completely reversed—GPT-4.1 leads significantly with a cost per success of $0.02 to $0.03, compared to Claude's $1.95, and the open-source models did not even run these dialogue tests.
For AI startups, there is no single "cheapest model" that works for everything. Use self-hosted DeepSeek or Kimi for coding tasks, and GPT-4.1 for customer service dialogues—different task families correspond to completely different cost-optimal solutions.
No Universal Model, Only Task-Specific Optimal Solutions
Six benchmark tests, four different champions.
Claude won SWE-bench (programming), BrowseComp+ (web research), and TAU2 retail/telecom customer service. Gemini took first place on TAU2 airline customer service with a 100% success rate. DeepSeek and Kimi led significantly on the AppWorld multi-application orchestration task. There is no single model that dominates all scenarios.
Even within the same agent framework, the performance of different models varies widely. In the AppWorld task, Claude had only a 26% success rate under its native claude_code, far below DeepSeek's 80% and Kimi's 78% under the same framework. The fit between the model and the task, and the synergy between the model and the agent framework, are far better predictors of final performance than the absolute scale of the model's parameters.
Braintrust also found that high average success rates can mask fatal local collapses. Some configurations score well overall but fail completely on a specific task type. Plotting the standard deviation of each configuration's cross-task success rates clearly separates high-variance configurations from reliable ones. Claude Opus under claude_code, while leading overall at 73% versus Gemini's 71%, had a higher cross-task standard deviation (0.27 vs. 0.24), meaning its performance fluctuated more from one test suite to another.
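A small sketch of the cross-task spread check described above. The success rates are invented for illustration, the 0.2 flagging threshold is an arbitrary assumption, and the source does not say whether Braintrust used the population or the sample standard deviation; this sketch assumes the population form.

```python
from statistics import mean, pstdev

# Hypothetical per-task success rates for two configurations.
# Both have the same average, but config_a collapses on one task.
rates = {
    "config_a": [0.95, 0.90, 0.20, 0.85],
    "config_b": [0.75, 0.70, 0.72, 0.73],
}


def summarize(success_rates):
    """Return the average, cross-task spread, and worst task for one config."""
    return {
        "mean": mean(success_rates),
        "std": pstdev(success_rates),  # population standard deviation (assumed)
        "worst": min(success_rates),
    }


summary = {name: summarize(r) for name, r in rates.items()}

# Assumed threshold: flag any configuration whose spread exceeds 0.2.
flagged = [name for name, s in summary.items() if s["std"] > 0.2]

for name, s in summary.items():
    print(f"{name}: mean={s['mean']:.3f} std={s['std']:.3f} worst={s['worst']:.2f}")
```

Reporting the worst task alongside the standard deviation makes a local collapse visible even when the averages are identical.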
For a startup's procurement strategy, this means one should not bet on a single model. A sensible path is to build a differentiated model-agent framework combination matrix by task type, allowing each category of task to run on the most suitable, cost-optimal configuration.
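One way to picture such a combination matrix is a routing table keyed by task type. The sketch below uses the four cost-per-success figures quoted earlier in this article; the `Config` type and `build_routing_table` function are hypothetical names, and a real selection would presumably also enforce a minimum success rate, which this sketch omits.

```python
from typing import Dict, List, NamedTuple


class Config(NamedTuple):
    task: str
    model: str
    harness: str
    cost_per_success: float  # USD, as quoted in this article


CONFIGS: List[Config] = [
    Config("SWE-bench", "Kimi K2.5", "claude_code", 0.73),
    Config("SWE-bench", "DeepSeek V3.2", "claude_code", 1.27),
    Config("AppWorld", "Kimi", "smolagents_code", 0.40),
    Config("AppWorld", "Claude", "claude_code", 84.33),
]


def build_routing_table(configs: List[Config]) -> Dict[str, Config]:
    """For each task type, keep the configuration with the lowest cost per success."""
    best: Dict[str, Config] = {}
    for cfg in configs:
        current = best.get(cfg.task)
        if current is None or cfg.cost_per_success < current.cost_per_success:
            best[cfg.task] = cfg
    return best


routing = build_routing_table(CONFIGS)

for task, cfg in routing.items():
    print(f"{task}: {cfg.model} + {cfg.harness} (${cfg.cost_per_success:.2f})")
```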
Two Types of Failure, Two Completely Opposite Engineering Strategies
Braintrust also revealed a pattern with direct implications for engineering deployment: the behavior of Agents upon failure is completely opposite in direction for coding tasks versus dialogue tasks.
In hardcore programming tasks like SWE-bench and AppWorld, failure comes with "thrashing": the Agent makes more LLM calls, consumes more Tokens, and runs longer than its successful counterparts. The same pattern shows up in BrowseComp+ web research, where failed runs consumed 2.3 times the Tokens of successful runs. The median Token usage for failed runs under the claude_code framework was about 0.8M, with the tail even exceeding 3.7M.
In TAU2 customer service dialogue tasks, the pattern is completely reversed. Failed Agents make fewer calls, use fewer Tokens, and finish faster—no thrashing or struggling, they confidently deliver a wrong answer and then stop.
These two diametrically opposed failure modes mean that monitoring strategies in production cannot use one rule to cover all scenarios. Coding tasks require upper-limit alerts on Token usage—to stop losses in time when an Agent falls into an infinite loop or struggles repeatedly. Dialogue tasks require lower-limit alerts—to capture anomalies where an error is delivered "too smoothly." A one-size-fits-all single threshold would help one type of task while destroying another.
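The two alert directions can be sketched as per-task-type Token bounds. Everything here is illustrative: the `TokenBounds` type, the `check_run` function, the task-type labels, and the threshold values are assumptions, and in practice the limits would be tuned from each task's own distribution of successful runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenBounds:
    lower: Optional[int] = None  # alert if a run finishes below this
    upper: Optional[int] = None  # alert if a run exceeds this


# Hypothetical thresholds per task type.
BOUNDS = {
    "coding": TokenBounds(upper=2_000_000),  # catch thrashing and loops
    "dialogue": TokenBounds(lower=5_000),    # catch answers delivered too smoothly
}


def check_run(task_type: str, tokens_used: int) -> Optional[str]:
    """Return an alert label for a run, or None if it looks normal."""
    bounds = BOUNDS[task_type]
    if bounds.upper is not None and tokens_used > bounds.upper:
        return "over_budget"
    if bounds.lower is not None and tokens_used < bounds.lower:
        return "suspiciously_short"
    return None


print(check_run("coding", 3_700_000))  # over_budget
print(check_run("dialogue", 1_200))    # suspiciously_short
print(check_run("dialogue", 50_000))   # None
```

Keeping the bounds in a per-task table, rather than one global threshold, is what lets the same check stop a runaway coding Agent without silencing the short-run signal on dialogue tasks.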
Reasoning Cost Management and Deployment Efficiency: The True Barriers for Startups
This data from Braintrust tells a more fundamental story than "whose model scores higher."
The success rate gap among the leading models on programming tasks has narrowed to single-digit percentage points, and on those tasks the cost per success for open-source models is now lower than for closed-source ones. The model layer is commoditizing faster than most people anticipated. A business story built on "which latest model we have access to" rests on a moat that is rapidly evaporating.
What is truly starting to create separation are three capabilities beyond the model: matching the optimal agent framework for each task type, measuring efficiency by cost per success rather than cost per task, and establishing differentiated failure monitoring systems for different task types.
All three point to the same theme: fine-grained management of reasoning costs and systematic optimization of deployment efficiency. In the AI Agent race, the more critical question is not "how much better is your model than mine" but this: to what level can you control the cost per success for a given task? Can you deliver the same success rate at a cost that a customer's self-built solution cannot match?
For B2B AI startups, the focus of product definition needs to shift from "which model we have integrated" to "in what task scenarios, with what cost structure, and at what success rate do we deliver." The narrative is no longer about comparing models—it's about comparing costs, efficiency, and engineering.