Downstream AI Model Companies Generate Massive Profits While NVIDIA and TSMC Retain Upside Potential

Deep News · 12:17

The AI value chain is undergoing a structural revaluation. Chip manufacturers, which previously captured the majority of profits, now face rapid catch-up from downstream model developers, yet the profit potential upstream remains far from its ceiling.

Analysis from SemiAnalysis indicates that Anthropic's annualized revenue surged from $9 billion to over $44 billion within months, with its inference gross margin rising from 38% to over 70%. NVIDIA's current pricing framework remains cost-oriented and has yet to reflect the changing economics of inference workloads; once adjusted, NVIDIA's system pricing has over 40% upside. Taiwan Semiconductor Manufacturing Company's N3 process capacity is also at the core of this value redistribution.

This outlook is underpinned by a structural mismatch between supply and demand: N3 process utilization is projected to exceed 100% in the second half of 2026, DRAM fabs are already operating at over 90% capacity, while demand for tokens from leading-edge models continues to expand at a compound rate. In this context, a window has opened for NVIDIA to implement differentiated pricing through its SOCAMM memory modules.

The AI value pool is shifting: the infrastructure layer is yielding to the model layer. From 2023 to early 2025, the vast majority of profits in the AI value chain accumulated at the infrastructure layer. NVIDIA led the surge, followed by power-asset companies Vistra and GE Vernova, which rose 265% and 146% respectively in 2024. Storage manufacturers including SanDisk, Western Digital, Seagate, and Micron all posted gains exceeding 200% in 2025.

The flip side of this dynamic was the persistently low margins endured by model developers and inference service providers. At the time, the practical utility of AI was limited, and market skepticism about AI investment returns was widespread.

A turning point arrived in December 2025. As Agentic AI became genuinely practical, the economic logic of AI was fundamentally rewritten. SemiAnalysis disclosed that its own annualized token consumption expenditure is approaching 30% of its employee compensation costs, with token usage per employee per month nearing 5 billion, over five times the internal usage rate at Meta. Tasks that previously required junior analysts several hours—including financial modeling, data visualization, and earnings analysis—can now be completed for just a few dollars worth of tokens.

SemiAnalysis estimates that its team's peak annualized spending on Anthropic's Claude reached $10.95 million, yet the competitive advantage gained far outweighed this cost. Anthropic benefited immediately: its Annual Recurring Revenue (ARR) skyrocketed from $9 billion to over $44 billion, and its inference gross margin jumped from 38% to over 70%.

Another core factor driving the margin expansion for model developers is the significant decline in token production costs. From a hardware perspective, on a standard inference task with 8K input and 1K output, a fully software-optimized B300 system can generate approximately 14,000 tokens per second per GPU, compared to only about 1,000 tokens for an unoptimized version—representing a 14x throughput improvement from software optimization alone on the same hardware. When combined with hardware upgrades, the optimally configured GB300 NVL72 shows a roughly 17x improvement in FP8 throughput compared to the H100. Switching to FP4 precision, which the H100 does not natively support, widens the gap to 32x, while the total cost of ownership per GPU for the GB300 is only about 70% higher.
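Taken together, the figures above imply a steep drop in cost per token. A minimal sanity check, using only the multiples stated in the article (the per-token cost ratio is derived, not a quoted figure):

```python
# Back-of-envelope check of the throughput-vs-TCO claims above.
# All input figures come from the article; cost-per-token ratios are derived.

gb300_fp8_speedup = 17.0   # GB300 NVL72 ≈ 17x H100 at FP8 (article figure)
gb300_fp4_speedup = 32.0   # ≈ 32x H100 when using FP4 (article figure)
tco_ratio = 1.70           # GB300 TCO per GPU ≈ 70% higher than H100

# Cost per token scales roughly with TCO divided by throughput.
fp8_cost_vs_h100 = tco_ratio / gb300_fp8_speedup   # ≈ 0.10
fp4_cost_vs_h100 = tco_ratio / gb300_fp4_speedup   # ≈ 0.053

print(f"FP8: ~{fp8_cost_vs_h100:.0%} of H100 cost per token")
print(f"FP4: ~{fp4_cost_vs_h100:.0%} of H100 cost per token")
```

In other words, at FP4 the GB300 produces a token for roughly 5% of the H100's cost, which is the mechanism behind the margin expansion described above.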

From a pricing structure perspective, Agentic workloads feature extremely high input-to-output ratios (approximately 300:1 for Claude Code use cases) and very high cache hit rates (over 90%), causing the vast majority of tokens to fall into the lowest billing tier. SemiAnalysis estimates the true blended cost for Opus 4.7 on agentic tasks is approximately $0.99 per million tokens, significantly lower than the listed price of $5 per million input tokens.
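The blended-cost figure can be roughly reconstructed from the ratios above. This sketch assumes Anthropic's published Opus-tier list pricing ($5/M input, $25/M output, cache reads at 10% of the input rate); the exact cache hit rate is the main free parameter:

```python
# Rough reconstruction of the ~$0.99/M blended-token cost estimate above.
# Pricing assumptions (per million tokens) follow Anthropic's published
# Opus-tier rates; the 90% cache hit rate is the article's figure.
INPUT, OUTPUT, CACHE_READ = 5.00, 25.00, 0.50

input_tokens = 300.0     # agentic input:output ratio ≈ 300:1
output_tokens = 1.0
cache_hit_rate = 0.90    # >90% of input tokens served from cache

cached = input_tokens * cache_hit_rate
uncached = input_tokens - cached
cost = cached * CACHE_READ + uncached * INPUT + output_tokens * OUTPUT
blended = cost / (input_tokens + output_tokens)   # $/M tokens
print(f"blended cost ≈ ${blended:.2f} per million tokens")
```

This lands at about $1.03 per million tokens; a cache hit rate slightly above 90% brings it to the article's ~$0.99, so the estimate is internally consistent.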

Even facing substantial price cuts from Anthropic on the Opus series—Opus 4.5 was priced two-thirds lower than its predecessor—SemiAnalysis believes Anthropic's unit gross margin actually improved. This is due to production costs falling further with hardware upgrades, coupled with a large-scale user migration from Sonnet to Opus, which pushed up the blended Average Selling Price (ASP).

More strategically, Anthropic retains pricing power on its high-end product lines. Opus Fast is priced at six times the rate of standard Opus, while the announced Mythos is priced at $25/$125 per million tokens, five times the standard Opus rate. SemiAnalysis explicitly stated that if Anthropic were to offer a Mythos Fast tier at $150/$750 per million tokens, its team would still purchase it—because the value of productivity gains far exceeds the cost.

Regarding the sustainability of high margins for frontier models, the most common objection stems from competitive pressure. SemiAnalysis provides two counterarguments. First, the capability gap between frontier closed-source models and open-source models remains significant and is unlikely to close in the short term. Low-cost open-source models, exemplified by Kimi K2.6 ($0.95/$4 per million tokens), exert almost no substantive pressure on Opus pricing. Second, compute constraints mean no single frontier lab can serve the entire market alone. Anthropic is already managing demand by restricting Claude Code to subscriptions over $100 per month and limiting third-party access. Token demand is expected to outstrip supply for the foreseeable future, meaning labs capable of delivering truly frontier-quality models can price based on the economic value created by tokens rather than on competitive costs.

A notable structural question is why NVIDIA has not yet made substantive adjustments to its pricing framework amidst this profound reshaping of the AI value chain. NVIDIA's current pricing remains primarily anchored to cost, reflecting an old paradigm where demand value depreciates over time—an assumption that no longer holds. Current demand growth is not linear but expanding at a compound rate, driven by the explosion of agentic workloads and sustained increases in token consumption per workflow.

SemiAnalysis suggests that NVIDIA's pricing restraint may be partly due to regulatory concerns. NVIDIA's dominance in GPUs, interconnects, and software stacks is attracting increasing antitrust scrutiny. With downstream AI labs also generating substantial profits, aggressive price hikes could exacerbate regulatory risks and potentially accelerate customer diversification to alternative platforms like TPU and Trainium.

In this sense, NVIDIA's behavior resembles that of TSMC. TSMC has historically refrained from pushing pricing to the limits of scarcity premium, even when operating at full capacity and acting as a bottleneck for advanced process supply, instead prioritizing long-term ecosystem stability and customer relationships. This logic can be characterized as an "AI central bank" strategy—supporting downstream ecosystem expansion by conceding some profits, rather than maximizing short-term extraction, to ensure long-term dominance in the AI era.

However, this strategy entails real opportunity costs. In a structural environment where compute demand persistently exceeds supply, controlling scarce resources without fully pricing them equates to transferring value to the midstream and downstream of the ecosystem chain. TSMC faces a similar situation with its N3 process—SemiAnalysis directly labels this a "strategic mistake," suggesting they should at least demand larger prepayment arrangements.

NVIDIA's upcoming Vera Rubin VR NVL72 system presents an opportunity to reassess the pricing framework. From a cost perspective, calculations indicate the minimum GPU rental rate at which the VR NVL72 achieves the same 15.6% project IRR (5-year term, 15% prepayment) as the GB300 NVL72 is approximately $4.92 per GPU per hour. From a value perspective, if anchored to the current GB300 rental rate of approximately $0.70 per PFLOP per hour of FP8 dense compute, the theoretical maximum price for the VR NVL72 would be around $12.25 per GPU per hour—roughly 2.5 times the cost floor.

This significant spread indicates NVIDIA has ample room to increase pricing for the VR NVL72. SemiAnalysis estimates that even if NVIDIA raises system pricing by approximately 40%, sufficient profit margin would remain for Neocloud providers—even if a Neocloud raises its rental rate to over $8 per hour, the corresponding cost per PFLOP would still be below the historical trend line.

Mechanically, SOCAMM becomes the most critical pricing lever. Unlike the GB300, which integrates LPDDR5X memory directly onto the motherboard within the overall system price, the VR NVL72 utilizes pluggable SOCAMM modules, allowing NVIDIA to itemize and price memory as a separate, billable component.

SOCAMM (Small Outline Compression Attached Memory Module) is a new modular memory standard led by NVIDIA and developed in conjunction with memory manufacturers like Samsung, SK Hynix, and Micron. Based on LPDDR5X (or future LPDDR6) DRAM technology, it targets AI servers and personal AI supercomputers.

Modeling indicates that NVIDIA's contract price for SOCAMM in Q1 2026 was approximately $8 per GB, a significant increase from the previous quarter, primarily reflecting LPDDR5X supply tightness and overall DRAM price increases. Based on forecasts for mobile DRAM pricing by the end of 2026, SOCAMM pricing could exceed $13 per GB by year-end, with a full-year average of around $10 per GB being a reasonable assumption.

On this basis, SemiAnalysis argues that a 60% gross margin for NVIDIA on SOCAMM is justified. Reasons include: memory supply being universally tight, with NVIDIA having priority access to SOCAMM procurement; the VR NVL72's performance/Total Cost of Ownership (TCO) significantly outperforming contemporary competitors, leaving customers with few alternatives; and NVIDIA itself facing substantial increases in SOCAMM procurement costs, making cost pass-through to downstream customers reasonable.
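The sell prices implied by a 60% gross margin follow directly from the procurement estimates above. The per-GB costs are the article's figures; the margin convention (margin as a fraction of sell price) is the standard one but is an assumption here:

```python
# Implied SOCAMM sell price under the 60% gross-margin scenario above.
# Costs ($/GB) are the article's estimates for 2026.
def sell_price(cost_per_gb: float, gross_margin: float) -> float:
    """Price such that (price - cost) / price == gross_margin."""
    return cost_per_gb / (1.0 - gross_margin)

for cost in (8.0, 10.0, 13.0):   # Q1-2026, full-year average, year-end
    print(f"cost ${cost:.0f}/GB -> sell ${sell_price(cost, 0.60):.1f}/GB")
```

At a $10/GB full-year average cost, a 60% margin implies a $25/GB sell price, which is the scale of the memory-pricing lever described above.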

Furthermore, memory pricing does not face the same level of antitrust scrutiny as GPU pricing, granting NVIDIA greater latitude for differentiated pricing—including tiered pricing for Neoclouds versus hyperscalers. NVIDIA already charges Neoclouds approximately twice the price for networking equipment compared to hyperscalers, and the same logic could readily extend to memory pricing.

Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation to acquire or dispose of any financial products; any associated discussions, comments, or posts by the author or other users should not be considered as such either. It is for general informational purposes only and does not take into account your investment objectives, financial situation, or needs. TTM assumes no responsibility or warranty for the accuracy or completeness of the information; investors should do their own research and may seek professional advice before investing.
