Two seemingly unrelated news items landed in the final week of June.
On June 25, OpenAI unveiled its first in-house AI inference chip, Jalapeño, developed in partnership with Broadcom and taken from design to tape-out in just nine months. The world's largest GPU buyer has begun making its own chips.
On June 30, semiconductor research firm SemiAnalysis publicly announced on social media: NVIDIA's original 4-chip Rubin Ultra had been canceled just three months after its GTC 2026 unveiling, with the revised version's performance nearly halved. "This is all happening against a backdrop," the firm added, "of NVIDIA's market share being eroded."
Furthermore, as early as last year, media reports revealed that Anthropic's annualized revenue was approaching $70 billion, with its Claude Code product generating $500 million in annualized revenue within two months of launch. The computational foundation powering all this is no longer solely NVIDIA—Google TPUs handle training, Amazon Trainium handles inference, with NVIDIA GPUs relegated to a third option for "research exploration."
These three stories point to the same issue: the CUDA moat—NVIDIA's most robust and mythologized competitive barrier—is showing cracks.
From Dominance to Decline: The Erosion of NVIDIA's Indispensability
Consider a set of figures.
Silicon Analysts, working from NVIDIA and AMD financial reports and TSMC capacity allocation data, estimates that NVIDIA's share of the AI accelerator market (by revenue) has declined from a peak of 87% toward 75%.
This shows that while NVIDIA's revenue is still growing—from $15 billion to $150 billion, a tenfold increase in four years—its declining share means an increasingly large portion of the incremental market is being carved away.
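A rough back-of-the-envelope check makes the point concrete. The sketch below assumes, purely for illustration, that the 87% peak share lines up with the $15 billion revenue year and the 75% share with the $150 billion year; the estimates quoted here do not pair them explicitly, and the function names are hypothetical.

```python
def implied_market(nvidia_revenue_bn: float, nvidia_share: float) -> float:
    """Total market size implied by NVIDIA's revenue and its revenue share."""
    return nvidia_revenue_bn / nvidia_share


def rivals_revenue(nvidia_revenue_bn: float, nvidia_share: float) -> float:
    """Revenue left for everyone else in the implied market."""
    return implied_market(nvidia_revenue_bn, nvidia_share) - nvidia_revenue_bn


# Assumption: the share and revenue figures are paired as start and end points.
start = rivals_revenue(15.0, 0.87)   # roughly $2 billion
end = rivals_revenue(150.0, 0.75)    # roughly $50 billion

print(f"Rivals' revenue: ${start:.1f}B -> ${end:.1f}B")
print(f"NVIDIA grew {150.0 / 15.0:.0f}x, rivals grew {end / start:.0f}x")
```

Under these assumptions, the revenue going to everyone else grows more than twice as fast as NVIDIA's own, albeit from a much smaller base.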
No single competitor is carving away this share. The competition comes from all directions: Google TPU, Amazon Trainium, Microsoft Maia, Meta MTIA, Broadcom's custom XPUs, and the newly joined OpenAI.
Broadcom CEO Hock Tan revealed a previously undisclosed figure on the Q1 FY2026 earnings call: Broadcom's AI semiconductor revenue has reached an $8.4 billion annualized run rate, up 106% year over year, and is on track for a $40-50 billion annual run rate. The company has already signed six hyperscale customers for custom AI chips, with OpenAI being the sixth.
In other words, the world's largest cloud computing and AI companies have independently converged on the same direction: building their own chips.
The Anthropic Case Study
If market share data is cold statistics, then the Anthropic case is a living textbook on "de-NVIDIA-fication."
Anthropic is one of the world's fastest-growing AI companies. Its annualized revenue is approaching $70 billion (compared to roughly $10 billion in the same period in 2025), it serves over 300,000 enterprise customers, and its number of large clients has grown nearly sevenfold year-over-year. Claude Code generated $500 million in annualized revenue within two months of launch, which Anthropic calls "the fastest-growing product in history."
The computational foundation driving all this is a three-platform architecture Anthropic CFO Krishna Rao calls a "unique compute strategy." Notably, in this architecture, NVIDIA GPUs rank third—not as a co-equal option or a "backup," but as the smallest of the three platforms by scale.
This is not a cash-strapped small company making do with cheap alternatives. This is the world's second-largest AI company, in a production environment, using non-NVIDIA chips to power its fastest-growing product.
SemiAnalysis specifically highlighted this point in its June 30 post: "A significant portion of Claude Code's inference workload runs on Trainium, and Claude's training is completed on TPU. Just a year ago, it was hard to imagine TPU and Trainium scaling to this level while the CUDA moat was slowly eroding."
Why is Anthropic doing this? Not because TPUs and Trainium are more powerful than the H100; they may still lag in absolute performance. It's because in specific scenarios, proprietary chips offer far better cost-performance than general-purpose GPUs. Training uses TPUs because Google provided a multi-billion dollar contract and a supply commitment for millions of chips. Inference uses Trainium because AWS is Anthropic's primary cloud provider and has invested $8 billion, and because the Project Rainier supercomputing cluster runs entirely on Trainium 2, bypassing the GPU premium.
Amazon is betting big on Trainium. According to its Q1 2026 disclosure, the Trainium product line has secured over $225 billion in revenue commitments, with customers including OpenAI and Anthropic. AWS's AI revenue run rate exceeds $15 billion, with the majority of its Bedrock inference service running on Trainium.
The keyword here is not "performance," but "cost." Inference is a daily money-burning activity. Every time ChatGPT answers a question or an API returns code, a GPU is consuming electricity behind the scenes. Anthropic uses Trainium instead of GPUs for inference not to run faster, but to get more computations per dollar spent.
Three Paths of Erosion: Where the CUDA Moat is Cracking
CUDA is considered NVIDIA's most formidable moat because it built a closed "hardware-software-developer" ecosystem with 20 years of accumulation, over 4 million developers, all major ML frameworks optimized for CUDA first, and deeply binding optimization libraries like cuDNN, TensorRT, and NCCL. The switching cost is measured in years and billions of dollars.
But the AI chip competition in 2026 is no longer about "making a GPU 10% faster than the H100"—that's a frontal assault no one can win. The erosion is coming from three flanks.
Erosion Path One: In-House ASICs - Targeting the Richest Inference Slice
This is the most critical path. Its logic is not "I can do better than NVIDIA," but "I don't need all the functions of a GPU; I only need inference."
An NVIDIA H100 must handle: graphics rendering, scientific computing, AI training, AI inference, video codec... A Jalapeño does only one thing: run OpenAI's own models for inference. The former is a Swiss Army knife; the latter is an axe specialized for chopping one type of wood—for that specific task, the axe is much more effective and cheaper.
OpenAI's Jalapeño is positioned with extreme precision: it doesn't compete with NVIDIA on versatility, but excels solely in inference—the scenario consuming billions of API calls daily and burning hundreds of millions in annual costs. OpenAI's official goal is to reduce inference costs by 30-50%. At the scale of burning millions daily on inference, this translates to annual savings of hundreds of millions in pure profit.
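To see how a 30-50% reduction turns into hundreds of millions a year, here is a minimal sketch. The $2 million daily inference bill is a hypothetical placeholder (the text only says "millions daily"), and the function name is illustrative.

```python
def annual_savings(daily_cost_usd: int, reduction_pct: int) -> int:
    """Yearly savings in dollars from cutting a daily cost by reduction_pct percent."""
    return daily_cost_usd * 365 * reduction_pct // 100


# Hypothetical daily inference spend; not a figure published by OpenAI.
DAILY_COST_USD = 2_000_000

low = annual_savings(DAILY_COST_USD, 30)
high = annual_savings(DAILY_COST_USD, 50)

print(f"Annual savings: ${low:,} to ${high:,}")
# prints "Annual savings: $219,000,000 to $365,000,000"
```

The result scales linearly with the assumed daily bill, so any spend in the low millions per day lands in the "hundreds of millions" range.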
And OpenAI is not the first. Microsoft Maia 200 (launched Jan 2026), Google TPU Ironwood (7th gen, first designed specifically for inference), Amazon Trainium 3—all four major cloud providers have unveiled their own inference chips. Add Meta's MTIA and Apple's custom chips, and among the world's top seven tech companies, only one remains in the "buy-only" camp—and it's on the way too.
Erosion Path Two: AMD - From Existence to Credible Alternative
AMD's AI GPU revenue has skyrocketed from under $1 billion in 2022 to an estimated $15 billion or more in 2026, growth of more than 15-fold in four years.
The key turning point behind this is the MI400 series. Based on the CDNA5 architecture with 432GB HBM4 memory and 19.6 TB/s bandwidth, it is expected to enter mass production in H2 2026. S&P Global predicts the MI400 series alone will contribute $7.2 billion in revenue, accounting for 25% of AMD's Data Center business.
More important are the signals from the customer side. Meta has signed a procurement commitment with AMD for up to 6 gigawatts. It is the largest AI chip order in AMD's history and a clear signal that hyperscalers are pursuing multi-vendor strategies.
AMD's limitations are equally apparent: it receives only about 11% of TSMC's CoWoS capacity allocation, while NVIDIA commands over 60%. This capacity ceiling means AMD cannot challenge NVIDIA on volume in the short term. However, its position as a "credible second supplier" has already chipped away at the narrative of "NVIDIA-or-nothing."
Erosion Path Three: Software Layer Decoupling - Triton, JAX, and a "CUDA-Free" Future
This is the most easily overlooked but potentially most dangerous long-term path.
CUDA's lock-in relies on a simple fact: AI researchers write code in PyTorch, and PyTorch's lower layers run on CUDA. But what if PyTorch's lower layers no longer depend on CUDA?
This is happening. The PyTorch team has verified that the Triton compiler can deliver "CUDA-Free" inference: running Llama 3 models on H100 and A100, Triton-generated kernels achieved token throughput comparable to CUDA's. In February 2026, Triton introduced new multi-backend support, allowing the same code to be compiled for different hardware, including AMD GPUs, Intel GPUs, and even various ASICs.
Google's JAX framework goes further. It was designed from the start to be hardware-agnostic—the same code can run on TPUs, GPUs, or even CPUs. Anthropic's choice to use TPUs for training is largely because JAX allows them to migrate compute platforms without rewriting model code.
What does software-layer decoupling mean? It means a new generation of AI researchers might train state-of-the-art models without ever writing a single line of CUDA code. When developers are no longer locked into the CUDA ecosystem, the hard logic of "must buy NVIDIA" becomes the soft choice of "can buy NVIDIA."
The Rubin Ultra Cancellation: A Watershed of Physical Limits
Returning to the opening news. The cancellation of NVIDIA's 4-chip Rubin Ultra three months after its announcement is seen by SemiAnalysis as "manufacturing execution issues causing further market share loss."
The technical reason is not complex. The original Rubin Ultra planned to integrate 4 compute chips and 16 HBM4E memory modules within a single package using TSMC's CoWoS-L process. However, according to Global Semi Research, the 4-chip configuration suffered from package substrate warpage—the substrate bent in multiple directions, preventing the compute chips from making full contact. Signal transmission failed, rendering the chips inoperable.
TSMC's alternative, CoPoS (panel-level packaging), won't be in mass production until late 2028. NVIDIA couldn't wait—so the revised Rubin Ultra reverted to a 2-chip design, with performance nearly halved.
The symbolic significance of this event outweighs its immediate business impact.
NVIDIA will still sell every Rubin Ultra it can produce. But the "retreat from 4-chip to 2-chip" exposes a deeper problem: NVIDIA's product iteration speed is hitting the wall of physical limits. Larger chips → more complex packaging → higher defect rates → either delays or performance cuts. This is a curve that cannot extend infinitely.
Meanwhile, competitors are bypassing this wall in another way: not by making bigger chips, but by making more specialized chips.
Cracks in Pricing Power
The truly unshakable part of NVIDIA's moat is not the CUDA software ecosystem but the manufacturing side. It holds over 60% of TSMC's advanced CoWoS packaging capacity. This is a physical barrier, not a software one. Competitors can write better frameworks and design more efficient ASICs, but to catch up with NVIDIA in volume, they must first clear the TSMC capacity hurdle.
But herein lies the problem: the manufacturing barrier relies on a third-party foundry. It is not an asset NVIDIA directly controls.
And NVIDIA's 88% gross margin (an H100 costs $3,320 and sells for $28,000) is built on one premise: customers cannot leave. If that premise shifts from "cannot leave" to "the best price-performance choice," then pricing power is no longer absolute.
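The margin figure follows directly from the two H100 numbers quoted above. The sketch below reproduces it and, as a purely hypothetical exercise, asks what selling price would correspond to a lower margin; the 75% target is an arbitrary illustration, not a forecast, and the function names are made up for this example.

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of the selling price."""
    return (price - cost) / price


def price_for_margin(cost: float, margin: float) -> float:
    """Selling price that yields the given gross margin at a fixed unit cost."""
    return cost / (1.0 - margin)


H100_COST = 3_320.0    # unit cost cited in the text
H100_PRICE = 28_000.0  # selling price cited in the text

print(f"Margin at the quoted price: {gross_margin(H100_PRICE, H100_COST):.1%}")
# Hypothetical: the price at which the margin would fall to 75%.
print(f"Price for a 75% margin: ${price_for_margin(H100_COST, 0.75):,.0f}")
```

Because unit cost is small relative to price, even a large price cut leaves a margin that most hardware businesses would envy, which is why the pressure shows up in pricing power before it shows up in profitability.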
Anthropic has proven another path: not pursuing the best chip, but the most suitable chip. Using TPUs for training instead of GPUs isn't because TPUs are faster, but because Google offered enough chips at a good enough price. Using Trainium for inference instead of GPUs isn't because Trainium is stronger, but because AWS is a strategic investor, and Project Rainier bypasses the premium for general-purpose GPUs.
When the world's second-largest AI company downgrades GPUs to the smallest of its three compute platforms, "must buy NVIDIA" is no longer an ironclad rule.
NVIDIA is still the best. No leading AI company has completely abandoned it: Anthropic retains some GPUs for "frontier research exploration," OpenAI's Jalapeño handles only inference, not training, and Meta's MTIA covers only recommendation systems and content moderation.
But the gap between "only NVIDIA" and "NVIDIA is the most expensive, so use the cheaper options first" is precisely where pricing power is leaking away.
The market has already begun repricing for this possibility. This year, every bearish report from SemiAnalysis has triggered significant volatility in related sectors: early June news of SOCAMM cuts caused Micron to drop 13% in a single day, the June 10 CPO delay controversy forced NVIDIA executives to publicly refute it, and the June 30 Rubin Ultra cancellation reignited the discussion.
Behind these fluctuations, the market is struggling to answer a question it never needed to answer before: If CUDA is not irreplaceable, what is NVIDIA really worth?