Google's Next-Generation TPU Set for Launch: A Critical Strike Against NVIDIA in the AI Inference Era

Stock News · 04-20 21:43

As the AI computing battleground shifts decisively from training to inference, Alphabet is preparing to deliver a critical blow. The company plans to announce its next-generation custom AI chip, the Tensor Processing Unit (TPU), at the upcoming Google Cloud Next conference in Las Vegas this week. Amin Vahdat, who oversees Google's AI computing infrastructure and chip development, declined to comment on specific plans for an inference chip designed to accelerate AI output speed but indicated that more information would likely be shared in the "relatively near future."

This development comes as the global AI computing competition undergoes a structural shift, moving from a focus on model training to being dominated by large-scale inference. With the explosive adoption of AI applications and AI agents, the metrics for computing power are shifting from "peak performance" to "cost per token, latency, and energy efficiency"—areas where the AI ASIC approach, exemplified by the TPU, holds a distinct advantage.
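As a rough illustration of this shift in metrics, serving economics can be reduced to a few lines of arithmetic. The figures below are entirely hypothetical, not measured TPU or GPU numbers:

```python
# Illustrative serving-economics metrics for an AI accelerator.
# All input figures (hourly cost, throughput, power draw) are hypothetical.

def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars to generate one million tokens on a single accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency: tokens generated per joule of power consumed."""
    return tokens_per_second / power_watts

# Hypothetical accelerator: $2.50/hour, 2,000 tokens/s, 400 W draw.
print(cost_per_million_tokens(2.50, 2000))  # dollars per 1M tokens
print(tokens_per_joule(2000, 400))          # tokens per joule
```

Under these metrics, a chip with lower peak FLOPS can still win if its rental price and power draw are low enough relative to its sustained token throughput, which is precisely the argument made for inference ASICs.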

**TPU Accelerates Market Penetration: Google Challenges NVIDIA's Dominance**

Amid this trend, Google is mounting a direct challenge to NVIDIA, which currently holds an estimated 80% to 90% share of the AI chip market. Within just a few months, Google's in-house, massively deployed TPU chips have become one of the most sought-after commodities in the global tech industry. Leading AI developers, including some of Google's largest competitors, are actively stockpiling these chips.

With the AI inference era fully underway, surging demand for cloud-based inference compute, together with the trend of "AI micro-training" focused on integrating large models into business operations, has positioned Google's cost-effective, proprietary TPU system as a strong challenger to NVIDIA's near-monopoly. Now, the tech giant aims to build on this momentum with a new AI accelerator chip designed specifically for the wave of AI inference.

The global frenzy around generative AI and AI agent deployment has accelerated the pace of AI chip development among cloud and chip giants, who are racing to design the fastest and most energy-efficient AI compute clusters for advanced, large-scale AI data centers. Broadcom and its primary rival Marvell are leveraging their deep expertise in high-speed interconnect and chip IP to collaborate with cloud giants like Amazon.com, Alphabet, and Microsoft on custom AI ASIC compute clusters tailored to their specific data center needs. This ASIC business has become highly significant for both companies, contributing to their substantial stock price gains this year. The TPU clusters co-developed by Google and Broadcom are a prime example of this AI ASIC technical route.

Driven by significant economic and power constraints, Microsoft, Amazon.com, Alphabet, and Meta Platforms, Inc. are all pushing for in-house development of AI ASIC chips for their cloud systems. The core objective is better cost and energy efficiency for AI compute clusters. The high cost of building super-large AI data centers, akin to "Stargate" projects, is forcing tech giants to prioritize economic efficiency: minimizing cost per token and maximizing output per watt under power constraints, heralding a prosperous era for AI ASIC technology.

Furthermore, compared to the long-term supply shortages, high costs, and supply chain bottlenecks associated with advanced AI GPU clusters like those based on NVIDIA's Blackwell architecture, in-house AI ASICs can provide "secondary production capacity," offer more leverage in procurement negotiations and pricing, and improve cloud service margins. Cloud providers like Google and Microsoft can also engage in co-design across the entire stack—"chip, interconnect, system, compiler/runtime, scheduling, observability/reliability"—thereby increasing infrastructure utilization and lowering Total Cost of Ownership (TCO).

While the AI training segment, where NVIDIA GPUs are dominant, requires greater versatility and rapid iteration of the entire compute system, the AI inference segment, after the large-scale deployment of advanced AI technologies, prioritizes cost per token, latency, and energy efficiency. For instance, Google has explicitly positioned its Ironwood TPU as a generation "born for the AI inference era," emphasizing its performance, energy efficiency, cost-effectiveness, and scalability. However, Amazon's recent actions suggest that AI ASICs may also hold significant potential for training large models.

In the medium to long term, the AI ASIC compute ecosystem will undoubtedly continue to erode NVIDIA's monopoly premium and some market share, though it is unlikely to linearly replace the GPU ecosystem. The fundamental reason is that competition in the inference era is no longer just about "peak compute" but encompasses cost per token, power consumption, memory bandwidth utilization, interconnect efficiency, and the total cost of ownership achieved through hardware-software co-design. On these metrics, ASICs, customized for specific workloads with tailored dataflow, compilers, and interconnects, are inherently better positioned to achieve high cost-effectiveness than general-purpose GPUs.

The future AI data center is more likely to be heterogeneous: cutting-edge training and general-purpose cloud computing will continue to be dominated by GPUs, while ultra-large-scale internal inference, Agent workflows, and fixed high-frequency loads will increasingly shift to ASICs.

**A Decade in the Making: How Google's Internal Tool Became a Global Tech Hard Asset**

Google's long-gestating chip efforts gained unprecedented attention in October last year when Anthropic PBC, the closely watched AI developer behind the Claude models, announced an expanded compute supply agreement securing access to up to one million Google TPUs. The following month, Google launched its more advanced Gemini model to widespread acclaim, announcing it was trained and run partly on TPU platforms. Since then, demand for Google TPUs from large enterprises has only intensified.

Meta Platforms, Inc. signed a multi-year, multi-billion-dollar AI infrastructure supply agreement to use TPUs through the Google Cloud platform. Santosh Janardhan, Meta's infrastructure head, noted the company recently received its first significant shipment of cloud TPU compute and is testing the chips to determine their optimal tasks. "There does seem to be potentially a unique advantage on inference," he said, while also cautioning, "Any new platform is not without its hurdles and learning curve."

Anthropic also signed a long-term agreement with Google's TPU partner Broadcom for custom chips that will provide around 3.5 gigawatts of computing capacity starting in 2027. Citadel Securities plans to demonstrate at the Google conference how TPUs enable faster AI model training compared to their previous GPU usage. Abu Dhabi's G42 has held "multiple discussions" about using Google TPUs, according to Talal Al Kaissi, interim CEO of its cloud unit Core42, who expressed being "very bullish" on the talks.

Google is taking new steps to meet current customer demand for cloud AI compute. According to a source, the company is testing a program that would allow firms like Anthropic to run some of the TPUs they use within their own physical data centers, rather than exclusively on Google's cloud infrastructure. Vahdat added that Google now allows TPU customers to use external tools like PyTorch and third-party scheduling software, moving beyond reliance solely on Google's own products.

These changes are helping to alter the perception of the chips. Initially born from Google's own AI compute bottlenecks, TPUs were long seen as an internal tool. Their development began when Google's Chief Scientist recognized that even Google could not afford to power services like language translation and speech recognition with existing chips, at a time when performance improvements in the central processors the company relied on for AI were slowing. Google decided to build an AI accelerator focused on a narrower set of tasks, the ones incurring the highest costs in AI, even as it continued to acquire NVIDIA AI compute and general-purpose AI GPU systems. Vahdat stated the key idea behind the TPU is that it "solves a small number of problems, but the amount of other compute or general-purpose compute required for those problems is enormous." Vahdat, a former computer science professor, played an early key role in advocating for the optical circuit switches that help link TPUs into supercomputers, countering the then-conventional wisdom that "you don't need to build specialized hardware."

Over the years, Google's TPUs have evolved in tandem with its AI research. A groundbreaking 2017 research paper that catalyzed today's large language models also pushed the TPU team to focus on chips designed for training larger AI systems. Later, Google DeepMind and the chip team observed that TPUs often had significant idle capacity when used for reinforcement learning. The team adjusted the network connections between semiconductors to accelerate data flow and prevent chip idling. This dynamic adjustment continues today as Google weighs how many chips to connect in a single pod or whether hardware precision can be reduced to save costs. "A lot of this is guided by large model experimentation," a Google executive noted. Looking ahead, there is interest in developing an accelerator for edge computing scenarios to place chips closer to users, further reducing latency.
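The precision trade-off mentioned above can be sketched with back-of-envelope arithmetic. The model size below is a hypothetical 70-billion-parameter example, not any specific Google system; the point is simply that halving numeric precision roughly halves the memory and bandwidth the weights consume:

```python
# Back-of-envelope memory footprint for model weights at different precisions.
# The parameter count is illustrative, not a real deployed model.

BYTES_PER_DTYPE = {"fp32": 4, "bf16": 2, "int8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Gigabytes needed to hold the weights at the given precision."""
    return num_params * BYTES_PER_DTYPE[dtype] / 1e9

params = 70e9  # hypothetical 70B-parameter model
for dtype in ("fp32", "bf16", "int8"):
    print(dtype, weight_memory_gb(params, dtype), "GB")
```

Since the same weights must also stream through memory on every inference step, reduced precision cuts bandwidth pressure and serving cost in tandem, which is why "whether hardware precision can be reduced" is a cost question as much as an accuracy one.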

In this process, Google has also built exclusive internal AI verification systems to detect manufacturing defects faster: in a compute system where accelerator chips work in close coordination on massive mathematical operations, even a minor fault can rapidly propagate and cause a model to "completely self-destruct," according to Paul Barham, a Google distinguished scientist. He recalled an issue from about two years ago that took weeks to diagnose, calling it a "bug from hell." "We now have to do this work for hundreds of thousands of accelerator chips within 10 seconds," he said.

**The Ultimate Challenge in an Unprecedented AI Inference Boom: Supply, Technical Routes, and "Technology Island" Risks**

Despite its extensive experience in AI model development, Google faces challenges similar to those of other fabless chip giants like NVIDIA, AMD, and Broadcom: chip development from start to finish typically takes about three years, while AI models evolve much faster. This makes predicting customer needs several years out difficult. "If someone claims they know what Gemini 10 will look like, I'd just say, 'Please give me some of what you're smoking,'" a Google executive remarked.

Barham also worries that the tight feedback loop between AI model creators and hardware designers risks missing new ideas, creating a "cycle that traps you in the patterns where your current software and hardware work well." To balance this, the TPU development team sometimes aims for chips that are "good enough" for various uses, even if not perfect for any single one. Vahdat mentioned that another option is planning two different designs; both might ship if their respective use cases are compelling enough.

As Google's chips grow in popularity, the company faces supply constraints akin to NVIDIA's. An anonymous startup executive stated that their company's TPU usage has been limited by availability, complaining that Google effectively directs all readily available TPU chips to Anthropic. "To a large extent, we are prioritizing existing supply to the more elite teams because, obviously, those teams might be the ones who can push the TPU to its limits on what it's best at," a Google executive said regarding top AI firms.

Looking forward, Google will also need to decide how to allocate TPUs between its own growing, competitive AI model infrastructure services and its expanding client roster. "There are some benefits to building TPUs exclusively for Google, but there are also substantial drawbacks," Vahdat said. "Ultimately, you end up with what we call a 'technology island.' It might be a beautiful island, but its population is limited, its diversity is limited. In the end, it's likely to become less good."

Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation to acquire or dispose of any financial products; nor should any associated discussions, comments, or posts by the author or other users. It is provided for general informational purposes only and does not take into account your investment objectives, financial situation, or needs. TTM assumes no responsibility or warranty for the accuracy and completeness of the information; investors should do their own research and may seek professional advice before investing.
