The structure of AI computing demand is being reshaped, and in the race to secure a position in the inference era, domestic computing power manufacturers are stepping up their investments. A market consensus has formed that the tipping point for AI inference has arrived. Jensen Huang, Founder and CEO of NVIDIA, has suggested that the scale of AI inference will soon reach billions of times that of training workloads. Xu Bing, Chairman of AI inference chip company Xiwang, believes that by 2026 demand for AI inference computing will be 4 to 5 times that of training, and notes that leasing prices for inference computing power have risen nearly 40% in half a year. Market research firm IDC predicts that by 2028, inference workloads will account for 73% of total AI computing workloads. As applications such as the OpenClaw intelligent agent scale up, they will further push computing demand toward the inference side. Leading manufacturers are moving in close alignment, and their actions indicate that the focus of AI computing power is gradually shifting from training to inference, a shift that domestic computing power manufacturers cannot afford to ignore.
**The New Wave of AI Inference** AI computing is broadly divided into two stages: first the model is trained, a process that can take days or even weeks; then the trained model responds to real requests by performing inference. Training is a one-time, batch-oriented investment, sensitive to single-card peak computing power and cluster scale. Inference is a continuous, fragmented operating expense, more sensitive to latency, concurrency, and cost per token. As intelligent agents accelerate their penetration into enterprise applications, inference computing has become a fiercely contested area. Unlike the call-and-response mode of traditional conversational AI, intelligent agents typically require multiple rounds of inference, tool calls, and long-context memory to complete a task, and a single task can consume tens of times more tokens than a traditional dialogue. Currently, while NVIDIA GPUs dominate the training market, most inference tasks are still handled by CPUs. GPUs are fast and powerful, capable of executing billions of simple operations simultaneously, but their primary use is in training. Inference typically requires less raw computing power than GPUs provide yet demands more memory. Insufficient memory creates a bottleneck in which the chip cannot fetch data fast enough, forcing users to wait longer for model responses, a delay users find hard to tolerate.
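The memory point can be made concrete with a back-of-the-envelope calculation: generating each new token requires streaming essentially all model weights from memory, so per-token latency is usually bounded by memory bandwidth rather than peak FLOPS. The sketch below is a rough illustration in Python; the 70B-parameter model, 2 TB/s bandwidth, 300 TFLOPS figure, and the 30x agent token multiplier are illustrative assumptions, not figures from this article.

```python
# Back-of-the-envelope sketch of why token generation tends to be memory-bound.
# All hardware and model numbers below are illustrative assumptions.

def per_token_decode_time_s(params: float, bytes_per_param: float,
                            mem_bandwidth_bps: float, peak_flops: float) -> float:
    """Lower-bound time to generate one token with a dense model.

    Each decoded token must stream roughly params * bytes_per_param bytes of
    weights from memory and perform about 2 * params FLOPs; the slower of the
    two paths dominates.
    """
    memory_time = params * bytes_per_param / mem_bandwidth_bps
    compute_time = 2 * params / peak_flops
    return max(memory_time, compute_time)

# Hypothetical 70B-parameter model in FP16 on a chip with 2 TB/s memory
# bandwidth and 300 TFLOPS of FP16 compute (assumed, for illustration only).
t = per_token_decode_time_s(70e9, 2, 2e12, 300e12)
print(f"per-token decode time ≈ {t * 1e3:.0f} ms "
      f"(memory path {70e9 * 2 / 2e12 * 1e3:.0f} ms vs compute path {2 * 70e9 / 300e12 * 1e3:.2f} ms)")

# An agent task that consumes ~30x the tokens of a single chat reply scales
# decode cost roughly linearly with that multiplier.
chat_tokens, agent_multiplier = 800, 30
print(f"one chat reply ≈ {chat_tokens} tokens; one agent task ≈ {chat_tokens * agent_multiplier:,} tokens")
```

In this toy case the memory path dominates the compute path by two orders of magnitude, which is why adding raw FLOPS does little for single-stream response time while more memory capacity and bandwidth do.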
User expectations for AI inference latency are in fact very high. Li Wentao, Asia-Pacific Cloud Computing Architect Director at cloud service provider Akamai, explained that for first-token latency, gaming users often expect the first token within 15 milliseconds, e-commerce within about 20 milliseconds, agent-based self-service within roughly 50 milliseconds, and automated customer-service bots within about 100 milliseconds. Because latency requirements vary so widely across scenarios, no single general-purpose inference chip specification can cover all workloads at once; hardware manufacturers must trade off throughput, latency, and cost. Jensen Huang believes the value of inference tokens has risen significantly, creating the conditions for tiered pricing based on response speed. Taking software engineers as an example, he said such high-value users are willing to pay for lower-latency tokens to boost productivity. To this end, NVIDIA has integrated Groq into its CUDA ecosystem, opening a faster-response but lower-throughput inference niche alongside the traditional high-throughput path to serve speed-sensitive high-end demand. Huatai Securities stated that cloud services are entering a price-hike cycle, further reinforcing the scarcity of computing resources. Against this backdrop, the collaborative optimization of domestic models and domestic hardware continues to advance, and domestic accelerator cards and super-node solutions are entering a phase of intensive deployment. Both the boom in domestic computing power and the process of domestic substitution are expected to keep strengthening.
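One way providers act on these tiers is to route requests to different serving pools and price them accordingly. The minimal sketch below encodes the first-token latency budgets quoted above and attaches serving pools and price multipliers; the pool names and multipliers are assumptions for illustration, not any vendor's actual pricing.

```python
# Minimal sketch of SLA-based routing: pick a serving pool (and price tier)
# from a request's first-token latency budget. Latency budgets mirror the
# figures quoted above; pools and price multipliers are purely illustrative.

from dataclasses import dataclass

@dataclass
class Tier:
    pool: str                 # which serving fleet handles the request
    price_multiplier: float   # surcharge for lower-latency tokens (assumption)

# First-token latency budgets in milliseconds, per the scenarios above.
SLA_MS = {"gaming": 15, "ecommerce": 20, "agent_self_service": 50, "customer_service_bot": 100}

def route(scenario: str) -> Tier:
    budget = SLA_MS[scenario]
    if budget <= 20:                              # speed-sensitive traffic
        return Tier("low_latency_pool", 2.0)      # small batches, premium price
    if budget <= 50:
        return Tier("balanced_pool", 1.3)
    return Tier("high_throughput_pool", 1.0)      # batch-friendly, cheapest per token

for scenario in SLA_MS:
    print(scenario, route(scenario))
```

The design choice mirrors the text: speed-sensitive traffic lands on a low-latency, lower-throughput pool at a premium, while latency-tolerant traffic stays on the cheapest high-throughput pool.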
**Chip Factions Make Their Moves** Responding to the explosive growth in inference computing demand, Google is the latest tech giant to act. At the Google Cloud Next '26 conference, Google launched two new products in its eighth-generation TPU series: the TPU 8t for training and the TPU 8i for inference. This is the first time in the TPU's history that the architecture has been split along training and inference lines. The TPU 8i has drawn particular attention: it targets real-time AI inference and focuses on complex application scenarios such as multi-agent collaboration. To deliver faster task response, the TPU 8i emphasizes optimized memory configuration, higher on-chip data throughput, lower data-transmission latency, and more efficient inter-chip communication. According to Google, thanks to these architectural optimizations, the TPU 8i offers nearly an 80% improvement in performance per dollar on inference tasks, meaning that at the same computing cost, enterprises can support larger-scale concurrent AI calls. Amin Vahdat, Senior Vice President of Google Cloud AI & Infrastructure and Chief Technology Officer, pointed out, "With the rise of AI agents, we believe the entire community will benefit if chips can be personalized according to the needs of training and serving."
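The performance-per-dollar claim translates into concurrency or cost in a straightforward way. In the toy calculation below, only the roughly 1.8x ratio comes from the claim above; the baseline concurrency figure is invented for illustration.

```python
# What "~80% better performance per dollar" implies at a fixed budget.
# Only the 1.8x ratio comes from the claim quoted above; the baseline is made up.

baseline_concurrent_calls = 1_000   # hypothetical concurrent inference calls per unit of spend today
perf_per_dollar_gain = 1.8          # ~80% improvement claimed for the inference chip

new_concurrent_calls = baseline_concurrent_calls * perf_per_dollar_gain
cost_reduction = 1 - 1 / perf_per_dollar_gain

print(f"same spend: ~{new_concurrent_calls:,.0f} concurrent calls instead of {baseline_concurrent_calls:,}")
print(f"or same traffic at ~{cost_reduction * 100:.0f}% lower cost")
```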
While overseas giants are frequently making moves, domestic computing power manufacturers are also tracking technological trends closely. Observations indicate that Chinese companies are not simply following the paths of overseas giants but are carving out differentiated development paths suited to local needs, leveraging their own technological foundations and domestic application scenarios. China's AI training and inference demands are currently experiencing explosive growth, and the nation's total computing power has risen to second place globally, accounting for over 30% of the world's total. The internationally known investment research firm Bernstein notes that domestic AI chips, represented by Huawei's Ascend and Cambricon's Siyuan series, are rising rapidly and their industry standing keeps climbing. Technically, Huawei's inference products reflect a prefill-decode separation approach, while Cambricon emphasizes an integrated architecture and ecosystem for both training and inference. Cambricon has iterated its hardware to the fifth-generation MLUarch microarchitecture; its 7nm Siyuan 590 chip cluster delivers 2.048 PFLOPS of FP16 computing power and supports Chiplet heterogeneous integration and MLU-Link 8-card interconnection, with performance benchmarked against mainstream international products. Research on a new-generation microarchitecture and instruction set continues, focused on optimizing large model training and inference scenarios. Cambricon's technical route rests on two pillars. The first is its self-developed instruction set, iterated to its fourth commercial generation since 2016; the same instruction set supports both training and inference and covers cloud, edge, and device scenarios, providing the underlying foundation for a unified software ecosystem. The second is the training-inference integrated software platform Cambricon Neuware, which consolidates the underlying software stack and integrates deeply with mainstream frameworks such as TensorFlow and PyTorch to shorten users' cycle from model development to deployment. On the customer side, the Siyuan 590 has been commercially deployed in thousand-card-level clusters at major internet companies.
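For readers unfamiliar with the prefill-decode separation mentioned above, the idea is to run the compute-heavy prompt pass and the memory-bound token-by-token generation on separate worker pools, handing off the KV cache in between. The toy sketch below illustrates only the general pattern; it is not Huawei's (or any vendor's) actual implementation, and all names in it are made up.

```python
# Toy sketch of prefill-decode disaggregation: the compute-heavy prompt pass
# and the memory-bound token-by-token generation run on separate worker pools,
# handing off the KV cache in between. Illustrative pattern only.

def prefill_worker(request: dict) -> dict:
    # Compute-bound phase: run the whole prompt through the model once and
    # produce the KV cache that the decode phase will reuse.
    request["kv_cache"] = f"kv-cache({len(request['prompt'])} prompt chars)"  # placeholder
    return request

def decode_worker(request: dict, max_new_tokens: int = 4) -> list[str]:
    # Memory-bandwidth-bound phase: each new token re-reads the weights and
    # the KV cache, so it benefits from high-bandwidth memory, not peak FLOPS.
    assert "kv_cache" in request, "decode pool only accepts prefilled requests"
    return [f"tok{i}" for i in range(max_new_tokens)]

# In a disaggregated deployment these two stages would run on different
# machines, sized and scheduled independently of each other.
incoming = [{"prompt": "Summarize the quarterly report."},
            {"prompt": "Draft a reply to the customer."}]
prefilled = [prefill_worker(r) for r in incoming]
for request in prefilled:
    print(request["kv_cache"], "->", decode_worker(request))
```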
Besides Huawei and Cambricon, other domestic manufacturers are also pursuing differentiated strategies on the inference track. Companies such as Moore Threads continue to advance along the general-purpose GPU route, focusing on breaking through technical bottlenecks in multi-card interconnection and software toolchains for inference scenarios. AI chip companies such as Xiwang target specific segments, including recommendation systems, long-context inference, and edge-side deployment, to improve efficiency and reduce costs, looking for market opportunities outside the general market dominated by the giants. A more critical challenge lies in the ecosystem. After nearly two decades of accumulation, CUDA has built a complete system spanning a programming model, core libraries, distributed frameworks, optimization tools, inference engines, and native support for mainstream frameworks, and this constitutes NVIDIA's deepest moat. Huawei announced last year that its CANN compiler and Mind series suites would be open-sourced by the end of 2025, and Cambricon is likewise continuing to open up its Neuware toolchain, with the aim of lowering the migration barrier for developers.
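The ecosystem point is ultimately about how much user code has to change when the accelerator changes. A minimal sketch, assuming the vendor ships a PyTorch backend plugin (for example torch_npu for Ascend or torch_mlu for Cambricon; treat these package and device names as assumptions to verify against the toolchain you actually install): most model code can stay identical, with only device selection differing.

```python
# Minimal sketch of why the framework layer is where the migration battle is
# fought: if a vendor ships a PyTorch backend plugin, most model code only
# needs the device selection to change. Vendor device names are assumptions;
# this example falls back to CPU so it runs anywhere.

import torch
import torch.nn as nn

def pick_device() -> torch.device:
    if torch.cuda.is_available():      # NVIDIA / CUDA path
        return torch.device("cuda")
    # Vendor plugins (e.g. torch_npu for Ascend, torch_mlu for Cambricon)
    # register extra device types when imported; fall back to CPU here.
    return torch.device("cpu")

device = pick_device()
model = nn.Linear(1024, 1024).to(device)        # identical model code on every backend
x = torch.randn(8, 1024, device=device)
print(model(x).shape, "on", device)
```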