Who Is Consuming the Compute Power Behind 5 Trillion Tokens?

Deep News · 14:54

Data released by the online AI model hosting platform OpenRouter showed that for the week of February 9 to 15, 2026, weekly token usage of Chinese large language models reached 4.12 trillion, surpassing the 2.94 trillion tokens of US models for the first time in history. OpenRouter aggregates global large model APIs and is often regarded as a barometer of the worldwide popularity and real-world usage intensity of AI models.

In the following week of February 16 to February 22, the weekly token usage of Chinese models surged further to 5.16 trillion tokens, marking a 127% increase over three weeks. Among the top five models globally by weekly usage that week, four were Chinese: MiniMax M2.5, Moonshot AI's Kimi K2.5, KNOWLEDGE ATLAS's GLM-5, and DeepSeek's DeepSeek V3.2.

Although 47.17% of OpenRouter's users are in the US and Chinese developers account for only 6.01%, the data indicates that overseas developers are rapidly becoming more willing to use Chinese models. Token usage is a core metric of a large model's utilization intensity, commercial value, and penetration depth. High-frequency usage by international developers signals that the AI industry's focus is shifting from one-time training costs to high-frequency, routine inference in applications.

This shift in application focus is directly leading to changes in downstream procurement standards, creating opportunities for domestic AI chip manufacturers to accelerate their market entry.

To understand where these 5 trillion tokens are flowing, it helps to look at how users interact with AI. According to the "2025 AI Usage Report" jointly released by OpenRouter and venture capital firm a16z, the share of tokens used for programming tasks on the platform rose from 11% at the beginning of 2025 to over 50%, making programming the largest single usage category. The change reflects a transition in AI application patterns from a "Q&A mode" to an "Agent mode."

In the early Q&A mode, a user asks a question and the model provides an answer, with each interaction consuming a few hundred to a few thousand tokens. Consumption stops when the user stops asking. However, in Agent mode, AI continuously executes multi-step tasks in the background. A representative from a Shanghai-based computing chip manufacturer explained that in programming scenarios, an agent, after receiving an instruction, goes through cycles of writing code, running tests, identifying errors, self-correcting, and running again. To allow the machine to remember previous operations, each call requires carrying the complete conversation history.
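To make that dynamic concrete, the following minimal Python sketch models an agent loop in which every call re-sends the accumulated history; the per-round token sizes are illustrative assumptions, not measurements from any particular model:

```python
# Sketch of Agent-mode token growth; SYSTEM_PROMPT and PER_ROUND_OUTPUT are
# assumed sizes for illustration only.
SYSTEM_PROMPT = 1_000      # tokens of instructions and tool definitions
PER_ROUND_OUTPUT = 1_500   # tokens of code, test output, and corrections per cycle

def input_tokens_billed(rounds: int) -> int:
    """Total input tokens across all calls when each call carries full history."""
    total, history = 0, SYSTEM_PROMPT
    for _ in range(rounds):
        total += history              # the whole conversation so far is re-sent
        history += PER_ROUND_OUTPUT   # this round's results join the history
    return total

for r in (1, 10, 30):
    print(f"{r:>2} rounds -> {input_tokens_billed(r):>9,} input tokens")
# 1 round:     1,000 | 10 rounds:    77,500 | 30 rounds:   682,500
```

Because the history grows linearly while being re-sent on every call, cumulative input tokens grow quadratically with the number of rounds, which is why a single agent session can dwarf hundreds of one-off Q&A exchanges.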

For instance, newer domestic models like KNOWLEDGE ATLAS's GLM-5 now support ultra-long context windows of 200K tokens. This pattern of multi-round self-correction and toolchain cascading drives a geometric increase in token throughput per active session. Multimodal applications push consumption up further: public data indicates that generating a 10-second 1080p video with the popular Seedance 2.0 video model consumes approximately 350,000 tokens, and video generation consumes hundreds of times more tokens per unit time than traditional text-based Q&A.
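Taking these figures at face value, a quick comparison against an ordinary text exchange (the 2,000-token Q&A size below is an assumption, not from the report) shows the per-interaction gap:

```python
# Per-interaction consumption gap; the Q&A size is an illustrative assumption.
video_tokens = 350_000   # one 10-second 1080p Seedance 2.0 clip, per public data
qa_tokens = 2_000        # assumed size of a single text question-and-answer turn
print(f"One short video ~= {video_tokens // qa_tokens}x one Q&A exchange")  # 175x
```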

The current trillions of tokens in usage are no longer primarily driven by testing scenarios but are supported by a batch of high-frequency, scalable, and sustainably paid commercial applications. According to an analysis of the domestic large model commercial ecosystem, the main paying sectors currently include internet services, finance, cross-border e-commerce, and the entertainment industry.

Specific application scenarios encompass three categories: First, enterprise-level applications like intelligent customer service, smart marketing, code-assisted development, and office automation tools, which have seen large-scale deployment in finance, e-commerce, and gaming. Second, generative content services within internet platforms, including intelligent search, conversational assistants, and virtual characters. Third, AIGC production tools, such as short video script generation, advertising copy creation, and cross-border e-commerce product description generation. The common feature of these industries is a high proportion of text or multimodal content generation needs in their business processes, and the ability to bear the computational costs associated with large model services.

At the "Domestic Ten-Thousand-Card Computing Power Empowering Large Model Development Seminar" held in Zhengzhou on February 10, 2026, a researcher from the Chinese Academy of Sciences said that the core drivers of industry development remain massive computing power, big data, and large parameter counts. However, as the performance gains from scaling up model parameters hit a bottleneck, the industry is moving toward agents, synthetic data, and inference computing. In his view, data determines the ceiling of AI, and the environment will dictate the direction of model evolution.

The change in application patterns explains the surge in token usage, but why are domestic models able to handle this global high-frequency demand under the new paradigm? The aforementioned chip manufacturer representative noted that domestic models like MiniMax M2.5 and Kimi K2.5 widely adopt a Mixture of Experts (MoE) architecture. Unlike traditional dense models that activate all parameters for every computation, the MoE architecture activates specific expert networks on demand, reducing GPU memory usage during inference by approximately 60% and significantly improving throughput.

Dense models require every parameter in the neural network to participate in computing each input request, meaning computational and memory requirements grow linearly with model size. The MoE architecture changes this by dividing model parameters into functional groups, or "experts." A routing system identifies the task requirement and activates only the relevant parameters. This division of labor allows models to maintain a large parameter count while drastically reducing the effective compute required per inference.
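As a conceptual sketch only, not any vendor's actual implementation, a top-k routed MoE layer can be expressed in a few lines of PyTorch; the expert count, hidden sizes, and k=2 routing below are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k Mixture-of-Experts layer: each token runs only k of n experts."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # pick top-k experts
        weights = F.softmax(weights, dim=-1)                # normalize their gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):       # dispatch token subsets
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = MoELayer(dim=64)
y = layer(torch.randn(16, 64))  # 16 tokens; each activates 2 of 8 expert FFNs
```

Only the parameters of the selected experts are touched per token, which is how a model can keep a very large total parameter count while its per-inference compute and active memory footprint stay close to those of a much smaller dense model.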

Technical optimizations are directly reflected in pricing. The input price for Chinese models is currently around $0.3 per million tokens, while some comparable overseas products charge around $5. Electricity costs are another variable: computing nodes in western China pay roughly ¥0.2 to ¥0.3 per kWh, compared with ¥1 to ¥1.5 in Europe and the US.
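A back-of-envelope calculation at those list prices (treating the full weekly volume as input tokens, which oversimplifies real billing but preserves the ratio) illustrates the spread:

```python
# Illustrative cost spread at the quoted input prices; actual bills depend on
# the input/output mix, prompt caching, and volume discounts.
weekly_tokens = 5.16e12                  # the week of February 16 to 22
price_cn, price_overseas = 0.30, 5.00    # USD per million input tokens
for label, price in (("Chinese models", price_cn), ("overseas models", price_overseas)):
    print(f"{label}: ${weekly_tokens / 1e6 * price:,.0f} per week")
# Chinese models: $1,548,000 per week
# overseas models: $25,800,000 per week
```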

This cost advantage has contributed to a rebalancing of supply and demand. During the 2026 Chinese New Year period, domestic models were released in a concentrated wave. On February 11, KNOWLEDGE ATLAS launched its foundation model GLM-5; on February 12, MiniMax open-sourced its text model M2.5; on February 14, ByteDance released its Doubao large model 2.0 series. Concurrently, major tech companies launched AI application promotions: Baidu invested ¥500 million, Tencent's Yuanbao offered ¥1 billion, and Alibaba introduced a ¥3 billion free-usage plan.

Intensive application deployment has in turn driven a sharp increase in computing power consumption. Public data shows China's daily token consumption climbed from 100 billion tokens in early 2024 to around 1.8 trillion by February 2026. Driven by this explosive demand, domestic large model vendors, previously locked in price wars, began collectively shifting strategy. On February 12, 2026, for example, KNOWLEDGE ATLAS AI announced an increase in its API call prices alongside its new model release, with some overseas subscription prices rising 30% to 60% and API call prices increasing by up to 100%. KNOWLEDGE ATLAS responded that rapid growth in user scale and call volume necessitated increased computing power investment. Meanwhile, Moonshot AI's Kimi K2.5 saw its overseas revenue exceed domestic revenue within a month of release.

This indicates that large model companies are moving away from loss-leading price wars and beginning to generate substantial business revenue. The flow of 5 trillion tokens shows that AI is transforming from a simple dialog box into an industrial process running automatically in the background for finance, e-commerce, and programming scenarios.

Looking upstream along this exponentially growing data stream, the criteria for selecting computing hardware by intelligent computing centers that carry these tasks are also changing.

The core metric in the computing power market is shifting from acquiring computing cards to calculating the cost per unit of output. In 2025, China's GPU computing power rental market experienced price declines. For instance, the rental price for an NVIDIA H100 computing card dropped from a peak of over ¥90 per hour to between ¥15 and ¥20; the price for an A100 fell to ¥3 to ¥5 per hour.

This price trend reflects a change in procurement logic. In the early stages of large model development, when high-performance chips were scarce, the market was in a resource-hoarding phase that chased peak computing power per card. As inference workloads become routine, however, companies have started calculating Total Cost of Ownership (TCO). Clients no longer focus solely on the absolute peak computing power of a single card; they calculate how much throughput each yuan invested buys and how many tokens are processed per watt consumed.
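That calculation can be made concrete. A minimal TCO sketch, in which the throughput figure is a pure assumption rather than a benchmark of any particular card, converts an hourly rental price into a cost per million tokens:

```python
# Hypothetical cost-per-output sketch; the 2,500 tokens/s figure is an assumed
# sustained batched-inference throughput, not measured data.
def yuan_per_million_tokens(rent_yuan_per_hour: float, tokens_per_second: float) -> float:
    return rent_yuan_per_hour / (tokens_per_second * 3600) * 1e6

print(f"~¥{yuan_per_million_tokens(18, 2_500):.2f} per million tokens")  # ~¥2.00
```

On this metric, a cheaper card with lower peak FLOPS can beat a premium card if its sustained tokens per second per yuan of rent, or per watt of power, is higher.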

The current demand for computing power exhibits dual characteristics of inference and rendering. Beyond text generation, scenarios like AI agent cloud platforms, cloud phones, digital twins, and industrial simulation, which require real-time interaction, are driving procurement of full-feature GPUs. As the industry enters the inference phase, domestic chips are finding an entry window—while the training phase heavily relies on NVIDIA's CUDA ecosystem, inference tasks focus more on energy efficiency, stability, and supply security.

An analyst from TrendForce projects that inference AI servers will account for 44% of overall shipments in 2026, up 3 percentage points from 2025. Compared with large model training clusters that pursue computing density, inference servers place greater emphasis on cost-performance and energy efficiency in their underlying hardware architecture. The stringent requirements for advanced packaging and HBM at the inference end have been somewhat relaxed. This relaxation of specifications benefits Chinese local enterprises, allowing them to develop inference chips with medium-scale computing power despite restrictions on HBM access, and creating opportunities in areas like internet services and vehicle systems.

As the industry's focus shifts to inference and fine-tuning, the investment structure for enterprise clients planning intelligent computing center budgets has also changed. In the past, during the training phase, investment was concentrated on high-end training GPU clusters, ultra-high-speed interconnection networks, and high-performance storage systems to meet long-duration, high-parallelism training needs. With the growth of inference demand, companies are increasingly adopting inference-optimized GPUs, domestic AI chips, or heterogeneous computing combinations. The proportion of investment in software platforms, computing power scheduling, model optimization tools, and inference acceleration frameworks is gradually increasing.

In terms of underlying software ecosystems, domestic AI chips are transitioning from merely usable to large-scale commercial use. Major domestic chip manufacturers have established basic toolchain systems, including driver layers, compilers, operator libraries, and runtime environments, capable of supporting large model training, inference, and intelligent computing center deployment.

Computing power procurement has also moved from single-card testing to system-level engineering delivery. Because inference tasks are extremely sensitive to response latency, in clusters at ten-thousand-card scale, network communication and cooling capabilities working in concert are often more decisive than the benchmark score of any single computing card.

On February 5, 2026, the Zhengzhou core node of the National Supercomputing Internet officially began trial operation. The node deploys three scaleX ten-thousand-card super clusters provided by Dawning Information Industry Co., Ltd., offering computing power from more than 30,000 domestic accelerator cards, making it the first domestically developed AI computing pool in China to reach a 30,000-card deployment in actual operation.

The launch of the Zhengzhou core node validates the engineering capabilities of domestic computing infrastructure, and the industry has shifted from early isolated breakthroughs to large-scale deployment. Previously, vendors each had their own hardware designs, software stacks, and interconnection protocols, making it difficult to schedule computing resources across platforms. Deploying a ten-thousand-card cluster is not just an IT challenge but a cross-disciplinary engineering challenge spanning cooling and power supply; a weak link in any single technology directly drags down the efficiency of the entire system. Current system development has broken through the bottleneck of integrating traditional IT with these other engineering disciplines.

This domestic computing system has already completed adaptation for thousands of applications. According to the operation director of the Zhengzhou node, 645 third-party vendors have adapted to the node in the first phase of the National Supercomputing Internet, with over 7,200 software packages and source-code repositories connected. Supported by these resources, over 70% of domestic new energy vehicles run fluid dynamics and collision simulation experiments on this platform. For instance, to prepare for potential regulatory changes banning hidden door handles in new energy vehicles, car manufacturers need large clusters to simulate the impact on energy efficiency and wind resistance if handles are switched to an open style.

Additionally, this cluster supports the R&D of top-tier fabrics for domestic down jackets and provides intelligent computing resources for international luxury brands to optimize their designs within China.

This pattern of explosive downstream application growth forcing upstream infrastructure upgrades is also translating into revenue for domestic chip manufacturers. According to the results of a China Mobile AI general computing equipment procurement project in early 2026, valued at over ¥5 billion, 7,499 inference AI servers were procured. Huawei's Ascend ecosystem vendors secured ¥3.4 billion of the order, and the share of domestic companies like Kunlun Technology increased significantly.

The performance of domestic computing power manufacturers has also grown explosively over the past year. On February 27, Cambricon Technologies Corporation Limited reported 2025 revenue growth of 453.21%, reaching ¥6.497 billion, and a net profit of ¥2.059 billion, the company's first annual profit since listing. Cambricon attributed the growth to rising computing power demand in the AI industry and the rollout of application scenarios. Moore Threads, MetaX, and Hygon also reported significant performance improvements for 2025.

The founder and CEO of a domestic high-performance GPU chip provider stated that physical AI is approaching a critical inflection point, and its realization path depends on a closed loop from virtual to real. Graphics rendering, as the foundation for building simulation and digital twins, is the first step in connecting AI with the physical world. The company has established a foothold in AI inference and cloud rendering, with its products already deployed in over ten leading internet companies, several operators, and central state-owned enterprises.

According to Bernstein's "2025 China AI Chip Industry Report," the market penetration rate of local Chinese AI chip brands increased from about 29% in 2024 to 42% in 2025. Behind the sustained consumption of trillions of tokens, domestic computing power, through adaptation to industry applications and optimization of cost structures, is completing the leap from marginal alternative to market-first choice.

Disclaimer: Investing carries risk. This is not financial advice. The above content should not be regarded as an offer, recommendation, or solicitation to acquire or dispose of any financial products; any associated discussions, comments, or posts by the author or other users should not be considered as such either. It is for general information purposes only and does not take into account your own investment objectives, financial situation, or needs. TTM assumes no responsibility or warranty for the accuracy and completeness of the information; investors should do their own research and may seek professional advice before investing.
