Yang Zhilin Responds: Kimi K2 Trained on H800 GPUs – But "Only Cost $4.6M"?

Deep News · 2025-11-11

The claim that Kimi K2 Thinking was trained for just $4.6 million has sparked debate. Yang Zhilin, co-founder of Moonshot AI, clarified that the figure is not an official number, and that training cost is hard to quantify because research and experimentation account for a significant share of the spending.

The team revealed that training ran on NVIDIA H800 GPUs connected over InfiniBand, using fewer cards than the industry giants but squeezing maximum efficiency out of each one. Even with that smaller footprint, Kimi K2's performance and cost-effectiveness are driving a migration wave in Silicon Valley.

Investor Chamath Palihapitiya said his new company has shifted AI workloads to Kimi K2 for its speed and affordability. Vercel's CEO cited internal tests in which Kimi K2 was 5x faster and 50% more accurate than the closed-source models it was compared against. Some Claude Code users are likewise reconfiguring their setups to run on Kimi K2.

Comparisons to DeepSeek V3’s reported $5.6M training cost have fueled debates: Are closed-source giants’ valuations justified when open-source alternatives deliver equal or better performance at lower costs? Some argue Moonshot AI itself deserves reevaluation.

**How Kimi K2 Achieved Efficiency**

Technical analyses highlight how Kimi K2 refines an open-source foundation, particularly its architectural similarity to DeepSeek. Key tweaks, summarized in the configuration sketch after this list, include:

- Expanding the pool of MoE experts from 256 to 384 for greater knowledge capacity.
- Reducing activated parameters per token from ~37B to 32B to cut inference cost.
- Enlarging the vocabulary to 160k tokens and trimming dense feed-forward blocks for computational efficiency.
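To make those deltas concrete, here is a minimal configuration sketch comparing the figures cited above for DeepSeek V3 and Kimi K2. The dataclass, its field names, and the DeepSeek vocabulary figure are illustrative assumptions for this comparison, not taken from either model's released code.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    """Illustrative MoE hyperparameters (field names are ours, not Moonshot's)."""
    total_experts: int         # routed experts available in each MoE layer
    activated_params_b: float  # parameters actually used per token, in billions
    vocab_size: int            # tokenizer vocabulary size

# Figures as cited in the analyses above; DeepSeek V3's vocabulary (~129k) is an
# approximation that does not come from this article.
deepseek_v3 = MoEConfig(total_experts=256, activated_params_b=37.0, vocab_size=129_000)
kimi_k2     = MoEConfig(total_experts=384, activated_params_b=32.0, vocab_size=160_000)

if __name__ == "__main__":
    more_experts = kimi_k2.total_experts / deepseek_v3.total_experts - 1
    fewer_active = 1 - kimi_k2.activated_params_b / deepseek_v3.activated_params_b
    print(f"~{more_experts:.0%} more routed experts, "
          f"~{fewer_active:.0%} fewer activated parameters per token")
```

Run as-is, this prints roughly "50% more routed experts, ~14% fewer activated parameters per token", which is the trade-off the analyses describe: more total capacity, less compute spent per token.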

Engineering innovations played a pivotal role:

- The proprietary *MuonClip* optimizer kept gradients stable, achieving "zero training crashes" across 15.5T tokens without manual intervention.
- Quantization-aware training (QAT) enabled native INT4 inference, roughly doubling speed with minimal performance loss (see the fake-quantization sketch after this list).
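For context on the QAT point, the sketch below shows the generic fake-quantization trick behind quantization-aware training: weights are rounded onto a 4-bit grid during the forward pass so the model learns to tolerate INT4 precision. This is a textbook numpy illustration assuming symmetric per-row scaling, not Moonshot's actual implementation.

```python
import numpy as np

def fake_quant_int4(w: np.ndarray) -> np.ndarray:
    """Simulate symmetric per-output-channel INT4 quantization of a weight matrix.

    During QAT the forward pass uses these "fake-quantized" weights, while the
    backward pass typically routes gradients straight through to the full-precision
    copy (straight-through estimator). Illustrative only.
    """
    qmax = 7  # signed 4-bit range is [-8, 7]; use the symmetric part [-7, 7] here
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # one scale per output row
    scale = np.where(scale == 0, 1.0, scale)             # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax)        # snap to the integer grid
    return q * scale                                     # dequantize back to float

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 8)).astype(np.float32)
    w_q = fake_quant_int4(w)
    print("max abs quantization error:", float(np.abs(w - w_q).max()))
```

Training against such fake-quantized weights is what lets a model later run natively in INT4 without the accuracy cliff that post-hoc quantization often causes.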

**Moonshot AI’s Reddit AMA Highlights**

In a roughly three-hour AMA on r/LocalLLaMA, co-founders Yang Zhilin, Zhou Xinyu, and Wu Yuxin answered around 200 questions:

- **Next-gen architecture (K3):** An experimental KDA (Kimi Delta Attention) hybrid attention mechanism outperformed full attention with RoPE in both speed and benchmark scores, and may debut in K3 (an illustrative layer layout is sketched after this list).
- **Roadmap:** A Claude Code-like *Kimi Code* is underway; vision-language models are in development but delayed by data challenges, and longer context windows may return once costs are brought down.
- **K2’s quirks:** The team acknowledged the model’s tendency to "overthink" and pledged to streamline its reasoning in future versions.
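The AMA did not spell out how the KDA hybrid is arranged, but "hybrid" attention designs typically interleave an efficient attention variant with periodic full-attention layers. The sketch below illustrates that general idea only; the function, the layer labels, and the 1-in-4 ratio are assumptions, not a confirmed K3 design.

```python
def hybrid_attention_schedule(num_layers: int, full_attn_every: int = 4) -> list[str]:
    """Illustrative layer schedule for a hybrid-attention Transformer.

    Most layers use an efficient attention variant (labelled "kda" here) while every
    `full_attn_every`-th layer keeps standard full softmax attention, preserving
    global token mixing at a fraction of the per-token cost. The ratio is assumed.
    """
    return ["full" if (i + 1) % full_attn_every == 0 else "kda"
            for i in range(num_layers)]

if __name__ == "__main__":
    print(hybrid_attention_schedule(8))
    # -> ['kda', 'kda', 'kda', 'full', 'kda', 'kda', 'kda', 'full']
```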

When asked about Kimi’s refusal to over-praise users, they attributed it to deliberate dataset design. Its distinctive writing style stems from pretraining knowledge and post-training "taste" tuning via RL.

As for K3’s release? The team joked, "Stay tuned."


