NVIDIA's CEO Jensen Huang informed analysts a year ago that the transition for customers from previous-generation AI server chips to the new Blackwell AI chips was proving "challenging" due to the significantly increased complexity of the new generation. He stated that to enhance chip performance, "adjustments were necessary across all aspects, including server chassis, system architecture, hardware configuration, and power supply systems."
In fact, for NVIDIA's core clients, deploying and scaling Blackwell servers was initially a formidable challenge. According to two NVIDIA employees serving OpenAI and other major clients, and a Meta employee involved in addressing related issues, for most of last year OpenAI, Meta Platforms, and their partnered cloud service providers struggled to build and run these systems reliably. By contrast, these sources said, clients had been able to deploy NVIDIA's previous generations of AI chips and put them into operation within weeks of receiving them.
The various difficulties encountered by NVIDIA's core clients using its Blackwell series chips, particularly the Grace Blackwell model, do not appear to have severely impacted the chip giant's business. NVIDIA remains the world's most valuable company with a market capitalization of $4.24 trillion and has largely resolved the technical issues that hindered rapid, large-scale deployment of the series by major clients.
However, if future NVIDIA chips present similar deployment challenges, competitors like Google could seize the opportunity, provided these rival manufacturers can help clients deploy chips faster and at scale to support the development of cutting-edge AI. Such issues could also cut into the profits of cloud providers struggling to deploy at scale and slow the progress of AI companies that rely on these chips to build more advanced models.
This report is based on interviews with NVIDIA and Meta employees, staff from cloud providers using NVIDIA chips, and partners involved in installing NVIDIA chips in data centers.
For clients like OpenAI and Meta, the inability to build chip clusters at the expected scale limits their capacity to train larger AI models. According to one NVIDIA employee, although clients have not publicly complained, some have privately expressed dissatisfaction to their NVIDIA contacts.
To compensate affected clients, NVIDIA provided partial refunds and discounts last year related to the Grace Blackwell chip issues, according to a cloud provider executive and an NVIDIA employee involved in the discussions.
NVIDIA and cloud provider executives stated the main problems occurred with servers linking 72 Grace Blackwell chips, a design intended to significantly boost inter-chip communication speed and enable coordinated operation within a single system. These servers can interconnect to form massive clusters providing computational power for intensive AI model training.
An NVIDIA spokesperson said the company addressed concerns about slow Grace Blackwell system deployment in 2024, stating at the time that such systems are "the most advanced computers ever built" and their deployment requires "joint engineering development with customers."
The statement added: "NVIDIA is working deeply with leading cloud providers, whose teams have become integral to our engineering system and processes; such engineering iterations are normal industry practice and expected."
Sachin Katti, an OpenAI infrastructure executive, stated that the startup's collaboration with NVIDIA is "proceeding exactly as planned to support our development roadmap. We are fully utilizing all available NVIDIA chips for model training and inference, which is enabling rapid iteration and product deployment, as evidenced by our recent model releases."
A Meta spokesperson declined to comment.
There are indications NVIDIA has learned from these deployment challenges. The company has not only optimized the existing Grace Blackwell systems but also made improvements for servers based on the next-generation Vera Rubin chips scheduled for later this year.
According to two people involved in chip design, NVIDIA introduced a more powerful, upgraded version of the Grace Blackwell chip last year to ensure better operational stability than the initial product. They stated this upgraded chip, named GB300, features improvements in thermal management, core materials, and connector quality.
A Meta employee familiar with the situation said engineers who experienced technical faults with the initial Grace Blackwell systems found the new chips much easier to interconnect. Another NVIDIA employee serving OpenAI revealed that some clients, including OpenAI, have modified outstanding Grace Blackwell chip orders to instead purchase more of this upgraded product.
Last autumn, NVIDIA informed investors that most revenue from the Blackwell series already came from the optimized Grace Blackwell servers, with mass delivery planned for this year.
Elon Musk's xAI, which relies heavily on NVIDIA chips, appears to be ahead in deploying Grace Blackwell servers. Last October, the company deployed and brought online approximately 100,000 of these chips at its data center in Memphis, though it is unclear whether this head start has yielded superior results.
NVIDIA's goal in developing Blackwell chips was clear: to help clients train AI models with far better scale and cost-efficiency compared to previous generations.
In NVIDIA's prior servers, clients could interconnect a maximum of 8 chips, with relatively slow communication speeds. The core design of the Blackwell series is to link 72 Grace Blackwell chips within a single server, reducing data transfer between different servers, thereby freeing up data center network resources to support the training and operation of larger AI models.
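To see why the size of the interconnect domain matters, consider a back-of-envelope sketch in Python. It assumes a hypothetical 720-chip training job and a simple tree-style reduction in which partial sums are combined inside each server first, so only one partial per server has to cross the slower data-center network; the cluster size and reduction scheme are illustrative assumptions, not details from NVIDIA.

```python
# Back-of-envelope sketch (illustrative assumptions, not NVIDIA figures):
# how much of a gradient synchronization must cross the slower
# server-to-server network, given a tree-style reduction that combines
# partial sums inside each server first.

def cross_server_fraction(total_chips: int, domain_size: int) -> float:
    """Fraction of pairwise combine steps that leave a server.

    A tree reduction over N chips needs N - 1 combines. Combining
    inside each domain of size s first costs N - D intra-server steps
    (where D is the number of domains); merging the D partial sums
    then costs D - 1 cross-server steps.
    """
    num_domains = total_chips // domain_size
    return (num_domains - 1) / (total_chips - 1)

# Hypothetical 720-chip training job under both interconnect domains.
for domain in (8, 72):
    frac = cross_server_fraction(720, domain)
    print(f"{domain:>2}-chip domain: {frac:.1%} of combines cross servers")
```

Under these toy numbers, the 8-chip domain sends about 12% of the combines across servers and the 72-chip domain about 1%, roughly a tenfold reduction in demand on the data-center network resources the design is meant to free up.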
According to an Oracle employee involved in building chip clusters, constructing large-scale clusters this way can also improve the quality of AI models trained on them, as the system was designed to reduce hardware failures common during training.
However, the new design introduced vulnerabilities of its own. Tightly integrating so many chips meant that a failure in a single chip could trigger a chain reaction, crashing or interrupting entire clusters comprising thousands of chips. According to three people who experienced such failures, the cost of restarting interrupted training from the last saved checkpoint ranges from thousands to millions of dollars.
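The arithmetic behind that range is simple: in synchronous training, one failed chip forces every chip to redo all work since the last checkpoint. The sketch below uses hypothetical prices and cluster sizes, not figures from the sources.

```python
# Hedged sketch of the restart bill: in synchronous training, a single
# chip failure means every chip redoes all work since the last
# checkpoint. Prices and sizes below are hypothetical, not sourced.

def restart_cost(cluster_chips: int,
                 chip_hourly_cost: float,
                 hours_since_checkpoint: float) -> float:
    """Dollar value of compute discarded when a run rolls back to its
    last saved checkpoint."""
    return cluster_chips * chip_hourly_cost * hours_since_checkpoint

# A 1,000-chip job failing 1 hour after a checkpoint, versus a
# 10,000-chip job failing 3 hours after one.
print(f"${restart_cost(1_000, 2.0, 1.0):,.0f}")   # $2,000
print(f"${restart_cost(10_000, 2.0, 3.0):,.0f}")  # $60,000
```

Scaled to the largest clusters, and repeated across frequent failures, the same arithmetic reaches the millions of dollars the sources describe.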
The rollout of NVIDIA's Grace Blackwell systems was fraught with problems from the start. In summer 2024, chip design flaws delayed mass production. A year ago, after the first Blackwell chips were delivered, server racks suffered overheating and connection failures, leading major clients including Microsoft, Amazon Web Services, Google, and Meta to cut orders and buy previous-generation chips instead.
Employees from multiple cloud providers that ordered Grace Blackwell chips stated they believed NVIDIA delivered the products before the related software and hardware were fully debugged.
A former NVIDIA executive defended this strategy, stating that the growing pains of the 72-chip Grace Blackwell server exemplify Jensen Huang's philosophy of pushing technological boundaries rather than playing it safe. Current and former NVIDIA employees believe it was unrealistic to expect NVIDIA to precisely predict chip performance in the large-scale deployment scenarios of clients like OpenAI and Meta.
There are indications that OpenAI has now achieved scaled usage of NVIDIA's 72-chip interconnected servers. On Thursday, OpenAI announced that its latest AI coding model, GPT-5.3-Codex, was developed entirely with this dedicated system, which it described as "co-designed" and as providing the training compute and supporting deployment.
Throughout last year, deployment delays caused losses for some of OpenAI's cloud service partners, according to executives from two cloud providers. These companies had invested heavily in Grace Blackwell chips expecting rapid deployment and cost recovery, as cloud providers only generate revenue once clients start using the chips.
To alleviate financial pressure, some cloud providers negotiated discount agreements with NVIDIA last year, allowing them to pay a reduced percentage based on actual usage, according to a cloud executive involved in the talks.
Additionally, NVIDIA processed refunds for some clients who returned servers, as reported by an NVIDIA employee and a staff member from a manufacturing partner.
When introducing new technology, cloud providers typically bear the costs upfront and realize revenue only once clients start using the hardware, which depresses profit margins during this phase. A document showed that in the three months ending last August, Oracle lost nearly $100 million on leasing Blackwell series chips, primarily because of a significant lag between when Oracle finished debugging the servers and delivering them and when clients like OpenAI began using them and paying rent.
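A hypothetical illustration, using invented round numbers rather than Oracle's actual accounting, shows how such a lag turns a quarter's margin negative: costs accrue from day one, while rent accrues only once the client starts using the chips.

```python
# Hypothetical illustration (not Oracle's actual accounting) of why a
# deployment lag turns a quarter's margin negative: costs accrue from
# day one, rent only accrues once the client starts using the chips.

def quarter_gross_profit(monthly_cost: float,
                         monthly_rent: float,
                         lag_months: float) -> float:
    """Gross profit over a 3-month quarter in which the first
    `lag_months` incur full cost but produce no revenue."""
    revenue_months = max(0.0, 3.0 - lag_months)
    return monthly_rent * revenue_months - monthly_cost * 3.0

# Example: $50M/month of costs, $60M/month of rent once live, 2-month lag.
profit = quarter_gross_profit(50.0, 60.0, 2.0)
print(f"quarter gross profit: {profit:,.0f} million dollars")  # -90
```

In this toy example the hardware is profitable once running, yet two idle months out of three still leave the quarter deep in the red.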
An internal presentation prepared for Oracle's cloud executives noted that the Grace Blackwell chip leasing business had a negative gross margin, mainly impacted by deployment issues at OpenAI's data center in Abilene, Texas, and delayed customer acceptance cycles.
Oracle later told investors that its AI cloud business is expected to eventually achieve 30% to 40% gross margins, accounting for the investment period before data centers become fully operational.
An Oracle spokesperson declined to comment.