The embodied intelligence and humanoid robotics industry in China is undergoing a critical evolution in its technical architecture. The accelerating convergence of Vision-Language-Action (VLA) models with World Models is propelling the sector toward commercial reality. However, the scarcity of high-quality, real-world data remains the core bottleneck constraining large-scale deployment.
According to a report by Goldman Sachs analysts including Jacqueline Du, written after visits to 14 Chinese robotics companies, industry discussions have moved beyond a single VLA framework toward a multi-modal AI stack focused on execution. Model sizes are rapidly scaling into the range of 40 to 80 billion parameters. There is a growing industry consensus that establishing a scalable, human-centric data acquisition architecture is the decisive factor for achieving technological breakthroughs.
For investors, this technological evolution is directly reshaping commercial expectations. The report notes that while application scenarios are expanding into industrial and logistics fields, the vast majority of projects remain at the proof-of-concept (POC) stage. The market widely anticipates that large-scale commercial deployment will not materialize until 2027 to 2029, once deployable models are established and tens of millions of hours of high-quality data have been accumulated.
Despite near-term challenges, Goldman Sachs maintains a highly optimistic long-term investment outlook for the sector. Progress in multi-modal AI stacks and the establishment of sophisticated data collection systems indicate the industry is nearing widespread application. However, Goldman Sachs advises investors to remain patient, as maintaining quality stability and achieving continuous cost reduction will be core milestones for companies navigating the complex transition from POC to mass commercialization.
**VLA and World Model Convergence: Technical Pathways Accelerate Alignment**
The Goldman Sachs report highlights a significant shift in industry consensus on embodied intelligence model architecture. Companies are no longer confined to traditional, singular VLA models but are rapidly pivoting to the integration of VLA or Vision-Tactile-Language-Action (VTLA) models with World Models. In this new architecture, the World Model no longer exists as a standalone category but functions as a layer alongside the action model, enhancing real-world planning capabilities and robustness by predicting the next state and verifying actions before execution.
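The predict-then-verify role described above can be illustrated with a minimal Python sketch. Everything in it is a toy assumption made for illustration and none of it comes from the report or any named company: the 2-D state, the `predict_next_state` stand-in for a world model, the `is_safe` workspace check, and the hard-coded candidate actions that a VLA/VTLA action model would normally propose.

```python
from typing import Optional, Sequence, Tuple

State = Tuple[float, float]   # hypothetical 2-D gripper position
Action = Tuple[float, float]  # hypothetical displacement command


def predict_next_state(state: State, action: Action) -> State:
    """Toy world model: assumes the action is applied exactly as commanded."""
    return (state[0] + action[0], state[1] + action[1])


def is_safe(state: State, limit: float = 1.0) -> bool:
    """Toy verifier: the predicted state must stay inside a square workspace."""
    return abs(state[0]) <= limit and abs(state[1]) <= limit


def select_action(state: State, candidates: Sequence[Action]) -> Optional[Action]:
    """Return the first candidate whose predicted outcome passes verification."""
    for action in candidates:
        if is_safe(predict_next_state(state, action)):
            return action
    return None  # nothing verified; the caller should stop or replan


# Candidates would come from the action model; here they are hard-coded.
chosen = select_action((0.8, 0.0), [(0.5, 0.0), (0.1, 0.2)])
print(chosen)  # (0.1, 0.2): the first candidate would leave the workspace
```

The design point is that the action model proposes and the world-model layer screens: an action is executed only if its predicted next state passes the check, which is one plausible reading of "verifying actions before execution."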
Companies such as Galaxea, Galbot, Spirit AI, and One Robotics have all clearly identified the combination of VLA/VTLA with World Models as their next development direction. Consequently, model training scale is climbing from the single-digit billions of parameters into the range of 40 to 80 billion. Multiple industry sources emphasized to Goldman Sachs that these multi-modal stacks will require several more iterations before reaching deployable, quality-consistent standards.
Furthermore, companies like PaXini have underscored the importance of tactile feedback (VTLA) in physical interaction, planning to launch models dominated by tactile sensing to address the shortcomings of purely visual approaches in force-control tasks.
**The Data Bottleneck: From "Recipe Debates" to Acquisition Architecture**
High-quality, multi-dimensional real-world data remains the primary bottleneck hindering practical deployment. Goldman Sachs observes that the industry's focus has shifted from broad debates over "data recipes" to how to build scalable architectures capable of reliably producing high-quality data.
In data collection, human-centric and egocentric acquisition methods (such as Universal Manipulation Interfaces and first-person wearable devices) are increasingly preferred, especially when companies need to preserve natural motion, capture rich contact interactions, and enable cross-platform transfer. Investment strategies are diverging along two paths. Some companies favor building centralized data factories with government support; PaXini, for example, currently operates five such factories nationwide. Others, such as Galaxea, Spirit AI, and One Robotics, lean toward constructing distributed deployment loops through deployed systems, VR, and client-side collection.
Data itself is becoming a significant revenue stream. The report notes that several companies anticipate data-related income will constitute a markedly larger share of total revenue by 2026. Among them, UBTech expects strong, sustained government demand for data factories, which will support its revenue growth and data accumulation.
**Commercialization Progress: Focus on Industry and Logistics, Pragmatic Scaling**
Currently, the commercialization of humanoid robots is gradually extending to industrial handling, logistics workflows, and some structured commercial scenarios. According to the Goldman Sachs report, near-term core opportunities are concentrated in standardized or semi-structured processes like sorting, material handling, pick-and-place, and inspection.
Adoption in the industrial sector follows a strictly phased path. Companies typically go through a 3-to-6-month POC phase (averaging 2 to 3 iterations), followed by small-batch testing of fewer than 50 units per batch. After a validation period of approximately 12 months, they can enter a pilot deployment phase involving about 50 to 100 units per customer. In logistics, companies like Geek+ emphasize a "scenario-first" philosophy: they decompose complex tasks into sub-tasks with clear boundaries and, at the current stage, prioritize reliability over generality.
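As a rough worked example, the phased industrial path above can be encoded as data and summed to estimate how long a customer takes to reach a pilot. The stage figures are the ones quoted above; the `Stage` structure and `months_until` helper are hypothetical, and the sketch assumes the roughly 12-month validation period covers small-batch testing and runs after the POC rather than overlapping it, which the report does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    months: Optional[Tuple[int, int]]  # (min, max) duration; None if not stated
    units: Optional[Tuple[int, int]]   # (min, max) units; None if not stated


# Figures as quoted in the article; how they combine is an assumption.
STAGES = [
    Stage("POC", months=(3, 6), units=None),
    Stage("small-batch validation", months=(12, 12), units=(1, 49)),  # per batch
    Stage("pilot deployment", months=None, units=(50, 100)),  # per customer
]


def months_until(stage_name: str) -> Tuple[int, int]:
    """Sum the (min, max) durations of every stage before `stage_name`."""
    low = high = 0
    for stage in STAGES:
        if stage.name == stage_name:
            return (low, high)
        if stage.months is None:
            raise ValueError(f"duration of {stage.name!r} is not stated")
        low += stage.months[0]
        high += stage.months[1]
    raise KeyError(stage_name)


print(months_until("pilot deployment"))  # (15, 18)
```

Under these assumptions a customer reaches pilot scale roughly 15 to 18 months after the POC begins. That helps explain why projects starting now would be consistent with the 2027 to 2029 window cited for large-scale deployment.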
Regarding hardware form factors and cost-reduction paths, the market exhibits a strong pragmatic tendency. Goldman Sachs points out that, due to model limitations and cost considerations, many manufacturers currently prefer a "wheeled base with a two- or three-finger gripper" configuration. This form factor is sufficient to cover 70% to 90% of industrial application scenarios, while the ultimate form of a bipedal robot with a five-finger dexterous hand will take more time to materialize.
**Cost Reduction Paths: Scale Effects Dominate, Hardware Forms Trend Pragmatic**
In the competitive market, cost reduction primarily relies on economies of scale and companies' customized choices regarding architecture, components, and deployment forms. For full-size humanoid robot players, full-stack in-house R&D control remains the most common method for cost management.
Across the industrial chain, companies are establishing their competitive advantages. Linkerbot states it holds a leading share in the global high-degree-of-freedom dexterous hand market and has achieved significantly lower pricing than overseas competitors through self-developed joint modules. Mech-Mind focuses on 3D vision systems for industrial scenes, with its core clientele concentrated in automotive and battery manufacturing. In the traditional industrial robot field, Estun Automation's management emphasizes that the company's strategy has shifted significantly. Rather than merely pursuing market share and shipment volume, it now prioritizes product portfolio enhancement, profitability, and growth quality in order to navigate increasingly intense domestic price competition.