Physical AI is transitioning from a conceptual framework to industrial reality. According to a recent in-depth industry report by Zheshang Securities, following the waves of Perceptual AI, Generative AI, and Agentic AI, Physical AI is poised to be the next major stage in AI evolution. Its core objective is to enable models to comprehend and predict real-world states, thereby driving profound transformations in sectors such as autonomous driving, embodied intelligence, and industrial software.
Regarding market scale, Coatue Management estimates the Physical AI market to be at least $6 trillion, approximately 50% larger than the digital AI market. At CES 2026, NVIDIA CEO Jensen Huang stated that Physical AI has the potential to reshape manufacturing and logistics industries valued at around $50 trillion. Concurrently, top-tier academics and tech giants are actively entering the field: AMI Labs, founded by Turing Award winner Yann LeCun, secured a $10.3 billion seed funding round; World Labs, co-founded by AI pioneer Fei-Fei Li, completed a new $1 billion funding round, achieving a valuation exceeding $50 billion in under two years; NVIDIA announced its next-generation Feynman chip, specifically designed for Physical AI and slated for release in 2028.
Zheshang Securities posits that Physical AI currently lacks a fixed implementation paradigm and requires the combined support of world models and VLAs (Vision-Language-Action models). Autonomous driving, embodied intelligence, and industrial software constitute the three most critical application scenarios for Physical AI, with autonomous driving expected to be the first to successfully establish both "data closed-loops" and "commercial closed-loops." The report recommends focusing on companies with world model capabilities, as well as hardware and software players within these three key application areas.
**Technical Definition: The Paradigm Shift from Generative AI to Physical AI**
The Zheshang Securities report defines Physical AI as AI systems capable of understanding the real world, which must answer two core questions: how the world will change next, and how the world will react to physical actions. Unlike Generative AI, which is confined to language understanding and content generation within the digital realm, Physical AI operates in the real physical world. Its core capabilities encompass perception, action, and control, with its value realized in tangible scenarios like industrial control, embodied intelligence, and autonomous driving.
Jensen Huang has summarized the evolution of AI technology as three successive paradigms, Perceptual AI, Generative AI, and Agentic AI, with the next stop being Physical AI: "AI that can run, reason, plan, and act."
The model capabilities of Physical AI have also evolved through three stages. The 1.0 era relied on hard-coded rules, resulting in poor scenario adaptability. The 2.0 era shifted to data-driven approaches, learning through imitation of massive datasets but lacking genuine understanding of the physical world. The current era has entered the 3.0 reasoning-driven stage, centered on world models + VLAs + reinforcement learning. This stage endows AI with environmental reasoning, causal understanding, and planning capabilities, supporting closed-loop decision-making for complex tasks.
**Core Technologies: World Models and VLAs Yet to Converge on a Unified Paradigm**
The Zheshang Securities report emphasizes that the current implementation of Physical AI relies on two core components: world models and VLAs, both of which are still in stages where technical approaches have not yet converged.
The concept of world models originates in reinforcement learning, where it refers to an AI agent building an internal representation of the external world so that it can mentally simulate action plans before executing them. The core value of world models stems from the irreversibility of the real world: according to the report, traditional simulation cannot support the repeated "decide, then observe the outcome" trial-and-error cycles that agents need. World models, by contrast, can construct virtual environments that approximate the real world, supporting AI training at lower cost and with greater safety.
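To make the idea of "mentally simulating action plans" concrete, here is a minimal Python sketch of planning inside a world model. Everything in it is an illustrative assumption rather than anything taken from the report: `toy_world_model` is a hand-written one-dimensional stand-in for a learned dynamics model, and `plan_by_imagination` uses simple random search over candidate plans, which is only one of many possible planning strategies.

```python
import random
from typing import Callable, List

State = float   # toy 1-D position (assumption for illustration)
Action = float  # toy 1-D velocity command (assumption for illustration)
Model = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned model: predicts the next state."""
    return state + 0.1 * action


def imagined_cost(model: Model, state: State, plan: List[Action], goal: State) -> float:
    """Roll a plan forward inside the model only; return final distance to the goal."""
    for action in plan:
        state = model(state, action)
    return abs(goal - state)


def plan_by_imagination(
    model: Model,
    state: State,
    goal: State,
    horizon: int = 5,
    candidates: int = 200,
    seed: int = 0,
) -> List[Action]:
    """Sample random plans, score each in imagination, and keep the best one."""
    rng = random.Random(seed)
    best_plan: List[Action] = []
    best_cost = float("inf")
    for _ in range(candidates):
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = imagined_cost(model, state, plan, goal)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan


if __name__ == "__main__":
    plan = plan_by_imagination(toy_world_model, state=0.0, goal=0.3)
    print("chosen plan:", plan)
    print("imagined distance to goal:", imagined_cost(toy_world_model, 0.0, plan, 0.3))
```

No candidate plan is ever executed in the real environment during the search, so all of the trial and error happens inside the model. Only the selected plan would be sent to a physical system.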
In a CNBC New Year interview in 2026, Google DeepMind CEO Demis Hassabis suggested that AGI might be missing one piece of the puzzle, which could very well be world models.
Currently, four main technical approaches dominate academic research on world models: observation-level generative models, strong in "realism," represented by Sora; latent space models, strong in "efficiency," represented by the JEPA series; reinforcement learning-oriented models, strong in "decision-making," represented by the Dreamer series; and object-centric models, strong in "interpretability," represented by SlotFormer. Fei-Fei Li believes that world models need to possess three capabilities: generative, multimodal, and interactive.
VLA models (Vision-Language-Action models) learn end to end, within a single unified model, to map visual input and the task semantics of language instructions to concrete actions, removing the need for manually designed rules and hand-built interfaces between modules. Since Google DeepMind released RT-2 in 2023, VLA research has entered a new phase. Stanford released OpenVLA, the first open-source 7B-parameter general-purpose VLA model for robot manipulation, in 2024. NVIDIA released GR00T N1, an open-source VLA foundation model for general-purpose humanoid robots, in 2025.
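The input and output contract of a VLA can be sketched as a single call that takes an image and an instruction and returns an action. The Python sketch below is only an interface illustration under stated assumptions: `Action`, `VLAPolicy`, `ToyPolicy`, and `run_step` are hypothetical names, not the API of RT-2, OpenVLA, or GR00T N1, and the body of `ToyPolicy.act` is a hand-written placeholder standing where a trained network would sit.

```python
from dataclasses import dataclass
from typing import List, Protocol

Image = List[List[float]]  # toy grayscale image: rows of pixel brightness


@dataclass(frozen=True)
class Action:
    """Hypothetical low-level command: planar displacement plus gripper state."""
    dx: float
    dy: float
    close_gripper: bool


class VLAPolicy(Protocol):
    """Assumed interface: one call from (image, instruction) to an action."""

    def act(self, image: Image, instruction: str) -> Action: ...


class ToyPolicy:
    """Placeholder only; a real VLA would replace this body with a learned model."""

    def act(self, image: Image, instruction: str) -> Action:
        rows, cols = len(image), len(image[0])
        # Stand-in for vision: treat the brightest pixel as the target location.
        _, r, c = max(
            (value, r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
        )
        # Displacement from the image centre to the target, in pixel units.
        dx = c - (cols - 1) / 2
        dy = r - (rows - 1) / 2
        # Stand-in for language: close the gripper only for "pick" instructions.
        return Action(dx, dy, "pick" in instruction.lower())


def run_step(policy: VLAPolicy, image: Image, instruction: str) -> Action:
    """The caller sees no separate perception, planning, or control modules."""
    return policy.act(image, instruction)


if __name__ == "__main__":
    image = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(run_step(ToyPolicy(), image, "Pick up the block"))
```

The point of the sketch is the shape of the boundary: the caller hands over raw observations and a natural-language task and receives an action, so any component satisfying `VLAPolicy` can be swapped in without changing the surrounding control loop.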
**Three Major Application Scenarios: Autonomous Driving, Embodied Intelligence, and Industrial Software**
Zheshang Securities considers autonomous driving the scenario most likely to be first to achieve both the "data closed-loop" and the "commercial closed-loop" for Physical AI. The report cites several advantages for building autonomous driving systems: vehicles worldwide drive approximately 13 trillion miles per year, multimodal real-world data can be collected on a sustained basis, commercial monetization models are clear, and the industry chain is scalable and replicable.
At the 2026 Beijing Auto Show, Physical AI emerged as an underlying theme. Among autonomous driving solution providers, Pony.ai CTO Tiancheng Lou released World Model 2.0, whose core breakthrough is giving AI self-diagnosis and directed evolution capabilities; Momenta officially launched the R7 reinforcement learning world model; QCraft announced a strategic shift, fully upgrading its focus from "unmanned driving" to "general Physical AI." Among automakers, XPeng plans to increase its 2026 Physical AI-related R&D investment to 7 billion yuan; Geely released the WAM World Action Model and announced deep collaboration with NVIDIA in the Physical AI field; Chery announced a global strategic partnership with NVIDIA, focusing on three key areas: assisted driving, cabin AI, and robotics.
Embodied intelligence is defined by Zheshang Securities as the core carrier for the "Perception—Understanding—Reasoning—Action" closed-loop of Physical AI. The evolution of the Physical AI technology stack is driving robotics from "rigid automation" towards "true autonomy." Compared to traditional robots, Physical AI-enabled robots can handle unpredictable and unknown components, reduce manual coding workload, and accelerate deployment speed.
Industrial software is positioned as the "control console" for the training, validation, deployment, and operation of Physical AI. The report argues that the non-replicable nature of industrial software data, high security and compliance requirements, and the complexity of cloud-edge-device collaboration create strong moats. The relationship between industrial software and Physical AI is symbiotic: industrial software provides the physical foundation, high-quality data, and validation environment for Physical AI; Physical AI, in turn, provides intelligent acceleration, automated decision-making, and closed-loop optimization capabilities for industrial software. Key beneficiary areas include CAE simulation, digital twins, industrial control, Industrial IoT, energy scheduling, and EDA/CAD.