Leading global technology companies are investing heavily in world models, a sign that the field is entering a period of accelerated development. Recent funding rounds of roughly $1 billion each, raised by companies founded by prominent academics Fei-Fei Li and Yann LeCun, have drawn considerable market attention. They follow earlier moves by AI leaders such as NVIDIA, Alphabet, and OpenAI, which have also identified world models as a potential pathway to next-generation intelligence. However, there is no single, unified definition of a world model. The major technical approaches, represented by OpenAI's Sora, NVIDIA, Fei-Fei Li, and Yann LeCun, each interpret the term differently. This report addresses three fundamental questions: 1) What is a world model? 2) What are the differences between the various schools of thought? 3) What are the potential application scenarios for world models?
▍Why Focus on World Models? Major Investments Signal a High-Growth Phase. NVIDIA's Cosmos world model was prominently featured again at the GTC conference on March 16, 2026, marking its second consecutive showcase at the major CES and GTC events. On February 19, Fei-Fei Li's World Labs announced the completion of its latest funding round of $1 billion, dedicated to world model R&D. On March 10, Yann LeCun's AMI secured $1.03 billion in seed funding, setting a record for Europe's AI sector. Models such as OpenAI's Sora and Alphabet's Genie are also seen as strong contenders in the world model arena. The concentrated efforts of top-tier scholars and leading tech firms, combined with the rapid iteration of multimodal model capabilities, are gradually forming an industry consensus that world models could lead the next wave of artificial intelligence development.
▍What is a World Model? No Unified Definition, Four Main Technical Paths Emerge. Four primary technical paths are shaping the mainstream exploration of world models, each with a distinct focus: 1) The Video Generation School (e.g., Sora): Views world models as pixel-level video generators capable of "free imagination." 2) The Physical AI School (e.g., NVIDIA): Sees world models as Physical AI infrastructure for generating large-scale simulated environments. 3) The Spatial Intelligence School (e.g., Fei-Fei Li): Defines world models as 3D spatial intelligence that understands the three-dimensional spatial relationships of objects. 4) The Causal Reasoning School (e.g., Yann LeCun): Treats world models as causal reasoning intelligence capable of inferring physical laws and predicting future states within an abstract representation. The four schools are approaching next-generation AI from different starting points and are expected to converge on the same ultimate goal: a world model that addresses the shortcomings of language models in visual generation, action interaction, spatial understanding, and causal reasoning.
▍What are the Differences Between the Schools? Different Perspectives, Shared Core Essence. The video generation school emphasizes pixel-level reconstruction of the world, the Physical AI school focuses on recreating real-world scenarios, the spatial intelligence school prioritizes 3D reconstruction, and the causal reasoning school stresses the reconstruction of abstract causal logic. These four directions describe the boundaries of a world model's capabilities along different dimensions; they are not mutually exclusive technical paths. As the industry develops, the technologies of these schools continue to iterate, integrate, and borrow from one another, each complementing the others' strengths. Ultimately, a world model could reduce to a unified mathematical abstraction: given the state of the world at one moment and an action, it generates the state at the next moment. For example, given a video frame at time T and the action of a robot within it, the model predicts the frame at time T+1. A language model generates the next token from historical tokens; a world model differs in two key ways: its "tokens" are video frames rather than text, and actions are introduced as an additional input. At its core, a world model emphasizes how an agent can change the world.
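The abstraction above (current state plus action yields next state) can be sketched in a few lines of Python. This is only a toy illustration, not any school's actual implementation: the state is assumed to be an (x, y) grid position standing in for a video frame, and the names `world_model`, `rollout`, and `MOVES` are hypothetical. A real world model would replace the hand-written rule with a learned network.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]  # hypothetical stand-in for a video frame: an (x, y) position
Action = str             # hypothetical action vocabulary: "up", "down", "left", "right"

# Assumed toy dynamics: each action shifts the position by one grid cell.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def world_model(state: State, action: Action) -> State:
    """Predict the state at time T+1 from the state at time T and an action."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def rollout(
    model: Callable[[State, Action], State],
    state: State,
    actions: List[Action],
) -> List[State]:
    """Apply the model step by step to imagine a whole trajectory."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


print(rollout(world_model, (0, 0), ["right", "right", "up"]))
# prints [(0, 0), (1, 0), (2, 0), (2, 1)]
```

The contrast with a language model shows up in the signature: next-token prediction conditions only on history, while `world_model` also takes the agent's action as an input.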
▍What are the Application Scenarios? Video Generation, Interactive Gaming, Design, XR/AR, and Physical AI. Currently, world models have five main application categories: video generation, interactive gaming, interactive design, XR/AR, and Physical AI. By maturity, they fall into three tiers: 1) Commercially Available Products (Initial Stage): Primarily video generation, where users input text or images and receive AI-generated videos. These products are already in commercial use in scenarios such as short videos, advertising, film/TV, short dramas, webtoons, e-commerce, and data augmentation for Physical AI. 2) Lab Demo-Level Products: Primarily interactive video generation, where users can perform actions that alter the state of the video. These are expected to find early applications in gaming, design, and XR/AR. 3) Laboratory R&D Directions: World models that guide Physical AI actions by simulating the consequences of candidate behaviors, paving the way for large-scale Physical AI deployment.
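The third tier, guiding actions by simulating their consequences, can be illustrated with a minimal planning sketch. Everything here is an assumption made for illustration: the world is a one-dimensional line, `toy_model` is a hand-written stand-in for a learned world model, and `plan` uses brute-force search over short action sequences, which is only one of many possible planning strategies.

```python
from itertools import product
from typing import Tuple

ACTIONS = (-1, 0, 1)  # hypothetical 1D actions: step left, stay, step right


def toy_model(position: int, action: int) -> int:
    """Hypothetical stand-in for a learned world model: predicts the next position."""
    return position + action


def plan(start: int, goal: int, horizon: int) -> Tuple[int, ...]:
    """Simulate every action sequence of length `horizon` inside the model
    and return the one whose predicted final position is closest to the goal."""
    best_plan: Tuple[int, ...] = ()
    best_cost = float("inf")
    for candidate in product(ACTIONS, repeat=horizon):
        position = start
        for action in candidate:
            position = toy_model(position, action)  # imagined step, no real-world action
        cost = abs(goal - position)
        if cost < best_cost:
            best_plan, best_cost = candidate, cost
    return best_plan


chosen = plan(start=0, goal=2, horizon=3)
print(chosen, "-> predicted final position:", sum(chosen))
```

The point of the sketch is that every candidate behavior is evaluated inside the model before anything is executed, which is what makes simulation attractive for physical systems where real-world trial and error is slow or risky.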
▍Risk Factors: The pace of macroeconomic recovery may fall short of expectations. Relevant industrial policies may not meet expectations. Core technology and product R&D progress at companies may lag. The adoption speed of AI applications may be slower than anticipated. Capital expenditure by cloud service providers may be lower than expected. Government and corporate IT spending may not meet forecasts. AI competition may intensify.
▍Investment Strategy: The accelerating iteration of video generation models and the continued spillover of their technical capabilities are driving the development of world models, giving the industry an opportunity for technological upgrades. With the recent cluster of large overseas funding rounds acting as a further catalyst, the world model sector is poised to benefit from both model iteration and rising valuations.