Embodied AI Models Serve as the Intelligent Core for Humanoid Robots, with Data Flywheel Driving Iterative Advancement

Stock News · 04-16

Industrial Securities Co., Ltd. has released a research report stating that embodied AI models act as the "brain" of humanoid robots, empowering them across the "perception-cognition-control" loop. These models emphasize interaction with the physical world and require multimodal perception, autonomous decision-making, real-time interactive execution, generalization, and versatility. Leading manufacturers currently employ diverse data-collection and training methodologies. Robots gather data on their external environment and internal state via sensors, providing crucial support for the decision-making of embodied AI models. The report suggests focusing on companies involved in humanoid robot sensors and those with expertise in motion capture solutions. Key viewpoints from Industrial Securities are outlined below.

Embodied AI models function as the "brain" of humanoid robots, governing the interactive loop of "perception-cognition-control." Traditional large models handle tasks within single or limited modalities and cannot interact directly with the physical world. In contrast, embodied AI models empower robots by emphasizing physical-world interaction, which demands multimodal perception, autonomous decision-making, real-time execution, and generalization. The primary barrier to large-scale adoption of humanoid robots may currently be not hardware limitations but bottlenecks in AI model development: robotic limb technology is relatively mature, while large models lag significantly behind the hardware. Current embodied models already possess cognitive, reasoning, and planning abilities; their main shortcomings are an inability to reliably handle the uncertainties of complex physical environments and weak generalization.

Mainstream frameworks for embodied AI models are divided into hierarchical and end-to-end approaches, with no consensus yet on a definitive path. Traditional decision-making uses a layered architecture involving perception and interaction, high-level planning, low-level execution, and feedback/reinforcement. This "cerebrum-cerebellum" layering can facilitate the practical deployment of humanoid robots. However, the hierarchical paradigm suffers from error accumulation and performs poorly in generalizing across diverse tasks. End-to-end frameworks directly output specific robot execution commands based on perceived environment and robot state, integrating perception, language understanding, planning, action execution, and feedback optimization into a unified structure. This offers high integration and stronger generalization, with Vision-Language-Action (VLA) models being central to end-to-end decision-making.
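The contrast between the two paradigms can be shown in a minimal sketch. All function names, the subgoal format, and the 7-DoF command shape below are illustrative assumptions, not any vendor's actual API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image_features: List[float]   # stand-in for camera perception
    instruction: str              # natural-language task description

def high_level_plan(instruction: str) -> List[str]:
    # Hypothetical "cerebrum" planner: decompose a task into subgoals.
    return [step.strip() for step in instruction.split(",")]

def low_level_controller(subgoal: str, obs: Observation) -> List[float]:
    # Hypothetical "cerebellum" controller: subgoal + perception -> motor command.
    return [0.0] * 7  # placeholder 7-DoF arm command

def hierarchical_policy(obs: Observation) -> List[float]:
    # Hierarchical: explicit stages; each hand-off can accumulate error.
    plan = high_level_plan(obs.instruction)
    next_subgoal = plan[0]
    return low_level_controller(next_subgoal, obs)

def vla_model(image_features: List[float], instruction: str) -> List[float]:
    # Stand-in for a Vision-Language-Action network.
    return [0.0] * 7

def end_to_end_policy(obs: Observation) -> List[float]:
    # End-to-end: one model maps perception + language directly to an
    # action vector, with no intermediate symbolic plan.
    return vla_model(obs.image_features, obs.instruction)
```

The sketch makes the report's trade-off concrete: the hierarchical path exposes an inspectable plan but chains multiple failure points, while the end-to-end path is a single learned mapping whose behavior is harder to decompose.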

Overseas embodied AI models include typical pure end-to-end architectures such as Google DeepMind's RT-2 and Tesla's FSD. RT-2 maps visual and language information directly to robot actions via an end-to-end neural network, and Tesla's Optimus can reuse the technology stack of the automotive FSD system to achieve multimodal input and real-time action output. Representative hierarchical embodied models include Figure AI's Helix, NVIDIA's GR00T, and Physical Intelligence's π0. Helix employs a dual-system architecture of "System 1 (fast thinking) + System 2 (slow thinking)." GR00T also uses a dual-system approach and applies flow-matching techniques to generate actions. π0 is a VLA model built on a "pre-trained VLM + action expert module."
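The "fast/slow" dual-system idea shared by Helix and GR00T can be sketched as two loops running at different rates: a slow language/vision module refreshes a latent task embedding, and a fast reactive policy consumes the latest embedding on every control tick. The rates, function bodies, and placeholder values below are assumptions for illustration only:

```python
from typing import List

SLOW_HZ, FAST_HZ = 10, 200  # assumed update rates, not published figures

def system2(instruction: str) -> List[float]:
    # Hypothetical slow "System 2": encode the instruction into a
    # latent task embedding (placeholder computation).
    return [float(len(instruction))]

def system1(latent: List[float], proprio: List[float]) -> List[float]:
    # Hypothetical fast "System 1": reactive visuomotor policy that
    # conditions on the most recent latent (placeholder computation).
    return [latent[0] * 0.0 + p for p in proprio]

def control_loop(instruction: str, steps: int) -> int:
    # Run `steps` fast ticks, refreshing the latent at the slow rate.
    latent = system2(instruction)
    slow_every = FAST_HZ // SLOW_HZ  # refresh latent every 20 fast ticks
    commands_issued = 0
    for t in range(steps):
        if t % slow_every == 0:
            latent = system2(instruction)  # slow refresh
        _cmd = system1(latent, proprio=[0.0] * 7)
        commands_issued += 1
    return commands_issued
```

The design point is that the expensive reasoning module need not run at control frequency; the fast policy keeps the robot responsive between slow updates.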

Domestic embodied AI model architectures continue to innovate, with capabilities benchmarked against international counterparts. Notable models include AgiBot's GO-1, Robot Era's ERA-42, Galaxy General's GraspVLA, Lingchu AI's Psi R1, and ByteDance's Seed GR-3. GO-1 pioneered the ViLLA architecture, using a "VLM + Mixture of Experts (MoE)" approach. ERA-42 is billed as China's first genuine end-to-end native robot large model. GraspVLA integrates a VLM with action experts and is presented as the world's first foundational grasping model driven by large-scale synthetic data. Psi R1 adopts a fast-slow dual-brain architecture. GR-3 uses a hybrid transformer architecture with 4 billion parameters and demonstrates generalization in pick-and-place tasks that surpasses π0.

Data is the key driver for the iterative upgrade of embodied AI models. The mainstream training approach currently combines data from physical robots, simulations, and videos. As embodied intelligence shifts towards end-to-end models, data requirements are evolving from small-volume, single-modal data to massive, multimodal, high-precision, long-horizon, cross-task data. Physical robot data holds the highest value, is the most challenging to acquire, and is a reliable data source for deploying embodied intelligence. Current methods for real data collection primarily include VR teleoperation, master-slave control of robotic arms, and data glove teleoperation. Leading manufacturers employ diverse data collection and training strategies; Tesla's data collection approach may shift towards video learning, while Galaxy General primarily uses physical simulation data supplemented by real data.
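The mixed-source training recipe described above can be sketched as weighted sampling over the three data sources the report names; the weights and source labels are illustrative assumptions, not figures from the report:

```python
import random
from typing import Dict

# Assumed mixing weights reflecting the report's ranking of real-robot
# data as the most valuable source; actual ratios vary by manufacturer.
SOURCE_WEIGHTS = {"real_teleop": 0.5, "simulation": 0.3, "video": 0.2}

def sample_source(rng: random.Random) -> str:
    # Draw one data source according to the mixing weights.
    r = rng.random()
    cumulative = 0.0
    for name, weight in SOURCE_WEIGHTS.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against floating-point round-off

def make_batch(rng: random.Random, size: int) -> Dict[str, int]:
    # Count how many samples each source contributes to one batch.
    counts = {name: 0 for name in SOURCE_WEIGHTS}
    for _ in range(size):
        counts[sample_source(rng)] += 1
    return counts
```

Keeping the weights explicit makes it easy to shift the mix as a manufacturer's strategy evolves, e.g. toward video learning or toward simulation-heavy pipelines.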

Investment recommendations are as follows. Robots rely on sensors to acquire external and internal state data, which supports decision-making for embodied models. It is advised to monitor companies related to humanoid robot sensors. Motion capture solutions are a key source of high-quality movement data. It is suggested to focus on companies that possess expertise in motion capture solutions.

Risk factors include humanoid robot mass production progress falling short of expectations, slower-than-anticipated advancements in large model technology, and training data scale and quality not meeting expectations.

