A covert battle over data in the field of embodied intelligence is quietly unfolding. In January of this year, the Hubei Humanoid Robot Innovation Center delivered thousands of hours of training data to Zhiyuan Robotics, marking China's first customized humanoid robot data transaction. Among industry giants, JD.com recently announced its ambition to build the world's largest and most comprehensive embodied intelligence data collection center, planning to mobilize over 100,000 internal employees and up to 500,000 external personnel in an unprecedented "mass manpower" strategy. Overseas, South Korean robotics company Robotis established a subsidiary in Uzbekistan in January, with plans to construct a massive "data factory" on a 110,000-square-meter plot to gather robot behavior data. These initiatives—customized transactions billed by the hour, mobilizations of hundreds of thousands of people, and establishing factories in Central Asia—reflect the profound "data anxiety" gripping the entire embodied intelligence industry.
Unlike large language models trained on internet corpora, embodied intelligence requires understanding and interacting with the real world, placing higher demands on data authenticity and modality. This is one of the key challenges currently being tackled by Tang Wenbin, founder and CEO of Force Lingji. Tang is better known as the co-founder and CTO of Megvii, a star unicorn from the previous wave of AI. Within just one year of its establishment, Force Lingji has quietly raised over 1 billion yuan, securing investments from top-tier institutions like Alibaba, NIO, Legend Capital, and Qiming Venture Partners. The company has already launched its first embodied-native large model, DM0, and has formed a strategic partnership with Huaqin Technology to achieve mass production and delivery of the data collection robot DOS-W1.
Having experienced the challenges of the previous AI implementation wave, Tang Wenbin approaches the industry with increased respect. In a recent dialogue, he shared Force Lingji's data collection strategy: rather than relying on a single source, they employ a distributed collection approach based on a combination of "Quality × Quantity × Diversity" to comprehensively fill the robot's capability space. Regarding the approach of using world models to generate data for robot imitation learning, Tang believes this path is impractical. He suggests a more feasible paradigm is to unify world models with VLA (Vision-Language-Action) models, enabling not only prediction of future states but also inference of the precise actions required.
As industry players frantically "hoard" data resources in their own ways, the market is watching to see which approach will ultimately prevail.
**Detailed Explanation of Data Collection**
When asked about their data collection methodology, Tang Wenbin explained they currently use a combination of imitation learning and reinforcement learning. Imitation learning amounts to fitting the distribution of demonstration data, so the goal is for that data to sufficiently fill the robot's capability space, letting it "see" enough variety. The core value of data lies precisely in enabling the robot to handle scenarios it has not seen before. Their data collection therefore focuses on open environments and real-world scenarios, aiming for data that is high-quality while comprehensively covering this space; they view data as a combinatorial problem of "Quality × Quantity × Diversity."
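The multiplicative "Quality × Quantity × Diversity" framing can be sketched as a toy scoring function. Everything below — the factor scales, field names, and example numbers — is an illustrative assumption, not Force Lingji's actual methodology; the point is only that a multiplicative model makes a batch as valuable as its weakest factor.

```python
from dataclasses import dataclass

@dataclass
class DataBatch:
    quality: float    # 0-1: fraction of clean, well-calibrated trajectories (assumed scale)
    hours: float      # quantity: hours of demonstrations
    diversity: float  # 0-1: coverage of objects, environments, and tasks (assumed scale)

def coverage_score(batch: DataBatch) -> float:
    """Multiplicative score: a batch is only as valuable as its weakest factor."""
    return batch.quality * batch.hours * batch.diversity

# Many hours of narrow teleoperation data vs. fewer hours of diverse,
# embodiment-free data: under a multiplicative model, diversity can win.
narrow = DataBatch(quality=0.9, hours=1000, diversity=0.1)
broad = DataBatch(quality=0.7, hours=300, diversity=0.8)
assert coverage_score(broad) > coverage_score(narrow)
```

The multiplicative form also explains the distributed strategy described later in the interview: once any one factor is near zero, adding more of the others barely helps, so it can be rational to trade away some per-sample completeness for volume and variety.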
Regarding specific collection methods, Tang stated they do not depend on any single data source, deeming that unnecessary; instead they use a combined model. Real robot data is collected primarily through various calibrated sensors, including devices like exoskeletons, though he acknowledged the cost is indeed high. Simultaneously, they collect data through embodiment-free methods and first-person perspectives to form larger datasets, an approach that represents a middle ground between real robot data and synthetic data. Additionally, they utilize lower-cost internet data.
Explaining "embodiment-free" collection, Tang said it involves using devices like a glove or handheld gripper without the robot's arm or body—essentially just an end-effector. The approximate position and state of this end-effector are recorded, a method currently known as UMI. First-person perspective data, such as recording operations through glasses, is also a form of embodiment-free collection.
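A UMI-style embodiment-free setup boils down to logging the pose and state of a handheld end-effector over time, with no robot in the loop. The sketch below shows what one logged timestep might contain; the schema, field names, and units are hypothetical illustrations, not Force Lingji's or UMI's actual data format.

```python
import time
from dataclasses import dataclass

@dataclass
class EndEffectorSample:
    """One timestep of embodiment-free collection: no robot arm or body,
    just a handheld gripper whose approximate pose and state are recorded.
    Fields are illustrative assumptions, not an actual production schema."""
    t: float                 # timestamp, seconds
    position: tuple          # (x, y, z) of the end-effector, metres
    orientation: tuple       # quaternion (w, x, y, z)
    gripper_width: float     # gripper opening, metres
    wrist_rgb: bytes = b""   # optional wrist-camera frame
    head_rgb: bytes = b""    # optional first-person (glasses) frame

def record(trajectory: list, position, orientation, width) -> None:
    """Append one timestamped sample to a trajectory."""
    trajectory.append(EndEffectorSample(time.time(), position, orientation, width))

traj: list = []
record(traj, (0.10, 0.0, 0.30), (1.0, 0.0, 0.0, 0.0), 0.05)
record(traj, (0.10, 0.0, 0.25), (1.0, 0.0, 0.0, 0.0), 0.02)  # closing on an object
```

Because only the end-effector is tracked, such data must later be retargeted to a specific robot arm, which is part of why Tang describes it as a middle ground between real robot data and synthetic data.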
Addressing privacy concerns with AI glasses data, Tang acknowledged that as a user, he wouldn't want to share his data either. For training, they can hire third-party data collectors to wear glasses and record work processes, with the data being logged accordingly. They also hope glasses themselves will become more powerful, featuring stereo vision and multi-camera capabilities. Future plans may include adding wristbands or gloves for data collection. Overall, their collection targets are diverse: Category 1 is the robot itself, operable via remote control; Category 2 is embodiment-free devices like grippers, combining "human body + robot end-effector"; Category 3 targets the human body specifically; Category 4 involves descriptions of the physical world.
Regarding sensors on end-effectors, Tang clarified they seek multi-modal data, not just force, including additional perspectives. Practically, since arms might block some data, they could mount a camera near the eyes and potentially two cameras on each wrist to capture multi-view data.
On the cost of such collection, Tang described it as a complex trade-off between quality, quantity, and diversity. Collecting data for all modules would be prohibitively expensive. Therefore, they adopt a distributed collection strategy, ensuring completeness for some data types while sacrificing some completeness for others to reduce cost, increase volume, and improve speed. They have their own collection tools and collaborate extensively across industries.
Discussing the data collection robot launched with Huaqin Technology in February, Tang said it's primarily for research scenarios, similar in form to ALOHA robots, which peers are also developing. He identified two major pain points in current market offerings: reliability, where frequent failures negatively impact research efficiency, and high cost. Their improvements include simplifying repairs with a modular, detachable design for quick part replacement—e.g., using knobs instead of screws for 30-second fixes—and reducing cost through the partnership with Huaqin, designing an ALOHA-like product supporting master-slave, drag-and-teach operation. The core advantages are fast repairability and low cost.
When asked if peers purchase these robots for data collection, Tang confirmed, noting industry pain points are consistent, leading companies to buy each other's products for combined use.
**The World Model Path Is Impractical**
On the topic of world models versus VLA, Tang Wenbin distinguished between understanding the world and generating it. Current large models are often noted for their world understanding capability. World models attempt to predict the future, like the next frame, while VLA's essence is interacting with the world. These models share commonalities but solve problems from different angles. He believes the best strategy is integration, enabling true understanding, content generation, and world interaction. Theoretically, if one can predict the future world, one can infer the necessary actions. Conversely, knowing how to act implies an ability to predict outcomes. Thus, their technical framework unifies world models and VLA, aiming for a single model that both understands the world and predicts subsequent states, allowing it to execute actions and predict the resulting world changes.
Regarding differing industry frameworks, Tang acknowledged that some companies advocate using only world models, proposing that generating data via world models for robot imitation learning creates an infinite data source. However, he personally finds this path unviable because the argument is circular: if the world model were fully realized, the problem it is meant to solve would already be solved, and there would be no need to train robots on generated data. The alternative path, which he and many peers follow, is to predict the future world state and then deduce the required actions from that prediction. This paradigm, which calculates action sequences from predicted future scenes, aligns with the integrated, unified-model framework he described.
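The duality Tang describes — predicting the future world state versus inferring the action that produces it — can be shown with a deliberately trivial toy. Here the "world" is a scalar with dynamics `s_next = s + a`, so the inverse is exact; real systems learn both directions with neural networks, and nothing below reflects Force Lingji's actual architecture.

```python
# Toy illustration of the world-model / VLA duality, under the assumed
# (trivially invertible) dynamics s_next = s + a.

def forward_model(state: float, action: float) -> float:
    """World-model role: predict the next state from the current state and action."""
    return state + action

def inverse_model(state: float, target_state: float) -> float:
    """VLA role: infer the action that moves the world toward a desired state."""
    return target_state - state

s = 0.2      # current observed state
goal = 1.0   # desired future state, e.g. one the world model has predicted

a = inverse_model(s, goal)
# Acting and then predicting recovers the goal, i.e. the two directions agree:
assert forward_model(s, a) == goal
```

In this framing, "knowing how to act implies an ability to predict outcomes" is simply the statement that the forward and inverse models are consistent with each other, which is the consistency a unified model would try to enforce.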
**Application Scenarios and Challenges**
Addressing whether robots have a place in highly automated factories, Tang agreed that current automation solutions are mature. However, their goal is to address previously unsolvable problems or those with prohibitively high solution costs. Many existing automated production lines don't require high generalization across objects, environments, and tasks—e.g., with few SKUs and controlled external conditions like lighting. The unsolved challenges involve object diversity, dynamically changing environments, and varied tasks. Using logistics as an example, he noted current robots primarily handle transportation but struggle with manual operations requiring high generalization, like packaging a diverse range of items such as a soda bottle and a chip bag under varying conditions. Another example is packaging bottled shower gel, where workers manually wrap the cap with plastic film to prevent leakage—a task difficult to automate. Force Lingji is currently exploring applications in logistics and industry.
When asked if they focus on specific scenarios or multiple areas simultaneously, Tang explained a dual perspective is necessary. Observing large model development, especially recent progress, reveals a common trend: models confined to a single vertical domain cannot achieve true generalization capability. Therefore, from a model perspective, they must steadfastly pursue generalization and more universal technical capabilities. However, from an application landing perspective, they must tackle scenarios one by one. Internally, they emphasize two core points for product implementation: first, the solution must form a closed loop, addressing all customer business problems, exceptions, and process needs; second, costs must be controllable, making collaboration worthwhile for the client. Only when these conditions are met can clients consider scaling. Each scenario implementation requires clear understanding of customer value and ensuring these two points are achieved—a process of gradual, annual adoption. They describe the relationship between model development and application landing as having a 45-degree angle—related but not perfectly correlated. Ultimately, their model needs to progress towards generality.
**The Path to Generalization and Hardware Constraints**
On whether they advocate for a general-purpose robot route, Tang expressed his view that while models can be general, hardware is difficult to make universal. He highlighted the versatility of human hands, capable of both precise manipulation and lifting heavy weights (20kg or even 50kg). However, constrained by physics and materials science, a robotic arm lifting 2kg is fundamentally different from one lifting 20kg due to differences in power density. Therefore, applying a universal design to a specific scenario often results in under-engineering or over-engineering. Under-engineering might mean failing weight limits or having insufficient sensor space; over-engineering, while possibly functional, leads to prohibitively high costs. Using a wheeled dual-arm robot as an example, he noted that with a high center of gravity, it can move fast but struggles to stop without falling. In some scenarios, a stationary solution with items delivered by mobile vehicles might be better, indicating potential over-design. Their internal logic is to make the model general and adaptable to different hardware platforms.
**Investor Focus and Team Strengths**
When asked if investors primarily value their model capabilities, Tang agreed, highlighting the team's uniqueness: deep involvement in robot scenario development coupled with strong model expertise. Leveraging experience from Megvii's logistics sector, which operated at scale, they possess substantial product understanding and a team focused on model optimization.
Addressing potential weaknesses in understanding scene requirements due to their model-centric origins, Tang pointed to their extensive scenario work at Megvii, describing the team as "educated" in this regard. He sees it as a mindset issue: the robotics industry needs both technically-focused and scenario-savvy people; they position themselves in the middle. Purely technical teams often make assumptions about scenes, underestimating complexities, while real-world "devils are in the details"—like the need for robust exception handling when production cannot halt. Thus, technical personnel must respect the intricacies of scenarios. Conversely, industry personnel can oscillate between viewing AI as omnipotent and, upon encountering limitations, reverting to traditional rule-based methods out of disappointment. Current model development is in a middle, rapidly advancing stage—neither all-powerful nor useless. They crucially need individuals who can judge scenarios, understand algorithms and their pace of development, and design actionable project starting points. All their work essentially fulfills demands, acknowledging their own perspective limitations. Tang advocates broad learning and multi-angle observation but stresses the need for judgment standards to select sustainably viable scenarios.
**Target Customers and Open Source Strategy**
Regarding target customers—robot companies or end-user application scenarios—Tang clarified it's primarily the latter. Frankly, he noted, models used by peers domestically and internationally are not yet mature enough for simple training and deployment on robot company hardware. Before model maturity, vertical integration is necessary for successful application landing. Expecting partners or clients to solve scenarios if they themselves cannot is wishful thinking. Eventually, they might handle some vertical scenarios themselves while enabling more through an open platform, allowing partners to use their hardware, "brain" (model), or both to explore possibilities.
On open-sourcing their model, Tang cited two considerations: hoping more people use their framework and model to jointly explore applications and drive technology adoption, and recognizing that despite high industry enthusiasm, overall model maturity is still early-stage, making mutual exchange and progress crucial.
**Goals and Future Outlook**
Regarding the core goal of deploying 1,000 sustainably operating devices per scenario by 2026, Tang indicated sustained operation might be achievable in the second half of the year, with POC tests currently ongoing. He expressed confidence in the potential for batch deployment in their own scenarios. Achieving sustainable robot operation requires identifying fault-tolerant links; frankly, model-driven methods cannot yet achieve 100% accuracy. There must be answers for task failures—how to take over and recover failed tasks, and assessing the business impact's acceptability. After implementing fallback solutions, the system's ROI must be confirmed.
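The requirement Tang states — that since models cannot yet reach 100% accuracy, every task needs a defined takeover and recovery path — maps naturally onto a retry-then-escalate loop. The sketch below is a minimal illustration of that control flow; the function names, retry count, and stub callables are assumptions, not a description of Force Lingji's deployment system.

```python
def run_with_fallback(task, attempt, human_takeover, max_retries: int = 1) -> str:
    """Execute a task with a recovery path for model failures.
    `attempt` returns True on success; `human_takeover` is the fallback
    (e.g. teleoperation or a worker stepping in). Both are hypothetical."""
    for _ in range(max_retries + 1):
        if attempt(task):
            return "robot"          # model-driven execution succeeded
    human_takeover(task)            # fault-tolerant link: hand the task over
    return "human"

# Deterministic stub: the first attempt fails, the second succeeds.
calls = {"n": 0}
def flaky_attempt(task) -> bool:
    calls["n"] += 1
    return calls["n"] >= 2

handled_by = run_with_fallback("pack_item", flaky_attempt, lambda t: None)
assert handled_by == "robot"
```

The business question Tang raises then becomes measurable: the fraction of tasks returning `"human"` is exactly the takeover rate whose cost must be acceptable for the system's ROI to hold up.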
On ROI, Tang said clients directly ask about the payback period. Projects needing over five years are non-starters; those with a 2-3 year payback are pursued immediately. In the current B2B environment, decisions are rational, calculating efficiency gains—e.g., extending operational hours or utilizing equipment more effectively.
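The payback-period screen Tang describes is simple arithmetic: system cost divided by annual savings, checked against the thresholds from the interview (over five years is a non-starter; two to three years gets pursued). The example numbers below are hypothetical, chosen only to exercise the thresholds.

```python
def payback_years(system_cost: float, annual_saving: float) -> float:
    """Simple payback period: upfront cost divided by yearly efficiency savings."""
    return system_cost / annual_saving

def go_no_go(years: float) -> str:
    # Thresholds from the interview: >5 years is a non-starter;
    # <=3 years is pursued immediately; in between needs evaluation.
    if years > 5:
        return "non-starter"
    if years <= 3:
        return "pursue"
    return "evaluate"

# Hypothetical numbers: a 600,000-yuan cell replacing 250,000 yuan/year of labour.
y = payback_years(600_000, 250_000)
decision = go_no_go(y)   # 2.4 years -> "pursue"
```

Efficiency gains such as extended operating hours enter this calculation through `annual_saving`, which is why rational B2B buyers ask for the payback period directly.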
Teasing future model updates, Tang said this year's core focus will be on generalization.
When asked if starting an embodied intelligence model company last year was too late, Tang revealed a long-standing desire to build a general-purpose robot, previously hindered by technical immaturity. Advances in models like DeepSeek have increased his confidence.
Finally, asked for one keyword for the embodied intelligence industry in 2026, Tang offered two: "model capability enhancement" and "sustained operation in scenarios." He believes current models are early-stage but developing rapidly, necessitating efforts to improve algorithmic capabilities in object, environment, and task generalization. For scene applications, mere POCs are just a starting point; the focus must be on sustainable real-world operation, which he feels is timely for this year.