DOBOT has launched its self-developed world action model, the DobotWAM Embodied Large Model.
The model achieved an average success rate of 99.25% across four standard task suites on the embodied intelligence benchmark LIBERO, covering spatial relation understanding, object generalization, goal instruction comprehension, and long-horizon task execution.
This result places it ahead of publicly available models such as π0.5, π0, GR00T-N1.5, and π0+FAST, as well as other industry models with published results.
Specifically, the DobotWAM model achieved a perfect 100/100 success rate on the LIBERO-Object suite and scored 99/100 on each of the LIBERO-Spatial, LIBERO-Goal, and LIBERO-10 suites, which together average 99.25%.
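The headline figure follows directly from the per-suite scores. As a quick check, the short Python snippet below takes the unweighted mean of the four reported success rates; the variable and function names are illustrative, and equal weighting of the suites is an assumption consistent with the reported 99.25%.

```python
# Per-suite success rates (percent) as reported in the article.
suite_success = {
    "LIBERO-Spatial": 99.0,
    "LIBERO-Object": 100.0,
    "LIBERO-Goal": 99.0,
    "LIBERO-10": 99.0,
}


def average_success(scores):
    """Unweighted mean of the per-suite success rates (assumes equal weighting)."""
    return sum(scores.values()) / len(scores)


average = average_success(suite_success)
print(f"{average:.2f}%")  # 99.25%
```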
The real challenge for robots in practical applications is no longer just "recognizing objects" but understanding spatial relationships in dynamic, open environments, decomposing task goals, generating kinematically feasible actions, and maintaining global consistency throughout multi-step executions.
While vision-language-action (VLA) models have become the mainstream paradigm for action generation in embodied AI, they often struggle with spatial perturbations, object variations, long-horizon tasks, and real contact feedback, leading to action drift, goal loss, or steps that succeed locally while the overall task fails.
The high success rate of the DobotWAM model stems from its systematic design across perception, understanding, control, and closed-loop data collection.
Building on vision-language-action modeling, it incorporates 3D spatial understanding, robot kinematic constraints, and a closed-loop mechanism for real-world data, enabling the robot to not just "imitate actions" but "understand why actions are performed that way."
Its core technical breakthroughs encompass four areas.
The first is 3D-Aware Spatial Representation, which integrates 3D spatial information into the model, allowing it to explicitly perceive object positions and geometric structures beyond 2D image features for stronger generalization.
The second is Joint Dynamic Geometry Loss, which incorporates robot joint dynamics and end-effector geometric constraints into the training loss, helping the model understand real action structures to reduce trajectory drift and improve stability in long-horizon tasks.
The third is Advanced VLM Task Decomposition, which uses an advanced vision-language model (VLM) backbone to semantically understand complex instructions and decompose them into clearer sub-goals and executable steps, guarding against cases where individual steps succeed but the overall task fails.
The fourth is a High-Quality Data Flywheel with Real-Robot Recap, a closed-loop system centered on real-robot experiments that continuously collects data, retrains and evaluates the model, and incorporates feedback from success, failure, and edge-case scenarios to improve sim-to-real transfer.
These four coupled technologies enable the DobotWAM model to more stably complete multi-object, multi-stage, long-horizon robotic manipulation tasks, providing a reusable systemic framework for the large-scale deployment of embodied intelligence.
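To make the idea behind the Joint Dynamic Geometry Loss more concrete, here is a minimal Python sketch of a loss that combines a joint-space imitation term with an end-effector geometry term and a joint-velocity term. The article does not publish the actual formulation, so everything here is an assumption for illustration: the planar 2-link arm, the link lengths, the weights `w_geo` and `w_dyn`, and the function names are all hypothetical and are not DOBOT's implementation.

```python
import math

# Hypothetical planar 2-link arm (link lengths in metres); not DOBOT hardware.
LINK_LENGTHS = (0.30, 0.25)


def forward_kinematics(q):
    """Return the end-effector (x, y) for joint angles q = (q1, q2) in radians."""
    l1, l2 = LINK_LENGTHS
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y


def mean(values):
    """Arithmetic mean; 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def joint_velocities(traj):
    """Finite-difference joint velocities between consecutive timesteps."""
    return [tuple(b - a for a, b in zip(q0, q1)) for q0, q1 in zip(traj, traj[1:])]


def joint_geometry_loss(pred, target, w_geo=1.0, w_dyn=0.1):
    """Illustrative loss over two joint trajectories of equal length."""
    # Imitation term: mean squared joint-angle error.
    joint_term = mean(
        (p - t) ** 2 for qp, qt in zip(pred, target) for p, t in zip(qp, qt)
    )
    # Geometry term: mean squared end-effector position error via forward kinematics.
    geo_errors = []
    for qp, qt in zip(pred, target):
        (px, py), (tx, ty) = forward_kinematics(qp), forward_kinematics(qt)
        geo_errors.append((px - tx) ** 2 + (py - ty) ** 2)
    geo_term = mean(geo_errors)
    # Dynamics term: mean squared error between finite-difference joint velocities.
    dyn_term = mean(
        (vp - vt) ** 2
        for dp, dt in zip(joint_velocities(pred), joint_velocities(target))
        for vp, vt in zip(dp, dt)
    )
    return joint_term + w_geo * geo_term + w_dyn * dyn_term


target = [(0.0, 0.5), (0.1, 0.6), (0.2, 0.7)]
drifting = [(0.0, 0.5), (0.15, 0.6), (0.35, 0.7)]
print(joint_geometry_loss(target, target))          # 0.0
print(joint_geometry_loss(drifting, target) > 0.0)  # True
```

In this sketch, a trajectory whose first joint drifts further from the reference at each step is penalized three times: in joint space, in end-effector position, and in velocity mismatch. This is one plausible way that kinematic terms in a training loss could discourage the accumulating drift described for long-horizon tasks.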