Sun Wukong truly became a headache for the celestial court after obtaining the Golden Cudgel, a weapon bound to its owner that obeys his every whim and made him unstoppable. On March 17, DingTalk launched an AI platform named "Wukong." It can take over your browser, search for information on your behalf, and operate your computer while you are away: it has hands and feet and can actually execute tasks. Meanwhile, Alibaba's recently released Qwen3.5-Omni, an omni-modal model that can process video and audio and break audiovisual content down into structured data ready for immediate use, closely resembles Sun Wukong's Golden Cudgel. For now, the monkey and the cudgel are not yet fully integrated. But once they are combined, the system will be extremely powerful.
**I. What Tasks Can Wukong Perform?** DingTalk's Wukong is a capable yet rule-abiding enterprise-level "lobster."
(1) Cross-Platform Price Comparison with One Command. I instructed it to search for "DJI Osmo Pocket 3" on Taobao, JD.com, and Pinduoduo, compare prices and sales volumes, take screenshots, and compile the data into an Excel file. It took over my browser: it opened Taobao, entered the keywords, scrolled through the results, and saved screenshots, then repeated the process on JD.com and on Pinduoduo. After it finished searching all three platforms, an Excel file appeared on my desktop. The top five cheapest and highest-selling products were listed by platform, store, price, and link, with the lowest price highlighted in red. It wasn't just "telling" me which was cheaper; it was actively "doing" the comparison, the screenshots, and the tabulation for me. The entire process required only a single command from me. There were some rough edges: you need to be logged in to each platform beforehand, otherwise captchas can block it.
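The tabulation step at the end of this workflow can be sketched in a few lines of Python. This is only an illustration of the ranking logic, not a description of how Wukong works internally. The `Listing` fields, the `top_cheapest` and `lowest_overall` helpers, and every value in `SAMPLE_LISTINGS` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    # Hypothetical record for one search result; field names are assumptions.
    platform: str
    store: str
    price: float
    sales: int
    link: str


def top_cheapest(listings, n=5):
    """Return up to n cheapest listings per platform, cheapest first."""
    by_platform = {}
    for item in listings:
        by_platform.setdefault(item.platform, []).append(item)
    return {
        platform: sorted(items, key=lambda x: x.price)[:n]
        for platform, items in by_platform.items()
    }


def lowest_overall(listings):
    """Return the single cheapest listing (the row to highlight)."""
    return min(listings, key=lambda x: x.price)


# Made-up sample data, for illustration only.
SAMPLE_LISTINGS = [
    Listing("Taobao", "Store A", 3299.0, 1200, "https://example.com/a"),
    Listing("Taobao", "Store B", 3199.0, 800, "https://example.com/b"),
    Listing("JD.com", "Store C", 3249.0, 2500, "https://example.com/c"),
    Listing("Pinduoduo", "Store D", 3099.0, 400, "https://example.com/d"),
]

if __name__ == "__main__":
    for platform, items in top_cheapest(SAMPLE_LISTINGS).items():
        print(platform, [(x.store, x.price) for x in items])
    print("Lowest:", lowest_overall(SAMPLE_LISTINGS).store)
```

Ranking by sales volume would follow the same pattern with `key=lambda x: x.sales` and `reverse=True`.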
(2) Content Radar. The second practical application doesn't happen in front of a computer. I sent a message to Wukong via the DingTalk mobile app: "Set a daily scheduled task for 9 AM to automatically open my computer browser, search for the latest AI developments to build an AI-related topic, extract three summaries with source links, and send them to my phone." Wukong invoked the relevant Skill and automatically created the task. The next day, just after 9 AM, my phone received a neatly formatted morning briefing with clickable links.
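The scheduling half of such a task comes down to working out when the next 9 AM run is due. Below is a minimal sketch of that calculation. The `next_run` function is a hypothetical helper for illustration and says nothing about how Wukong's Skill actually schedules tasks.

```python
from datetime import datetime, time, timedelta


def next_run(now: datetime, at: time = time(9, 0)) -> datetime:
    """Return the next moment a daily task set for `at` should fire.

    If today's slot has already passed (or is exactly now), the task
    is pushed to the same time tomorrow.
    """
    candidate = datetime.combine(now.date(), at, tzinfo=now.tzinfo)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    # Arbitrary example timestamps, not taken from the test described above.
    print(next_run(datetime(2025, 3, 18, 8, 0)))
    print(next_run(datetime(2025, 3, 18, 21, 30)))
```

Passing `now.tzinfo` through keeps the comparison valid for both naive and timezone-aware timestamps.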
(3) Customer Acquisition and Website Building. I also tested Wukong on a website-building task, selecting skills from the official skill market. It generated a functional website with complete source code. The aesthetics need refinement, but its ability to go from zero to one is evident. Marketing departments could use it for scheduled competitor monitoring, and an "Animation Master" skill can generate complete data animation videos from a single command.
The launch event featured even more ambitious demonstrations. A car repair shop manager told Wukong, "Help me attract 100 customers." The AI autonomously completed the entire process: analyzing competitors, studying popular content, posting on social media, and guiding the comments. If these scenarios can run stably in daily operations, it indicates that AI is evolving from "executing commands" to "completing the job for you."
Beyond the highlights, there are inevitable early-stage instabilities. Official data cited one case where a user reported consuming approximately 270 million tokens to create a single PowerPoint presentation. As AI moves from dialogue to execution (operating on files, revising repeatedly, making cross-system calls), token consumption increases by orders of magnitude. DingTalk claims that Wukong's RealDoc file system improves token efficiency fivefold, which is the right direction. However, cost-conscious SMEs will likely need more stable systems and higher-quality skills before the ROI calculation becomes clear and viable.
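To get a feel for the scale, here is a back-of-the-envelope calculation. It rests on two loud assumptions: every token is billed at a flat ¥0.8 per million (the input price this article quotes for Qwen3.5-Omni, which is not necessarily what Wukong users pay), and a "fivefold" efficiency gain means the same job needs one fifth of the tokens. The `token_cost_yuan` helper is hypothetical.

```python
def token_cost_yuan(tokens: int, yuan_per_million: float) -> float:
    """Cost in yuan for `tokens` at a flat per-million-token price."""
    return tokens / 1_000_000 * yuan_per_million


TOKENS_USED = 270_000_000    # figure reported for one PowerPoint deck
ASSUMED_PRICE = 0.8          # yuan per million tokens (assumption)
EFFICIENCY_GAIN = 5          # claimed RealDoc improvement, read as 1/5 the tokens

baseline = token_cost_yuan(TOKENS_USED, ASSUMED_PRICE)
improved = token_cost_yuan(TOKENS_USED // EFFICIENCY_GAIN, ASSUMED_PRICE)

if __name__ == "__main__":
    print(f"baseline: about {baseline:.0f} yuan")
    print(f"with fivefold efficiency: about {improved:.0f} yuan")
```

Under these assumptions the deck costs roughly ¥216, falling to roughly ¥43 with the claimed gain. Real bills depend on the model used and on how input and output tokens are priced, so treat this as an order-of-magnitude check only.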
**II. What Does the Golden Cudgel Look Like?** Wukong has hands and feet but currently lacks eyes and ears. It can operate browsers, read documents, and execute cross-device tasks, but it cannot yet understand what happens in a video or discern who is speaking, and in what tone, in an audio recording. Many have experienced this: a two-hour meeting recording sits unused in cloud storage because reviewing it is nearly as time-consuming as holding the meeting again. You come across a popular product-promotion video whose conversion logic seems worth studying, but there is no time for frame-by-frame analysis. English podcasts and customer-service recordings in dialect are listened to once and then forgotten. Vast amounts of valuable audiovisual content are consumed but never put to use.
Alibaba's newly released Qwen3.5-Omni aims to transform this "seen and forgotten" content into "disassembled and usable" data. Our tests involved breaking down a popular TikTok product-promotion video. Given a Yiwu recruitment sales video as input, the model provided a structured breakdown across seven dimensions: hook, selling-point ranking, visual proof points, subtitle strategy, emotional rhythm, CTA timing, and target audience. One key insight stood out: "This video isn't selling a product, but certainty." A three-tier chain of physical evidence builds trust, "20,000 SKUs + $0.20 average price" creates a numerical anchor, and hand-holding, full-service promises achieve risk reversal. More importantly, it demonstrated the ability to transfer what it had learned: when asked to write a script for a "custom T-shirt factory" using the same logic, it output an executable 5-step template, changing the hook to "stretching a T-shirt to show elasticity," replacing the proof with "close-up of inkjet printing + colorfastness after rubbing," and even drafting engagement guides for the comment section.
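What "structured" means here can be made concrete with a small schema. The sketch below is a hypothetical Python representation of the seven dimensions. The class and field names are assumptions rather than the model's actual output format, and only the values quoted above for the T-shirt script are filled in.

```python
import json
from dataclasses import asdict, dataclass, field, fields
from typing import List


@dataclass
class VideoBreakdown:
    # One field per dimension named in the article; names are assumptions.
    hook: str = ""
    selling_point_ranking: List[str] = field(default_factory=list)
    visual_proof_points: List[str] = field(default_factory=list)
    subtitle_strategy: str = ""
    emotional_rhythm: str = ""
    cta_timing: str = ""
    target_audience: str = ""


def missing_dimensions(breakdown: VideoBreakdown) -> List[str]:
    """List the dimensions that are still empty, in declaration order."""
    return [f.name for f in fields(breakdown) if not getattr(breakdown, f.name)]


# Only the details quoted in the article are filled in.
tshirt_script = VideoBreakdown(
    hook="Stretching a T-shirt to show elasticity",
    visual_proof_points=[
        "Close-up of inkjet printing",
        "Colorfastness after rubbing",
    ],
)

if __name__ == "__main__":
    print(json.dumps(asdict(tshirt_script), indent=2))
    print("Still to fill in:", missing_dimensions(tshirt_script))
```

Once a breakdown lives in a shape like this, it can be searched, compared across videos, or handed to another tool, which is the practical difference between "understood" and "usable."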
Another test involved "dictating code." After I hand-drew a rough app wireframe and dictated the requirements via the camera, the model generated runnable React code. Subsequent verbal modifications (a sidebar, rounded corners, a dark theme, press animations) were handled across multiple iterations without losing context. This loop of watching, speaking, and modifying is the most natural way for people to interact, and the model handled it effectively.
Underpinning these capabilities are a hybrid-attention MoE architecture and native multimodal pre-training on over 100 million hours of audio data. The model is reported to achieve SOTA on 215 third-party benchmarks, with several metrics surpassing Gemini-3.1 Pro. It features a 256K context window, supporting over 10 hours of audio. Capabilities include speech recognition for 113 languages and dialects and TTS synthesis for 36. Input pricing is under ¥0.8 per million tokens, less than one-tenth the cost of Gemini-3.1 Pro. In summary, Qwen3.5-Omni makes audiovisual content "disassemblable": not just "understood," but broken down into searchable, reusable data assets ready for immediate application.
**III. When Wukong Wields the Golden Cudgel** Wukong can operate browsers, read and write files, execute cross-device tasks, and invoke thousands of DingTalk capabilities, but its inability to process audiovisual content limits its utility in natural business scenarios. Qwen3.5-Omni, which can break videos down into timestamped structured data, understand multilingual audio, and interpret mixed visual and auditory input, fills exactly this gap.
If the two are successfully integrated, you could hand over a two-hour meeting recording and get back more than minutes. The system would identify who said what and when, detect tone such as certainty or hesitation, flag action items, and then directly create tasks in DingTalk, assign them to the relevant people, and set deadlines. This moves from "understanding the meeting" to "executing the meeting's conclusions" without manual intervention.
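The handoff from perception to execution in this scenario can be sketched as a simple transformation from annotated transcript segments to task records. Everything below is hypothetical: the `Segment` fields stand in for whatever an audio model might emit, the dictionaries returned by `build_tasks` are an invented payload and not DingTalk's task API, and the 7-day default deadline is an arbitrary assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Segment:
    # Hypothetical shape of one annotated transcript segment.
    speaker: str
    start_s: float                      # offset into the recording, seconds
    text: str
    tone: str                           # e.g. "certain" or "hesitant"
    action_owner: Optional[str] = None  # set only when this is an action item
    due_in_days: Optional[int] = None


def build_tasks(segments: List[Segment], meeting_day: date,
                default_days: int = 7) -> List[dict]:
    """Turn action-item segments into task records (invented payload)."""
    tasks = []
    for seg in segments:
        if seg.action_owner is None:
            continue  # ordinary discussion, not an action item
        days = default_days if seg.due_in_days is None else seg.due_in_days
        tasks.append({
            "title": seg.text,
            "assignee": seg.action_owner,
            "due": (meeting_day + timedelta(days=days)).isoformat(),
            "source_timestamp_s": seg.start_s,
            # Hesitant commitments are flagged for a human to confirm.
            "needs_review": seg.tone == "hesitant",
        })
    return tasks


# Made-up example segments.
SEGMENTS = [
    Segment("Alice", 125.0, "Send the revised quote to the client",
            "certain", action_owner="Bob", due_in_days=2),
    Segment("Bob", 300.0, "We might be able to ship next month", "hesitant"),
    Segment("Carol", 410.5, "Look into the refund backlog",
            "hesitant", action_owner="Carol"),
]

if __name__ == "__main__":
    for task in build_tasks(SEGMENTS, date(2025, 1, 6)):
        print(task)
```

Keeping the source timestamp on each task lets the assignee jump back to the exact moment in the recording, and the `needs_review` flag is one way tone detection could feed into execution rather than being discarded.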
Marketing teams would no longer need to manually monitor competitors' short video accounts. The AI could autonomously watch competitor videos, break down their conversion logic (as in the Qwen3.5-Omni TikTok analysis), output transferable script templates, and then use Wukong to automatically publish adapted content on social media, even progressing to customer acquisition. This streamlines the process from "competitor analysis" to "content production" to "customer conversion."
A more routine application is quality inspection of customer-service call recordings. Previously, this required manual listening, note-taking, and scoring, which limited daily throughput. With full multimodal capabilities, the AI could listen to all recordings, output emotional trajectories and technique scores per call, flag problematic interactions, generate improvement suggestions, and log results into DingTalk's management system.
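The flagging step in such a quality check could be as simple as a pair of threshold rules. The sketch below assumes a per-call record with a sentiment trajectory on a -1 to 1 scale and a technique score from 0 to 100. Both scales, the thresholds, and the `CallReview` and `flag_reasons` names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallReview:
    call_id: str
    emotion_trajectory: List[float]  # customer sentiment samples, assumed -1..1
    technique_score: int             # agent technique, assumed 0..100


def flag_reasons(review: CallReview, min_score: int = 60,
                 max_drop: float = 0.5) -> List[str]:
    """Return the reasons a call should be escalated (empty list if none)."""
    reasons = []
    if review.technique_score < min_score:
        reasons.append("low technique score")
    traj = review.emotion_trajectory
    # Compare the first and last samples to catch calls that ended worse.
    if len(traj) >= 2 and traj[0] - traj[-1] > max_drop:
        reasons.append("customer sentiment dropped")
    return reasons


# Made-up example calls.
CALLS = [
    CallReview("call-001", [0.1, 0.3, 0.6], 85),
    CallReview("call-002", [0.2, -0.1, -0.6], 72),
    CallReview("call-003", [0.0, -0.7], 40),
]

if __name__ == "__main__":
    for call in CALLS:
        print(call.call_id, flag_reasons(call) or "ok")
```

Simple rules like these only triage. The improvement suggestions described above would still come from the model's reading of each flagged call.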
The underlying logic across these scenarios is consistent: perception → understanding → execution, forming a complete loop. Wukong addresses execution; Qwen3.5-Omni addresses perception. Furthermore, Qwen3.5-Omni's sub-¥0.8/million token pricing makes the entire flywheel economically feasible. The puzzle pieces are nearly assembled.
**Conclusion** In Journey to the West, Sun Wukong was formidable from the moment he emerged from the stone. But he grew significantly stronger after obtaining the Golden Cudgel, finding a master, and embarking on his journey. DingTalk's Wukong has already emerged. The Golden Cudgel has just been forged but not yet handed over. The journey is long—token costs need reduction, the product requires refinement, and awareness must be built among 27 million enterprises, one by one. But the monkey, the cudgel, and the path are all present.