JD.com (NASDAQ: JD) has announced the open-source release of JoyAI-Echo, its framework for long-form audio-video generation. The framework tackles three major industry challenges: keeping a character's appearance consistent, keeping voices stable, and accelerating generation, so that long videos can be produced both quickly and at high quality. In addition, JoyAI-Echo's "edit-while-chatting" mode turns video creation from a static, one-shot generation process into a dynamic, collaborative workflow. With significant application potential across video production, digital human streaming, brand marketing, education, and gaming content creation, the launch of JoyAI-Echo marks a major breakthrough for JD.com in long-form video generation, positioning it among the global leaders.
Key Innovations Addressing Core Challenges
While AI-generated short videos of a few seconds are becoming increasingly sophisticated, the industry has struggled to scale to minute-long content. At this longer duration, persistent issues emerge: a character's appearance changes between shots, a speaker's voice fluctuates or alters unexpectedly, and generation times become impractically slow, taking minutes or even half an hour. These problems have confined AI long-form video to a "toy" stage, preventing its serious use in production and value creation. JoyAI-Echo breaks this deadlock with four key technological innovations.
The first is a cross-modal audio-visual memory bank, which prevents characters from "changing faces." This is JoyAI-Echo's most critical breakthrough. The framework incorporates a dedicated memory bank that continuously stores and retrieves a character's visual features and speaker voice characteristics throughout multi-shot generation. This ensures high consistency in character identity, visual appearance, and voice timbre across videos as long as five minutes, eliminating the awkward scenario of a character morphing into someone else mid-scene.
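The store-and-retrieve idea behind such a memory bank can be illustrated with a toy sketch. It is not JoyAI-Echo's implementation: the class and method names are hypothetical, the feature vectors are plain lists rather than learned embeddings, and the running-average update and cosine-based drift check are assumptions chosen for simplicity.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryEntry:
    visual: list      # appearance features (hypothetical representation)
    voice: list       # voice-timbre features (hypothetical representation)
    shots_seen: int = 1


class AudioVisualMemoryBank:
    """Toy per-character memory holding visual and voice features together."""

    def __init__(self):
        self._entries = {}

    def store(self, character_id, visual, voice):
        """Add a shot's features, keeping a running mean per character."""
        entry = self._entries.get(character_id)
        if entry is None:
            self._entries[character_id] = MemoryEntry(list(visual), list(voice))
            return
        n = entry.shots_seen
        entry.visual = [(old * n + new) / (n + 1) for old, new in zip(entry.visual, visual)]
        entry.voice = [(old * n + new) / (n + 1) for old, new in zip(entry.voice, voice)]
        entry.shots_seen = n + 1

    def retrieve(self, character_id):
        """Return the stored entry for a character, or None if unseen."""
        return self._entries.get(character_id)

    def drift(self, character_id, visual, voice):
        """Return (visual_drift, voice_drift); 0.0 means identical direction."""
        entry = self._entries[character_id]
        return (1.0 - cosine(entry.visual, visual),
                1.0 - cosine(entry.voice, voice))


bank = AudioVisualMemoryBank()
bank.store("alice", [1.0, 0.0], [0.0, 1.0])   # shot 1
bank.store("alice", [1.0, 0.0], [0.0, 1.0])   # shot 2, same identity
print(bank.drift("alice", [0.0, 1.0], [0.0, 1.0]))  # a shot whose face has changed
```

In this sketch a new shot whose features drift too far from the stored entry could be flagged or regenerated; how the actual framework conditions generation on its memory is not described in the announcement.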
The second innovation is memory-driven post-training, which speeds up inference by roughly 7.5 times. The development team introduced a novel memory-driven post-training pipeline that combines supervised fine-tuning (SFT), cross-modal reinforcement learning from human feedback (RLHF), and Distribution Matching Distillation (DMD). The pipeline significantly improves generation quality while also accelerating inference: DMD alone delivers an approximately 7.5x speedup, substantially shortening the wait for long video generation.
Third, JoyAI-Echo incorporates an intelligent "director's assistant" – the Director Agent – enabling conversational editing for long-form video for the first time. Moving beyond the traditional model of inputting a prompt for a one-time output, JoyAI-Echo allows users to state requirements in natural language. It automatically parses these into scripts, characters, scenes, and shots. Users can then request modifications through conversation, and the framework regenerates only the problematic local shots instead of the entire video, shifting long-form creation from static generation to dynamic collaboration.
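The difference between regenerating everything and regenerating only the affected shots can be sketched as follows. This is a minimal illustration, not the Director Agent itself: all names are hypothetical, `render` is a stand-in for the actual generator, and the step that maps a chat request to shot IDs is assumed to have already happened.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    shot_id: int
    description: str
    version: int = 0   # incremented every time the shot is rendered


@dataclass
class Storyboard:
    shots: list = field(default_factory=list)


def render(shot):
    """Stand-in for the expensive audio-video generation of one shot."""
    shot.version += 1


def generate_all(board):
    """Initial pass: every shot is rendered once."""
    for shot in board.shots:
        render(shot)


def apply_edit(board, shot_ids, new_description):
    """Re-render only the shots named in the edit; leave the rest untouched."""
    touched = []
    for shot in board.shots:
        if shot.shot_id in shot_ids:
            shot.description = new_description
            render(shot)
            touched.append(shot.shot_id)
    return touched


board = Storyboard([Shot(1, "street at dawn"),
                    Shot(2, "hero enters cafe"),
                    Shot(3, "closing skyline")])
generate_all(board)
apply_edit(board, {2}, "hero enters cafe wearing a red coat")
print([(s.shot_id, s.version) for s in board.shots])
```

After the edit, only shot 2 has been rendered a second time, so the cost of a revision scales with the number of affected shots rather than with the length of the video.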
Fourth, a lightweight real-time super-resolution module ensures smooth high-definition output. To meet professional content production demands, JoyAI-Echo includes a dedicated real-time super-resolution module that supports two upscaling paths: from 736×1280 to 1152×1920 and from 736×1280 to 1472×2560. The module produces high-resolution video and refined audio in a single super-resolution step, maintaining stable HD performance even under streaming latency constraints.
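For readers who want the scale factors implied by those two resolution pairs, the arithmetic can be checked with a short script. The resolutions come from the announcement; the helper names are illustrative only.

```python
BASE = (736, 1280)                        # base resolution, width x height
TARGETS = [(1152, 1920), (1472, 2560)]    # the two upscaled resolutions


def scale_factors(base, target):
    """Return (width_factor, height_factor) between two resolutions."""
    return (target[0] / base[0], target[1] / base[1])


def pixel_ratio(base, target):
    """Return how many times more pixels the target frame contains."""
    return (target[0] * target[1]) / (base[0] * base[1])


for target in TARGETS:
    fx, fy = scale_factors(BASE, target)
    print(f"{BASE} -> {target}: width x{fx:.3f}, height x{fy:.3f}, "
          f"pixels x{pixel_ratio(BASE, target):.2f}")
```

The second pair is an exact 2x upscale in both dimensions, or four times the pixels per frame. The first pair works out to about 1.565x in width and 1.5x in height, so it is not a perfectly uniform scale; the announcement does not say how that difference is handled.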
Leading Performance Metrics Herald a New Era
To objectively evaluate JoyAI-Echo's performance, the development team constructed a long audio-video generation evaluation set based on 100 stories and 3,000 shots, conducting comprehensive multi-dimensional testing. The results show that JoyAI-Echo leads in all core metrics, including cross-shot consistency, video quality, text consistency, and speech content accuracy. Its speech content accuracy score reached 0.8646, significantly outperforming other industry models.
In comparative user evaluations against similar industry models, JoyAI-Echo was preferred 81.7% of the time for audio quality, 80.6% for prompt adherence, 63.6% for visual aesthetics, and 59.4% for IP consistency.
The introduction of JoyAI-Echo signifies the arrival of the "long-form AI video era." It opens new possibilities for virtual story and animation production, digital human content creation and live streaming, rapid iteration of brand marketing videos, and interactive educational content generation, promising to significantly optimize industry cost-efficiency. JoyAI-Echo also foreshadows a future where humans can continuously create, modify, and refine long video content as easily as having a conversation, integrating high-consistency, high-quality, and interactive video generation into the workflow of every content creator.