On April 3, Alibaba Cloud's Tongyi Lab announced the official launch of its video generation model, Wan2.7-Video. The model supports multimodal input, including text, images, video, and audio, and targets the full creation pipeline: generation, editing, replication, reshaping, driving, continuation, and referencing. It is described as more controllable and versatile, capable of both "directing" and "performing."

Wan2.7 accepts full-modal input and lets users control scene composition, plot direction, local details, and temporal changes, making video as editable as a document. Users can make local adjustments to video frames via instructions, with edited areas blending naturally into the original footage in lighting and texture. The model can also add or remove elements on command, replace objects, and modify object attributes, and it supports precise additions based on reference images.

Additionally, Wan2.7 can transform environments and styles while keeping character actions unchanged: for example, shifting a background from summer to late autumn, or instantly converting the footage to a wool-felt style for a parallel-universe effect. The model also enhances video quality, performs visual understanding tasks, and adjusts shooting methods to meet diverse editing needs.

For pre-shot or generated video content, Wan2.7 allows modifications to plot and filming techniques through descriptive commands. It enables sweeping changes to character behavior, dialogue, and camera angles without altering a character's identity or setting, supporting flexible secondary creation. Spoken dialogue can be rewritten while emotions, lip movements, and voice tone remain consistent with the new lines. Actions can also be altered, such as changing a seated character to a standing one while the character keeps playing a game, with only the action logic modified.
The model supports radical reinterpretations of characters within the same scene, such as replacing a player with a medieval knight while preserving the original posture. Camera settings, including position, angle, shot type, lens type, and focus, can all be adjusted; for instance, directing the camera to rise gradually from the ground yields a completely different viewing experience from the same footage.

Wan2.7 provides precise control over plot direction, visual composition, and lighting through methods such as start/end frames, video continuation, or continuation toward a specified end frame, balancing dynamic continuity with structural control. It also supports multimodal references from images, video, and audio to lock in appearance and voice characteristics. With support for up to five video subject references, each character can have a unique voice, and features remain consistent across multiple shots.