Alibaba's Tongyi Laboratory has released and open-sourced Fun-CineForge, the first multimodal large model capable of supporting film and television-grade dubbing across multiple scenarios. The release is accompanied by open-source methodologies for constructing high-quality datasets. Through an integrated "data + model" design, Fun-CineForge aims to address key long-standing challenges in AI-powered cinematic dubbing.
In real-world film and television production, high-quality dubbing must pass four rigorous tests simultaneously. Lip synchronization requires synthesized speech to closely match the on-screen lip movements of characters. Emotional expression demands human-like portrayal and flexible control of emotion and tone, guided by character appearance and descriptive instructions. Voice consistency means each character's voice remains similar and uniform across complex multi-role dubbing scenarios. Temporal alignment ensures speech is synthesized within the correct time intervals even when the speaker is obscured or off-screen.
Current AI dubbing methods face two primary bottlenecks. First, high-quality multimodal datasets are scarce. Existing dubbing datasets are often too small, carry limited annotation types, and are costly to produce manually at scale. They also lack long-form video data featuring dialogues and multi-speaker scenarios, making it difficult for large models to handle complex dubbing situations. Second, model capabilities are insufficient. Traditional dubbing models rely primarily on clearly visible lip regions in video frames for audio-visual synchronization. However, real-world dubbing involves many complex scenes, such as multi-person dialogues, frequent camera cuts, facial obstructions, and blurry faces, where current technology struggles to synchronize without a clear view of the speaker's face.
To tackle these issues, Tongyi Laboratory introduced Fun-CineForge. The open-source release consists of two core components designed to create a closed-loop "data-model" pipeline for cinematic dubbing: a multimodal dubbing large model tailored for complex film scenarios, and a construction pipeline for large-scale multimodal dubbing datasets (CineDub).
Leveraging the robust underlying speech synthesis capabilities of CosyVoice3, Fun-CineForge builds a dubbing large model that processes video and text inputs to generate speech. The inputs include a silent video clip, dubbing text, character attributes, emotional cues, timing information, and a reference voice. The model can then synthesize speech that aligns closely with the timing and video information, matching the timbre of the reference voice.
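Fun-CineForge's actual programming interface is not shown in the article; as a rough sketch of how the listed inputs (silent video, dubbing text, character attributes, emotional cue, timing, reference voice) might be bundled for an inference call, with all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DubbingRequest:
    """Inputs a dubbing model like Fun-CineForge consumes.
    Field names are hypothetical; the article only lists the input types."""
    video_path: str        # silent video clip
    text: str              # dialogue lines to synthesize
    character: dict        # character attributes (e.g. gender, age)
    emotion: str           # emotional/tone cue from a descriptive instruction
    segments: list         # (start_s, end_s) intervals the speech must occupy
    reference_voice: str   # reference audio whose timbre should be matched

req = DubbingRequest(
    video_path="scene_012.mp4",
    text="We leave at dawn.",
    character={"gender": "female", "age": "adult"},
    emotion="determined",
    segments=[(1.2, 2.8)],
    reference_voice="ref_voice_anna.wav",
)
total = sum(end - start for start, end in req.segments)
print(f"speech must fit {total:.1f}s of screen time")
```

The timing intervals are what distinguishes this from a plain TTS request: the synthesized audio has to land inside the given windows, not merely read the text.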
Fun-CineForge first establishes an automated dataset production pipeline that converts raw film and television footage into structured multimodal data. This pipeline involves voice separation, text transcription, long-video segmentation, and audio-visual joint speaker diarization. A bidirectional correction mechanism based on large language model chain-of-thought reasoning significantly reduces error rates in transcribed text and speaker separation results. The Chinese character error rate dropped from 4.53% to 0.94%, the English word error rate decreased from 9.35% to 2.12%, and the speaker diarization error rate fell from 8.38% to 1.20%. The resulting data covers various typical scenarios, including monologues, narrations, dialogues, and multi-speaker interactions. Each data entry contains transcribed lines, frame-level facial and lip data, character attribute and emotional cues, millisecond-level timestamps, and clean voice tracks. These complementary multimodal elements provide a solid foundation for training the model's professional dubbing capabilities.
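The character and word error rates quoted above are standard edit-distance metrics, independent of the Fun-CineForge pipeline itself. A minimal character error rate (CER) implementation, for reference:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Single-row dynamic-programming edit distance.
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (r[i - 1] != h[j - 1]))   # substitution
            prev = cur
    return dp[len(h)] / len(r)

print(cer("kitten", "sitting"))  # 3 edits over 6 reference chars -> 0.5
```

Word error rate is the same computation over word tokens instead of characters; the reported 4.53% → 0.94% Chinese CER drop means fewer than one character in a hundred was transcribed wrongly after the LLM-based correction pass.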
A key innovation of Fun-CineForge is the introduction of a "temporal modality" into the dubbing model. Unlike traditional TTS models that focus mainly on text, audio features, or visual information, cinematic dubbing requires an additional critical dimension: time. This includes knowing when a character starts and stops speaking, and which character is speaking during a specific time interval. This temporal information directly helps the model understand "who is speaking what and when." When the visual modality cannot detect a speaker, the temporal modality acts as a strong supervisory signal, ensuring speech is generated in the correct time segment. This capability enables the model to handle dubbing in complex scenes.
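The article does not specify how the temporal modality is encoded internally. One plausible representation (hypothetical) is a speaker-labeled timeline of millisecond intervals, mirroring the timestamp annotations described for the dataset; such a timeline answers "who is speaking when" even with no visible face:

```python
# Hypothetical timeline: (start_ms, end_ms, speaker, line) tuples, in the
# spirit of the millisecond-level timestamps the CineDub data is said to carry.
timeline = [
    (0,    1450, "A", "Where were you last night?"),
    (1700, 3200, "B", "Working late. Why?"),
    (3400, 5100, "A", "Someone saw you at the docks."),
]

def speaker_at(timeline, t_ms):
    """Return the speaker active at time t_ms, or None during silence."""
    for start, end, speaker, _ in timeline:
        if start <= t_ms < end:
            return speaker
    return None

print(speaker_at(timeline, 2000))  # "B", even if B's face is off-screen
print(speaker_at(timeline, 1500))  # None: a pause between lines
```

Used as a supervisory signal, this kind of annotation tells the model both where speech must be generated and where it must stay silent, which is exactly the behavior the article describes for occluded or off-screen speakers.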
To achieve this, the Fun-CineForge model utilizes four types of complementary information. The visual modality learns lip movements and facial expressions; the text modality provides dialogue content and describes character attributes and emotional tone; the audio modality serves as the model's prediction target; and the temporal modality controls the timing of speech and indicates speaker identity in dialogue scenes.
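The fusion architecture is not described in the article. A common pattern for conditioning generation on several modalities is to project each into a shared embedding dimension and concatenate along the sequence axis; the toy shapes below are purely illustrative, not Fun-CineForge's actual design:

```python
# Toy per-modality token sequences of shape (seq_len, dim); dims are made up.
visual   = [[0.1] * 8 for _ in range(12)]   # 12 frames of lip/face features
text     = [[0.2] * 8 for _ in range(6)]    # 6 dialogue/attribute tokens
temporal = [[0.3] * 8 for _ in range(2)]    # interval/speaker-timing tokens

# One conditioning stream for the speech generator; the audio modality is
# the prediction target rather than an input here.
conditioning = visual + text + temporal
print(len(conditioning), len(conditioning[0]))  # 20 tokens, dim 8
```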
Experimental results show that Fun-CineForge outperforms existing open-source dubbing models across several key metrics, including speech naturalness, word error rate, emotional expressiveness, voice similarity, lip synchronization, temporal alignment accuracy, and instruction-following capability. The model performs best in single-speaker scenarios like monologues and narrations, and is the first to support both two-person and multi-person dialogues while achieving precise temporal alignment, audio-visual sync, and voice consistency.
Tongyi Laboratory conducted a comprehensive evaluation of Fun-CineForge on its self-built CineDub dataset, covering various typical cinematic dubbing scenarios such as monologues, narrations, dialogues, and multi-speaker scenes. Results indicate optimal performance in single-speaker settings, with Chinese character error rates of just 1.49% for monologues and 1.90% for narrations, alongside accurate audio-visual synchronization. In monologue scenarios, a comparison with DeepDubber-V1 and InstructDubber showed that Fun-CineForge significantly outperformed the baseline models in word error rate, lip synchronization, temporal alignment, and voice similarity.
Fun-CineForge is now open-source, allowing developers to immediately experience its capabilities for Chinese and English cinematic dubbing in various complex scenarios, including emotional expression, camera cuts, and facial obstructions. The project website provides rich examples covering monologues, narrations, dialogues, multi-speaker interactions, voice cloning, and instruction-based control. These samples demonstrate performance under challenging real-world conditions such as frequent shot changes, speaker switches, obscured faces, dark lighting, and scenes with multiple characters.
The technical paper, "Fun-CineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes," is available. Sample data from the CineDub dataset, with original video content removed, has been open-sourced on the website for reference, including bilingual Chinese (CineDub-CN) and English (CineDub-EN) versions.
While AI voice technology is already widely used in applications like customer service and virtual assistants, professional animation and film production demand higher standards. The model currently supports inference on video segments up to 30 seconds long; for longer videos, supplying more timestamp intervals and reference character audio can degrade audio-visual synchronization, voice-cloning accuracy, and robustness in multi-speaker dialogue scenes. Fun-CineForge nonetheless offers a new technical path for applying large audio models in professional dubbing production, and as multimodal large model capabilities continue to advance, AI is expected to play an increasingly significant role in content production for film, animation, gaming, and related fields.
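Given the stated 30-second inference limit, longer footage would have to be dubbed clip by clip. A trivial chunking helper (hypothetical, not part of the released toolkit) shows the arithmetic:

```python
MAX_CLIP_S = 30.0  # stated per-segment inference limit

def chunk(duration_s, max_len=MAX_CLIP_S):
    """Split a video duration into (start, end) windows of at most max_len."""
    windows, start = [], 0.0
    while start < duration_s:
        end = min(start + max_len, duration_s)
        windows.append((start, end))
        start = end
    return windows

print(chunk(75.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

In practice the cut points would need to respect utterance boundaries from the timestamp annotations rather than fall mid-line, which is part of why long-video robustness remains an open limitation.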