On November 27, Giant Network Group Co., Ltd.'s AI Lab, in collaboration with Tsinghua University's SATLab and Northwestern Polytechnical University, announced three new multimodal generation technologies in the audio-visual domain. The research outcomes will be progressively open-sourced on platforms such as GitHub and HuggingFace.
The three innovations are:
1. **YingVideo-MV**: A music-driven video generation model capable of producing synchronized music video clips from just "one music track plus one character image." The model performs multimodal analysis of the music's rhythm, emotion, and structure, aligning camera movements (e.g., pans, zooms, tilts) with the audio (a beat-to-camera-cue sketch follows this list). Its long-sequence consistency mechanism mitigates common issues such as character distortion and frame jumps in long videos.
2. **YingMusic-SVC**: A zero-shot singing voice conversion model optimized for real-world music scenarios. It minimizes interference from accompaniment, harmonies, and reverb, reducing the risk of vocal breaks and pitch distortion while providing stable support for high-quality music reproduction (a pitch-contour sketch also follows the list).
3. **YingMusic-Singer**: A singing synthesis model that generates natural vocals with clear pronunciation and a stable melody from arbitrary lyric input. Its flexibility with lyrics of varying lengths and its zero-shot timbre cloning enhance AI-assisted music creation, lowering the barrier to artistic production.
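To make the beat-to-camera alignment idea behind YingVideo-MV concrete, here is a minimal Python sketch using the off-the-shelf librosa library. It is not based on the unreleased model's API: the synthetic click track and the hard-coded `CUES` table are assumptions made purely for illustration, standing in for what a learned music-to-cinematography policy would predict.

```python
import numpy as np
import librosa

# A synthetic 120 BPM click track stands in for a real song so the sketch is
# self-contained; in practice you would call librosa.load() on a music file.
sr = 22050
duration = 10.0
true_beats = np.arange(0.5, duration, 0.5)
y = librosa.clicks(times=true_beats, sr=sr, length=int(duration * sr))

# Detect beat positions and convert them to timestamps in seconds.
_, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

# Toy cinematography policy: cycle through a fixed set of camera cues, one per
# beat. A real music-driven generator would derive these from the music's
# rhythm, emotion, and structure rather than from a hard-coded table.
CUES = ["zoom_in", "pan_left", "zoom_out", "pan_right"]
keyframes = [
    {"time_s": round(float(t), 2), "camera": CUES[i % len(CUES)]}
    for i, t in enumerate(beat_times)
]

for kf in keyframes[:8]:
    print(kf)
```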
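For the voice-conversion item, the sketch below illustrates only the general pitch-stability concern: the fundamental-frequency (F0) contour is the quantity a singing voice conversion system is expected to preserve while swapping timbre. It uses librosa's pYIN tracker on a synthetic pitch glide and does not reflect YingMusic-SVC's actual implementation.

```python
import numpy as np
import librosa

# Synthetic "vocal": a 2-second glide from A3 (220 Hz) to A4 (440 Hz) stands in
# for an isolated singing stem; real use would load a separated vocal track.
sr = 22050
y = librosa.chirp(fmin=220.0, fmax=440.0, sr=sr, duration=2.0)

# Frame-level F0 via pYIN: the pitch contour that conversion should keep intact
# while the timbre changes; NaN frames are unvoiced.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
    sr=sr,
)

voiced_f0 = f0[~np.isnan(f0)]
print(f"voiced frames: {voiced_f0.size}, "
      f"F0 range: {voiced_f0.min():.1f}-{voiced_f0.max():.1f} Hz")
```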
These advancements underscore the team's progress in multimodal audio-visual generation technologies.