On January 22, the Qwen official WeChat account announced that the full Qwen3-TTS family has been released as open source. Qwen3-TTS is a series of speech generation models developed by Qwen. It supports voice cloning, voice creation, highly natural, human-like speech generation, and control of speech through natural-language descriptions, giving developers and users a broad set of speech generation capabilities.

The models build on the Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which compresses speech signals efficiently while retaining strong representational capacity. According to the announcement, it preserves paralinguistic information and acoustic environment characteristics, and it enables fast, high-fidelity speech reconstruction through a lightweight non-DiT architecture. Qwen3-TTS also uses Dual-Track modeling for bidirectional streaming generation, so the first audio packet can be produced after only a single character of input.

The entire multi-codebook series of Qwen3-TTS models has been open-sourced in two sizes, 1.7B and 0.6B. The 1.7B model delivers the best performance and the strongest control capabilities, while the 0.6B model balances performance and efficiency. The models cover 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as a range of dialect voices, targeting applications worldwide. They also have strong contextual understanding: they adapt tone, rhythm, and emotional expression to instructions and to the meaning of the text, and they are significantly more robust to noise in the input text. The models are now available as open source on GitHub and can also be tried through the Qwen API.