OpenAI launching their speech-to-speech real-time voice API is a big deal. It's now possible to build experiences like Advanced Voice Mode. Go build with OpenAI and LiveKit!
Here's why the speech-to-speech API matters so much. Historically, models could only accept a text input and only give a text output. In order to build a voice experience you needed to string together multiple tools. You'd use something like LiveKit to handle all of the network infrastructure and optimization. Then you'd use a speech-to-text provider to convert the audio to text to feed into the model. The model runs inference and spits out text, which is fed into a text-to-speech provider. That audio is then sent over the network (LiveKit) and played through the speaker in the application.
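The cascaded approach can be sketched as below. The stage functions and latency numbers are illustrative placeholders, not real provider calls — the point is that every hop adds delay, and the speech-to-text step throws away everything but the words:

```python
import time

# Stand-ins for the three stages of the old cascaded pipeline.
# The sleep() calls are illustrative per-stage latencies, not benchmarks.

def speech_to_text(audio: bytes) -> str:
    time.sleep(0.3)  # STT inference + network round trip
    # Tone, emphasis, pauses are all lost here -- only words survive.
    return "transcribed user utterance"

def run_llm(prompt: str) -> str:
    time.sleep(0.5)  # model inference
    return f"reply to: {prompt}"

def text_to_speech(text: str) -> bytes:
    time.sleep(0.3)  # TTS synthesis + network round trip
    return text.encode()  # stand-in for synthesized audio

def cascaded_pipeline(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)
    reply_text = run_llm(transcript)
    return text_to_speech(reply_text)

start = time.monotonic()
reply_audio = cascaded_pipeline(b"user audio frames")
elapsed = time.monotonic() - start
print(f"round trip: {elapsed:.1f}s")  # the stage latencies simply add up
```

A speech-to-speech model collapses all three stages into one inference call over raw audio, which is why it can be both faster and more expressive.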
This creates two "issues":
1) All of that conversion takes time. Stringing together multiple processes just increases the lag / wait time between one speaker and the other.
2) Converting an audio file to text is a form of compression, and when you compress a file you lose data. There's important context in an audio file that goes away in text: the tone of voice, which words are emphasized, etc.

These two "issues" make it harder to create human-like experiences. Either they aren't real time (because so much extra work has to happen in the background), or they lack a true emotional connection when it's all text-based. OpenAI launching the speech-to-speech API is a huge first step in creating truly human-like experiences with AI.
And when it comes to the network infrastructure - many builders will first try WebSockets or vanilla WebRTC. That's a good way to start, but as you scale you'll start to feel performance issues. Instead, start from a solid foundation with LiveKit.
OpenAI real-time audio works out to $10-$15 / hour ($120 / 1M tokens at an 80/20 blend). This will vary depending on how much the end user is talking vs how much the model is "talking," but should be directionally right. Many states have a minimum wage above this.
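A quick back-of-the-envelope check of those numbers, assuming the launch list prices of $100 / 1M audio input tokens and $200 / 1M audio output tokens (the 80/20 blend yields the $120 figure above):

```python
# Launch list prices for Realtime API audio (assumed inputs to the math).
input_price = 100.0   # $ per 1M audio input tokens
output_price = 200.0  # $ per 1M audio output tokens

# 80% of tokens are input (listening), 20% are output (speaking).
blended = 0.80 * input_price + 0.20 * output_price
print(f"blended rate: ${blended:.0f} / 1M tokens")  # $120 / 1M tokens

# Implied usage: $10-$15/hour at the blended rate corresponds to
# roughly 83k-125k audio tokens consumed per hour of conversation.
low_tokens = 10 / blended * 1_000_000
high_tokens = 15 / blended * 1_000_000
print(f"implied tokens/hour: {low_tokens:,.0f} - {high_tokens:,.0f}")
```

The actual cost per hour depends entirely on the input/output mix of the conversation, so treat the 80/20 split as a rough planning assumption.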
While the initial pricing may seem high, it’s important to consider the historical pricing trends for AI models. For example, the GPT-4 family of models has experienced a price reduction of over 90% in the past 18 months. This trend suggests that as technology advances and the models become more efficient, costs tend to decrease significantly over time.
With everything inference-related, the price you see today will not be tomorrow's price :) Business models that don't make sense on today's pricing will make sense on tomorrow's pricing! You have to build with this in mind.
https://twitter.com/jaminball/status/1841213689741132125