Meta & OpenAI: Revolutionizing Voice/Video AI Apps

JaminBall
07-24

What we're seeing out of the foundation model companies is really getting me excited about the future of Voice / Video AI apps. It won't be long before we have truly human-like (real-time) experiences with voice / video AI apps. Let me explain - below is what $Meta Platforms, Inc.(META)$ and OpenAI said about their latest models (Llama 3.1 and GPT-4o):

Llama 3.1:

"We integrated image, video, and speech capabilities into Llama 3 using a compositional approach, enabling models to recognize images and videos and support interaction via speech. They are under development and not yet ready for release."

GPT-4o:

"Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations."

These are really significant developments. Models being able to take audio / video directly as input not only drastically reduces latency, but also allows for MUCH richer experiences. As the OpenAI quote above lays out, "compressing" an audio or video stream down to plain text means giving up a lot of important context - tone, multiple speakers, background sound - that sits around the text itself. We no longer have to deal with this "compression"!
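To make that concrete, here's a minimal Python sketch of the two architectures. Every function in it is a hypothetical stub (not a real OpenAI or Meta API) - the point is just the shape of the data flow and where information gets dropped.

```python
# Sketch only: contrasts the old cascaded Voice Mode pipeline with a
# native speech-to-speech model. All model calls are hypothetical stubs.

def transcribe(audio: bytes) -> str:
    """Stub for a speech-to-text model. Tone, overlapping speakers,
    and background sound are discarded at this step."""
    return "transcribed user text"

def chat(text: str) -> str:
    """Stub for a text-only LLM (the GPT-3.5 / GPT-4 stage)."""
    return "assistant reply text"

def synthesize(text: str) -> bytes:
    """Stub for a text-to-speech model. It can't recover emotion the
    LLM never saw."""
    return b"synthesized audio"

def end_to_end_model(audio: bytes) -> bytes:
    """Stub standing in for a natively multimodal model (GPT-4o-style)."""
    return b"expressive audio reply"

def cascaded_voice_mode(audio_in: bytes) -> bytes:
    # Old pipeline: three hops, each adding latency and losing context.
    text_in = transcribe(audio_in)   # audio -> text (lossy)
    text_out = chat(text_in)         # text  -> text
    return synthesize(text_out)      # text  -> audio

def speech_to_speech(audio_in: bytes) -> bytes:
    # New pipeline: one model sees the raw audio and emits audio directly.
    return end_to_end_model(audio_in)
```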

It's such an exciting time to be building a voice / video AI app. And I'm heavily biased, but @livekit is THE tool to use when building here. Network infrastructure is really complex. Scaling an app using web sockets / vanilla WebRTC is really hard. Let LiveKit handle that complexity for you, like they do for OpenAI, Character, $Spotify Technology S.A.(SPOT)$ and many others!
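To give a flavor of what "handle that complexity" means, here's a minimal sketch of the kind of raw plumbing you'd otherwise write yourself - a single-process websocket audio relay using the Python websockets library. It's a toy under big assumptions: one client, one machine, no TURN/STUN, no reconnection handling, no media routing, no scaling, and the model call is a hypothetical placeholder.

```python
# Bare-bones websocket audio loop: client streams audio frames as binary
# messages, server streams model audio back. Everything hard (NAT traversal,
# jitter, reconnects, multi-region scaling) is left out.
import asyncio
import websockets  # pip install websockets

async def run_model(frame: bytes) -> bytes:
    # Hypothetical placeholder for a speech-to-speech model invocation.
    return frame  # echo, just to keep the sketch runnable

async def handle_client(ws):
    async for frame in ws:              # each message is a raw audio chunk
        reply = await run_model(frame)
        await ws.send(reply)            # stream audio back to the caller

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()          # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```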

Wow! The Llama 3.1 license allows users to use outputs from Llama 3.1 to train their own customized models. This is different from Llama 2 and Llama 3. Labeled training data is a huge bottleneck for newer models. You can now use Q&A pairs from Llama 3.1 (and other outputs) to train a new model. Barriers to entry for model development continue to drop.

From the FAQs:

"Can I use the output of the models to improve the Llama family of models, even though I cannot use them for other LLMs?

For Llama 2 and Llama 3, it's correct that the license restricts using any part of the Llama models, including the response outputs to train another AI model (LLM or otherwise). For Llama 3.1 however, this is allowed provided you as the developer provide the correct attribution. See the license for more information."

We'll probably see a whole host of smaller models converge on Llama 3.1's quality.
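As a hypothetical illustration of what the new license allows, here's a minimal Python sketch of distilling Q&A pairs out of Llama 3.1 into training data for a smaller model, assuming Hugging Face transformers access to the model. The prompts and file name are placeholders, and the license text itself is the authority on the attribution requirements.

```python
# Sketch: generate synthetic Q&A pairs with Llama 3.1 and write them to a
# JSONL file that a smaller model can be fine-tuned on.
import json
from transformers import pipeline  # pip install transformers torch

# Assumes you have accepted the Llama 3.1 license and have access on the Hub.
teacher = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
)

# Illustrative seed questions; a real run would use a much larger set.
seed_questions = [
    "Explain WebRTC in two sentences.",
    "What does end-to-end multimodal training mean?",
]

with open("synthetic_sft_data.jsonl", "w") as f:
    for question in seed_questions:
        answer = teacher(
            question, max_new_tokens=256, return_full_text=False
        )[0]["generated_text"]
        # Each line becomes one supervised example for the student model.
        f.write(json.dumps({"prompt": question, "response": answer}) + "\n")

# The resulting JSONL can then feed a standard supervised fine-tuning setup
# for the smaller model, with attribution as required by the 3.1 license.
```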

https://x.com/jaminball/status/1815870278943269180

