The Year of Multi-Modal

JaminBall
01-24 08:53

I have a growing conviction that 2026 will be the year of multi-modal AI. A handful of trends are converging at the same time: multi-modal models are getting good enough, inference is getting cheaper and faster (the cost curve matters), and the real world is starting to show up as first-class input. I really believe AI will stop living predominantly in text boxes and start showing up in the places humans actually are.

For the last few years, AI has been overwhelmingly text-first, and for good reason. Text was the fastest path to usefulness: easy to collect, easy to tokenize, relatively cheap to serve, and without the strict latency requirements of real-time audio or video. If you were building an AI product in 2023 or 2024, starting with text was the rational choice. But text always seemed like a middle state, not an end state. Humans do not experience the world in text. Work does not happen in text. The physical world certainly does not operate in text. Multi-modal AI was always where this was heading. And I think we're close!

What feels different now is how suddenly the pieces are snapping into place. Just in the last week, we saw a wave of production-grade text-to-speech models that would have felt experimental not long ago. $NVIDIA(NVDA)$ PersonaPlex is a good example of how expressive and controllable synthetic voices have become, especially for characters and agents. Inworld TTS also had a release, and it's clearly optimized for low-latency, interactive dialogue rather than polished narration. Flashlabs Chroma 1.0 shows how quickly open ecosystems are closing the quality gap. And $Alibaba(BABA)$ Qwen3 TTS reinforced that this is global and competitive, not confined to a single lab or market. Voice is just one modality, but it is a useful signal that something broader is happening.

At the same time, inference economics are finally catching up, and cost curves are bending. Multi-modal AI was more impractical than impossible: latency was too high, costs were too unpredictable, and systems were too brittle to trust in real-world workflows. That is changing fast. Inference engines are getting more efficient. GPUs are being utilized more effectively. Batching, speculative decoding, and modality-specific optimizations are pulling costs down and smoothing tail latency. Teams are also getting more comfortable deploying smaller, specialized models for vision, audio, or sensor data instead of forcing everything through one massive general-purpose model. The result is that multi-modal inference is no longer something you budget for cautiously and confine to smaller audiences or test cases. It's going mainstream!
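To make "speculative decoding" a little more concrete, here is a minimal, illustrative sketch of the greedy-acceptance variant in plain Python. The draft_model, target_model, and target_next functions below are toy stand-ins I've invented for illustration, not any real library's API: a cheap draft model proposes a few tokens, the larger target model checks them all in a single pass, and every token the two agree on is accepted without a separate expensive decoding call.

```python
# Toy sketch of greedy speculative decoding (illustration only, not a real API).

def target_next(context):
    # Toy "ground truth" next-token rule standing in for the large model.
    return sum(context) % 10

def draft_model(context, k):
    # Hypothetical cheap model: usually matches the target's rule, but is
    # deliberately wrong whenever the running sum is divisible by 3.
    proposed, running = [], list(context)
    for _ in range(k):
        guess = target_next(running)
        if sum(running) % 3 == 0:
            guess = (guess + 1) % 10   # deliberate draft error
        proposed.append(guess)
        running.append(guess)
    return proposed

def target_model(context, proposed):
    # Hypothetical expensive model: one verification pass that returns the token
    # it would have emitted at each drafted position.
    out, running = [], list(context)
    for tok in proposed:
        out.append(target_next(running))
        running.append(tok)
    return out

def speculative_step(context, k=4):
    proposed = draft_model(context, k)          # cheap: k draft tokens
    verified = target_model(context, proposed)  # expensive: one pass over all k
    accepted = []
    for drafted, target in zip(proposed, verified):
        if drafted == target:
            accepted.append(drafted)   # agreement: token accepted "for free"
        else:
            accepted.append(target)    # first mismatch: keep target's token, stop
            break
    return context + accepted

if __name__ == "__main__":
    ctx = [1, 2, 3]
    for _ in range(3):
        ctx = speculative_step(ctx)
        print(ctx)
```

The economic point is in the accept loop: every drafted token the target agrees with costs only its share of one verification pass instead of a full large-model decode step, which is one of the ways per-token cost and tail latency come down.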

The third piece is that the world itself is becoming legible to machines. Cameras, microphones, wearables, industrial sensors, cars, robots, and medical devices are producing continuous streams of data that finally have models capable of understanding them in real time. This unlocks entire categories that text-only AI could never reach: physical environments, always-on monitoring, workflows that unfold continuously rather than one prompt at a time. Once AI can see, hear, and react, it can take the next leap in functionality.

This is why 2026 matters specifically. All of these trends are converging. By 2026, model quality is no longer the gating factor for most multi-modal use cases. Inference cost and latency are low enough that always-on perception is viable. And distribution increasingly shows up through agents, devices, vehicles, and embedded systems rather than chat interfaces. At that point, multi-modal AI can step into the limelight and become a first-class citizen. Text-only AI will start to feel oddly constrained, the same way desktop-only software felt once mobile became ubiquitous.

The mistake is to think of this shift as simply text plus voice, or LLMs plus vision. The deeper change is that AI systems are beginning to experience the world the way humans do: through multiple senses, continuously, and in context. Text was the on-ramp, and 2026 is when AI finally leaves the keyboard! I'm excited for that future.
