Why good audio conversational AI isn't shipping at scale yet
OpenAI showed a fantastic demo of real, natural-sounding, latency-free conversational AI in an update nearly two months ago, but it isn't shipping yet. It probably won't be for a while. Here is why...
Talking to an AI isn't exactly new. From Alexa, to telephone bots, we have been doing it for ages with varying levels of success.
Directly tokenising speech into a large language model input is, however, new and very interesting. Up to now, we have "talked" to AI (including large language models) using a clunky multi-step process that converts our speech into something closer to a text chat:
- Convert a segment of speech to text (STT), guessing at the correct length to match a conversation turn
- Feed it into a conversational large language model (latest LLM approach) or natural language understanding (the old scripted NLU approach), and get some textual results back
- Convert textual results into speech (TTS)
All of this tends to make the conversations somewhat less than natural. It introduces latency, some of which we can remove through techniques like streaming, but much of which we can't due to the fundamental multi-step process.
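The three steps above can be sketched roughly as follows. This is a minimal illustration, not any vendor's API: the functions speech_to_text, chat_completion and text_to_speech are hypothetical stand-ins (the "audio" is just UTF-8 bytes so the sketch runs without dependencies). The point is the shape of the pipeline: each stage has to finish before the next can start, and only a bare transcript ever reaches the model.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    transcript: str     # what the STT stage heard
    reply_text: str     # what the language model answered
    reply_audio: bytes  # what the TTS stage rendered


def speech_to_text(audio_segment: bytes) -> str:
    # Hypothetical STT stand-in: a real engine would transcribe audio.
    return audio_segment.decode("utf-8")


def chat_completion(history: list[str], user_text: str) -> str:
    # Hypothetical LLM stand-in: it only ever sees plain text.
    return f"You said: {user_text}"


def text_to_speech(text: str) -> bytes:
    # Hypothetical TTS stand-in: a real engine would synthesise audio.
    return text.encode("utf-8")


def run_turn(history: list[str], audio_segment: bytes) -> Turn:
    # The stages run strictly one after another, so their latencies add up.
    transcript = speech_to_text(audio_segment)
    reply_text = chat_completion(history, transcript)
    reply_audio = text_to_speech(reply_text)
    # The caller, not the model provider, keeps the conversation state.
    history.extend([transcript, reply_text])
    return Turn(transcript, reply_text, reply_audio)


history: list[str] = []
turn = run_turn(history, b"what is the weather like")
print(turn.reply_text)
```

Streaming can overlap some of this work, but the hand-offs between stages remain.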
We lose all of the nuance in speech when we convert it to text. Emphasis that gives a different meaning to the same sequence of words, and that would be instantly recognisable to another human, never makes it to the language model because it isn't expressed in the bare transcript we feed it.
Finally, we have issues with things like pauses and interruptions, which are intrinsic parts of speech that humans just deal with. Programmatic multi-step pipelines really struggle with these because of the need to predict conversation turns and rewind after interruptions. The end result is that no matter how smart the developers of the pipeline are, talking to an AI is like using a walkie-talkie over a lossy long-distance link. It still works, especially if we create enough value to make it worth the humans training themselves to adapt to the half-duplex mode to get the best results. It doesn't, however, feel terribly natural when compared to a real human conversation.
Making it more natural
The ultimate fix for all of the above issues is to coalesce speech, sentiment, timing and meaning through real-time, full-duplex audio streaming straight into the tokenisation pre-processor of the model. Combine this with model output tokens being instantly converted into an audio stream and we have a large language model that actually understands vocal language and can respond in the same way a human would.
This is called speech-to-speech. It sidesteps most of the problems with multi-stage, text-based pipelines and has the potential to make a language model appear more quick-witted and perceptive. So much so that I'm ready to predict that we will have to start slowing LLMs' conversational skills down to stop them being intimidating.
OpenAI have already developed speech-to-speech (at least in demo form), and there are other strong competitors working in this area and achieving really interesting results. Fixie.ai deserves special mention because its Ultravox multi-modal processor is based on open source Llama models and is available now.
The current focus on this as the forefront of LLM research means that most of the other LLM vendors will likely be actively developing a speech-to-speech capability.
It is a reasonable hypothesis that the era of speech-to-text-to-model-to-text-to-speech pipelines for AI interaction will shortly come to an end for most applications. There will be holdouts. In particular, the front end of the pipeline (the speech-to-model part) is where most of the gains come from, so there will still be good reasons to take a hybrid approach of speech-to-model-to-text-to-speech where you have a specialist TTS requirement, for example to use a custom trained voice (e.g. ElevenLabs style voice cloning).
So if everyone is developing speech-to-speech, and it is so much better, why can't I talk to an AI like this in production yet?
Tokenising speech directly and allowing us to talk to AIs seems to be a solved problem, at least in research labs. Efficiently rolling it out at scale is proving to be a bit harder.
Why can't I use it yet?
When we run an existing multi-stage pipeline, middleware software sitting between the user and the LLM provider orchestrates the STT and TTS, keeps track of the conversation state, reacts to the user's conversation turns, and manages the LLM. All the LLM provider needs to do is provide a stateless, low-bandwidth text API which generates a chat completion each time the middleware calls it. If anything goes wrong, this middleware holds enough state to recover the user experience in some meaningful way.
The cloud AI providers do what they do well: manage enough GPUs to respond in a relatively timely but otherwise best-effort fashion to a torrent of stateless, text-based API calls. They can load balance each request around the world to anywhere they currently have resources and, if something goes wrong, an occasional "too bad, try again" message is fine as the client will retry.
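To make the division of labour concrete, here is a minimal sketch of that arrangement. Everything in it is hypothetical: FlakyProvider stands in for a stateless text completion API that sometimes answers "too bad, try again", and Middleware is the component that owns the conversation state and retries. It is not modelled on any particular vendor's SDK.

```python
class ProviderBusy(Exception):
    """Raised by the hypothetical provider when it has no capacity."""


class FlakyProvider:
    """Hypothetical stateless text API that fails its first few calls."""

    def __init__(self, fail_first: int) -> None:
        self.remaining_failures = fail_first
        self.calls = 0

    def complete(self, messages: list[str]) -> str:
        # Stateless: everything needed to answer arrives in the request.
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ProviderBusy("too bad, try again")
        return f"reply to: {messages[-1]}"


class Middleware:
    """Holds the conversation state, so a failed call loses nothing."""

    def __init__(self, provider: FlakyProvider, max_attempts: int = 5) -> None:
        self.provider = provider
        self.max_attempts = max_attempts
        self.history: list[str] = []

    def user_turn(self, text: str) -> str:
        # Resend the full history every time; any server can handle it.
        request = self.history + [text]
        for _ in range(self.max_attempts):
            try:
                reply = self.provider.complete(request)
            except ProviderBusy:
                continue  # safe to retry: the state lives here, not there
            self.history = request + [reply]
            return reply
        raise ProviderBusy("gave up after repeated failures")


provider = FlakyProvider(fail_first=2)
middleware = Middleware(provider)
print(middleware.user_turn("hello"))
```

Because each request carries its own context, the provider is free to route it to whichever machine has spare capacity.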
It doesn't work like this for audio sessions. For a start they aren't individual transactions, but a continuous stream from start to end of whatever constitutes a session. From when the user makes their microphone live to when they are done, the LLM provider has to sink the continuous audio stream somewhere in their network, keep the conversation state themselves, tokenise that stream and feed it into a GPU.
OpenAI have 180 million users for ChatGPT, with 100 million weekly users. For most of the time, even when the users are sitting in a text chat session and communicating with their platform, they aren't actively using resources. The chat client spends most of its time waiting for the user to type something. When they have finished typing and pressed send, the client bundles a request with all of the state, latest bit of typing appended, to an API to be load balanced and put through some hardware somewhere as a discrete transaction. Even if the user is doing audio chat, it doesn't change the nature of the transaction much. The client captures their audio, sends a complete segment as a file to the OpenAI Whisper interface, which turns it into text at much greater than real-time throughput, then it is sent to the same text based interface.
Let's say those 100 million weekly users are, on average, in an active chat for 1 hour a week on a new speech-to-speech model. OpenAI would need to build an infrastructure which can sink roughly 400 million hours of WebRTC streams a month, connect each concurrent conversation to a persistent, stateful tokeniser in their network, feed the tokens into a model on a GPU with very predictable low latency, then convert the output tokens to audio and send it back on the other half of the stream. Not only is this a huge thing to build, it is also entirely orthogonal to the job of providing GPUs running models. It is not something they will necessarily have much expertise in. It is the same kind of problem that conferencing vendors had scaling their infrastructure to keep business running when the Covid pandemic took us all by surprise, but without the head start of already having working architectures that just needed scaling.
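The figures above can be worked through as a back-of-envelope calculation. The inputs are the article's own assumptions (100 million weekly users, 1 hour each per week, a month rounded to 4 weeks). The average concurrency figure is an extra derived number that assumes usage is spread perfectly evenly around the clock, which real traffic will not be, so actual peaks would be higher.

```python
WEEKLY_USERS = 100_000_000
HOURS_PER_USER_PER_WEEK = 1
WEEKS_PER_MONTH = 4          # round figure, as used in the text
HOURS_IN_A_WEEK = 7 * 24     # 168

# Total audio the provider has to sink.
stream_hours_per_week = WEEKLY_USERS * HOURS_PER_USER_PER_WEEK
stream_hours_per_month = stream_hours_per_week * WEEKS_PER_MONTH

# Average number of live streams if usage were perfectly flat (assumption).
average_concurrent_streams = stream_hours_per_week / HOURS_IN_A_WEEK

print(f"{stream_hours_per_month:,} stream hours a month")
print(f"about {average_concurrent_streams:,.0f} concurrent streams on average")
# 400,000,000 stream hours a month
# about 595,238 concurrent streams on average
```

Even on the flat-usage assumption, that is several hundred thousand simultaneous stateful sessions, each needing its own low-latency path to a GPU.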
My prediction is that, whilst the base AI technology is probably sound and settled, the job of building an architecture to sink this much audio data in real time and hold the associated state at the scale of their entire user base is going to keep OpenAI busy for many months yet. When it does arrive, it will be rolled out very cautiously and incrementally.
What about other Vendors?
One to watch here is Fixie.ai. They may not have the raw AI momentum of OpenAI, although they do have an impressive, working, open source model already. What they also have is a team led by the people that defined WebRTC in the first place and built Google's conferencing architectures.
OpenAI and Fixie.ai almost certainly aren't the only people developing in this space. The ability to roll out WebRTC media processing at scale may just turn out to be more important and useful for acquiring momentum in this new speech-to-speech space than the AI itself!
Other implications
Right now, providing the middleware to convert telephone audio conversation into text, keep transaction state, call an LLM for text completions, and convert back to audio is a valuable emerging service that a few established and new entrant "smart" telephony companies are providing. If audio interfaces through WebRTC become just one of the things that model providers intrinsically offer anyway, primarily for web and app consumers, then they eliminate 90% of the complexity (and hence the value proposition) of this service. Putting LLMs on the telephone then just becomes a matter of gatewaying between SIP telephony media streams and the WebRTC interface provided by the LLM API. Telephony providers could instead counter by developing their own vertical services based on open source models, leading to a race to see who can own the customers faster.
What do you think? Feel free to engage on Twitter, LinkedIn or contact me privately.