Lots of impressive announcements at the OpenAI DevDay yesterday, but I'm going to focus on the implications for audio Conversational AI developers and actual changes available now in the public API we consume.
As expected, many of the announcements were about the public API catching up with features recently added to the ChatGPT interface:
- Media input and output
- New Text to Speech API
- Whisper 3
There is also a new GPT4-turbo model which is released in preview. Although the focus for now is optimising to reduce token costs (2-3x reduction over GPT-4) rather than execution speed, the letter is coming. As well as the cost reduced GPT-4 Turbo, costs are also reduced on the older GPT-3.5 Turbo model by a factor of 2-3.
GPT-4 Turbo Model
This has both an increased context window 128k tokens, and a reduced cost of $0.01 per thousand tokens (input) and $0.03 (output).
This new model also brings some improvements to function calling, allowing more than one call in a completion (parallel calls) and better instruction following for parameters so less chance of spurious call data.
This apparently also carries over to other output format instructions, and a new hard JSON completion mode flag has been added to guarantee JSON output in a similar way to function calls. Useful if you are using model completions in JSON but not functions.
GPT-3.5 Turbo New Model
All of the functional improvements in the new GPT-4 preview model have also been applied to a new `gpt-3.5-turbo-1106` 16k context model. It isn't the current default so will need to be explicitly specified, but will also apply the function calling and output format instruction following.
New TTS endpoint
This is a whole new area of the API and would have been needed to support ChatGPTs audio conversational interface.
https://api.openai.com/v1/audio/speech gets you an mp3, opus, aac or flac output audio file.
There are 6 voices, but no
language parameter so they are all basically
en-US, albeit some of them sound a little bit international. This won't therefore be any use to you if you are building voice-bots outside of the US and certainly English markets.
Very nice, context sensitive, natural inflexion on the voices though. If these do develop for other markets then Google et-al will be in trouble as they are a lot more natural than other vendors current offerings.
Speech to Text
No changes in the API here, but a new Whisper 3 was announced and a new large-v3 model was committed to the public Open Source repo at https://github.com/openai/whisper
Whilst this likely improves accuracy, I haven't done any testing with either API or open source model to see if there have been any changes in runtime efficiency, which is a key issue for latency audio conversational AI interfaces.
It is still a batch processing interface so no streaming yet, although this can obviously be worked around by chunking batches (see my talk at RTC.ON for one approach). This is less optimal compared to APIs like Google that do genuine streaming with provisional and then a final transcription response which are needed to do conversation turn detection in audio conversations.
This is a huge new area of the API which will have been necessary to deploy the new "Custom GPT" ChatGPT functionality. Lots of really exciting features, but currently some massive drawbacks which will make them hard to use initially for audio conversational interfaces.
- Thread context: the assistant keeps a separate thread context for each conversation, removing the need to do external context management
- Run code: can run Python snippets in a sandbox as part of the assistant to reduce call latency over throwing calls back to your server code
- Built in retrieval, allowing the initial huge amounts of data to be incorporated into knowledge base of the agent, and also dynamic retrieval (RAG) to augment the accuracy of responses without coding this into your server.
All the full detail is here: https://platform.openai.com/docs/assistants/how-it-works but there are some major limitation in my opinion, which may lead you to need to keep per thread context in your code and manage this in the "old way":
- Inflexible step execution: once a message has been queued, the documentation states that the thread is locked and can only move forwards to completion or a failure state. There is also a cancel endpoint and state but this isn't guaranteed. I'm going to need to play with this but it is not clear that steps can be rolled back if you have a provisional STT response and then get further user input and a different final transcription and want to re-run a step.
- Polling: once a message is posted, the the client has to poll for state changes to know when the run state changes. Streaming support is on the todo list.
Conclusions and Actionable Insights
Lots of really interesting stuff here, I'm going to be taking a quick look at the new models and function calling changes on a branch of the open source Aplisay llm-agent framework now to see if we can work these in.
The cost reductions are significant and welcome, more important they point to a direction of travel which is a good precedent. A more important functional optimisation goal for natural conversations is the latency. This is apparently a lower priority. GPT-3.5 turbo is more useful in many applications compared to GPT-4 because of it's much shorter execution times. It is a shame we haven't seen this kinds of optimisation in GPT-4 Turbo. Only future benchmarking will tell on this.
The new TTS endpoint looks very interesting, but the apparent US English only focus means I don't see the point in moving over to it yet. The improvements in voice quality are irrelevant unless you only serve the US market.
Assistants are definitely going to be a development area because they incorporate so many different useful features: extended thread context management, RAG, server side code execution in one place. The main issue over building this functionality out yourself is that going with this API is a one way ticket to irreversible dependence on OpenAI offerings. Its current lockstep run model also means that recovering gone wrong turn synchronisation is going to be much harder than doing thread contexts in your own code.