Since I’ve been exploring the extended universe of voice AI and voice agents, I’ve met a lot of people working with them in the real world, and it’s not just the average customer-support bot anymore. I’ve met teams building debt collection agents, emergency services routing, and even language-specific services for regions outside the US.
In the spirit of Learning in Public, I wanted to write down some of what I know in case someone might learn from it.
I mostly cover conversational voice agents using the full pipeline in this post (STT -> LLM -> TTS), but they are far from the only use case of voice AI. Voice interfaces like Wispr Flow or Willow use STT and an LLM without TTS for AI-powered voice dictation in any app. I also maintain an open source voice interface called Tambourine. Bulk transcription and summarization tools like Granola have looser real-time requirements like latency and can optimize for intelligence instead. You could even go the other direction and use only TTS for generating media content. The voice stack is flexible, so you can pick the pieces that you need.
I focus on conversational voice agents just because they touch the entire stack, making them a good lens for understanding the technologies involved.
So, if you’re wondering whether it’s possible to run voice agents in production in 2026, for me the answer is yes. Even though it might not be as easy as building web apps just yet, many players in the industry are vying to make it simpler for anyone to build in voice. If you’re interested in building a voice agent today, we can start off by choosing what layer of abstraction to build upon.
This post names a few models and providers as examples. There are many more out there so do your own research.
Layers of abstraction
- High-Level: Services like Vapi & Retell offer orchestration-as-a-service. They handle the handoffs between STT, LLM, and TTS, telephony bridging, and complex state management of a live conversation. This path is often chosen for faster speed to market.
- Low-Level: LiveKit & Pipecat are open source frameworks that give you full control over the agent pipeline. Instead of calling a black-box API, you’re building the “brain” of the agent yourself. This path is typically used if you require granular control or need to optimize costs.
Using higher-level orchestration services is more straightforward, so let’s focus on the low-level path to learn more about the underlying technology. When you own the pipeline, you need to start thinking about the two fundamental technical challenges of voice AI: The Network and The Model Pipeline.
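To make the low-level path concrete, here is a minimal sketch of the loop an orchestrator runs for one conversational turn. The `stt`, `llm`, and `tts` functions are stand-ins for real provider SDK calls (streaming, history, and tool calling are all omitted), so treat this as a shape, not an implementation.

```python
# Minimal sketch of the 3-model pipeline: audio in -> text -> text -> audio out.
# Each stage function is a placeholder for a real provider SDK call.

def stt(audio: bytes) -> str:
    """Stand-in for a streaming STT provider; here we pretend audio is text."""
    return audio.decode("utf-8")

def llm(transcript: str) -> str:
    """Stand-in for an LLM call; a real agent adds history and tool calling."""
    return f"You said: {transcript}"

def tts(text: str) -> bytes:
    """Stand-in for a TTS provider that synthesizes the reply into audio."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: user audio in, agent audio out."""
    transcript = stt(audio_in)
    reply = llm(transcript)
    return tts(reply)

print(handle_turn(b"book a table for two"))
```

In a real agent each stage streams into the next instead of waiting for the previous one to finish, which is exactly the state management that frameworks like Pipecat and LiveKit handle for you.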
For a deeper technical dive into these components, the Voice AI and Voice Agents Primer is an excellent resource.
The Network
The average developer rarely thinks about networking problems because most web apps are built around request-response patterns. That works fine because we’ve spent years optimizing the internet for that. When it comes to building voice AI on top of streaming protocols, optimizing the network becomes crucial for performance and quality. You can have the fastest and smartest LLM in the world, but if your audio packets are getting lost hopping across the globe, the conversation will feel broken.
When dealing with audio streaming for voice, you will hear a lot of infra terms that do not come up as often in the classic HTTP and REST API stack: WebSockets, WebRTC, SFU, SIP, PSTN, and so on.
As much of this still requires physical infrastructure, you will likely lean on existing global edge networks. Even LiveKit and Daily (the company behind Pipecat) provide cloud solutions to abstract this away. Daily and LiveKit grew significantly during the teleconferencing era and both have since applied their expertise to building for voice agents. Their global networks are hard for new entrants to replicate quickly, which makes them a good foundation for voice AI infrastructure.
That being said, I’ve been surprised by how many teams are starting to roll their own regional network stack. If you’re building for a specific market (say, Vietnam), you don’t necessarily need a global mesh. You can stand up your own WebRTC infra, using LiveKit’s open-source stack for example. For regional users, this approach can be more efficient than relying on generalized cloud solutions.
Telephony is the final piece of this puzzle. You still need CPaaS (Communications Platform As A Service) providers like Twilio or Telnyx.
Twilio is an established market leader in CPaaS with a mature ecosystem that many enterprises have historically relied on (I used it back in 2019) while Telnyx has focused on vertical integration specific to voice AI, owning their own networking and colocating GPUs for fast model inference.
The Model Pipeline
With the underlying network out of the way, you still have to worry about the actual intelligence part of the agent. Today, the industry standard is still the 3-model pipeline: STT (Speech-to-Text), also called ASR (Automatic Speech Recognition) -> LLM (Large Language Model) -> TTS (Text-to-Speech).
You might wonder why we aren’t all using direct Speech-to-Speech (S2S) models like the Gemini Live API. While such models might be more capable of capturing nuanced emotional cues, production agents still face two major hurdles:
- Intelligence vs. Latency: The models currently capable of handling complex tool calling with the highest reliability are still text-based.
- Cost: Running a conversation through S2S is still significantly more expensive than the multi-model pipeline.
By owning the pipeline, you can optimize for latency, cost, and advanced features. Developers typically aim to hit an 800ms voice-to-voice round trip.
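That 800ms target is a budget you split across every stage. The numbers below are illustrative, not provider benchmarks, but they show how quickly the budget gets eaten and why shaving 50ms off any single stage matters.

```python
# Hypothetical per-stage latency budget for an 800 ms voice-to-voice target.
# The numbers are illustrative assumptions, not measured provider benchmarks.
budget_ms = {
    "network (user -> edge)": 50,
    "STT (final transcript)": 250,
    "LLM (time to first token)": 300,
    "TTS (time to first audio)": 100,
    "network (edge -> user)": 50,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<28} {ms:>4} ms")
print(f"{'total':<28} {total:>4} ms  (target: 800 ms)")
```

Note that the LLM and TTS lines are time-to-first-token and time-to-first-audio: because the pipeline streams, the user hears the start of the reply long before the full response is generated.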
STT / ASR
Providers like Deepgram offer sub-300ms transcription, while Speechmatics provides features like diarization for multi-user conversations. You can also often self-host STT models, as they are smaller and don’t require GPU clusters, like Nemotron ASR which I use in my personal setup or optimized versions of OpenAI’s Whisper. Self-hosting one stage isn’t an option with a monolithic S2S model.
Pipecat’s recent STT benchmarks are a great reference for anyone building in this space.
LLM
LLM selection for voice agents follows similar principles as other agentic use cases. You need reliable tool calling, instruction following, and reasoning, but with the added focus on latency. Heavier models give better reliability but add round-trip time, so many production agents use lighter models like Gemini Flash. Providers like Cerebras specialize in delivering fast model inference for open weights models like gpt-oss-120b.
TTS
For the final leg, TTS converts the LLM’s text response into speech. Cartesia and ElevenLabs are some common providers you might have heard of that offer realistic, low-latency speech generation across many languages and tones. The TTS space hasn’t quite settled on a dominant approach yet, with different model architectures still competing. For example, Cartesia’s Sonic is built on State Space Models (SSMs) optimized for streaming, while ElevenLabs uses transformer-based architectures. It’s an area worth keeping an eye on as the research matures.
Collapsing STT and LLM
An interesting approach that blurs the line between the 3-model pipeline and full S2S is Ultravox. Instead of transcribing audio to text and then feeding it to an LLM, Ultravox encodes audio directly into the LLM’s token embedding space. This effectively merges the STT and LLM steps into one, reducing latency while still outputting text that can be fed to a standard TTS model. A tradeoff here is inspectability. With the intermediate transcript from STT you can log, debug, and audit, but with this approach that visibility is lost.
Taking Turns
Even with a perfect network and fast models, a voice agent can still feel robotic if it doesn’t know when to jump into a conversation or stop talking. Turn management is where some of the hardest engineering challenges are being tackled right now, solving for VAD and interruptibility, for example.
- VAD (Voice Activity Detection): How do you know when someone is finished talking? Naive volume-based VAD fails because natural speech has pauses between clauses, and silence alone doesn’t mean the speaker is done. We’re moving toward semantic VAD, using a tiny model to understand the intent of the pause. By far the most popular solution right now is Silero VAD.
- Interruptibility: If the agent is speaking and the user starts answering a question, how does the agent handle it? It can’t keep talking over the user and has to stop appropriately to take in the new information quickly and modify its response. Some frameworks, like Pipecat, ship their own Smart Turn model to try and solve for this.
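To see why naive VAD falls short, here is a toy energy-based detector: a frame counts as silence if its energy is below a threshold, and the turn only ends after a run of consecutive silent frames (a "hangover"). This is my own illustrative sketch, not how Silero VAD works; even with the hangover, it cannot distinguish a mid-sentence pause from a finished turn, which is why trained and semantic models win.

```python
# Toy energy-based VAD: ends the turn after `hangover` consecutive frames
# whose mean energy falls below `threshold`. Illustrative only.

def frame_energy(frame):
    """Mean squared amplitude of one audio frame (list of float samples)."""
    return sum(s * s for s in frame) / len(frame)

def end_of_turn(frames, threshold=0.01, hangover=3):
    """Return the index of the frame where the turn ends, or None if the
    speaker is (as far as this naive detector can tell) still talking."""
    silent = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) < threshold:
            silent += 1
            if silent >= hangover:
                return i
        else:
            silent = 0  # speech resets the hangover counter
    return None

# Three "speech" frames followed by five "silence" frames: the turn ends
# on the third silent frame.
frames = [[0.5] * 4] * 3 + [[0.0] * 4] * 5
print(end_of_turn(frames))
```

The failure mode is visible in the parameters: a short hangover clips speakers who pause between clauses, a long one adds dead air to every turn. Semantic VAD escapes that tradeoff by asking what the pause means, not just how long it is.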
Observability and Evals
Another piece of the puzzle in the voice space is observability and evals. This is common in other agentic applications too but the voice stack comes with additional challenges.
Voice agents are harder to evaluate than text agents because they have more dimensions of failure. It’s not just “did it give the right answer?” but “did it interrupt the user inappropriately?” or “was the latency spike caused by the model provider or an intermittent network issue?”
A common approach today is to focus on binary ratings for specific behaviors (“Did the agent complete the booking?” or “Did the agent correctly verify the caller’s identity?”) and keep a human in the loop to review traces, rather than looking for a single accuracy score based on arbitrary metrics. You can curate eval sets from real conversations, often called “Golden Datasets”, to test how tweaking the model pipeline would affect past conversations. Startups like Hamming AI or Coval are attempting to automate parts of this.
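A toy version of that binary-rating approach might look like the sketch below. The transcripts, check names, and keyword matching are all hypothetical; real setups replace the keyword checks with an LLM judge or tool-call inspection, and a human still reviews the traces.

```python
# Toy eval harness: binary checks over a small "golden dataset" of
# transcripts. All names and transcripts here are illustrative.

GOLDEN_SET = [
    {
        "transcript": "agent: your booking for 7pm is confirmed, see you then",
        "checks": {"completed_booking": lambda t: "confirmed" in t},
    },
    {
        "transcript": "agent: before we continue, can you confirm your date of birth?",
        "checks": {"verified_identity": lambda t: "date of birth" in t},
    },
]

def run_evals(dataset):
    """Run each case's binary checks and return {check_name: passed}."""
    results = {}
    for case in dataset:
        for name, check in case["checks"].items():
            results[name] = bool(check(case["transcript"]))
    return results

print(run_evals(GOLDEN_SET))
```

The point of the structure is that each question has a yes/no answer a reviewer can verify, which scales to a dashboard of pass rates far better than a single fuzzy accuracy score.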
It’s time to build
Building in voice right now feels like the early days of LLMs. Whether you rely on out-of-the-box solutions or deploy your own infra, there hasn’t been a better time to be building in voice. There is plenty more to say on any of these topics and I’m barely scratching the surface. While I’m still learning and growing in the space myself, I hope you learned something new from this post, or at least have more resources to dive deeper.
Acknowledgements
I would like to thank all the great folks I’ve met in the community that have advised and supported me with my learning. Kwindla from Daily, Edgars and the rest of the Speechmatics team for the support in building Tambourine. Travis for encouragement on writing this. And everyone else I have met building in the voice community!