Voice AI Maven Course Notes
Introduction
This is my personal write-up and reflection from taking the “Voice AI and Voice Agents” course by Kwindla Kramer on Maven.
I’m sharing my main learnings from the course (Part 1), in the spirit of “learning in public”.
I also put together the one thing I missed while following the course: a map of the voice AI landscape (Part 2).
Part 1 - Main learnings
During the course, several concepts initially confused me but became clearer through hands-on experience and research. Here are the main areas where I had breakthrough moments:
1. Transport Layer
When working with voice, lag is worse than degraded audio.
WebSocket is TCP-based. TCP enforces that all packets are delivered in order, triggering a retransmission (and blocking everything behind the missing packet, i.e. head-of-line blocking). For reliable server-to-server connections (like Twilio to your bot server), WebSocket is fine.
But for client-to-server audio, always use WebRTC. WebRTC is UDP-based, so missing packets are simply dropped instead of stalling the stream, which avoids lag on unreliable connections.
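To make the server-to-server case concrete, here is a minimal sketch of a bot server accepting call audio over WebSocket. It assumes Twilio Media Streams framing (JSON events carrying base64 mu-law audio) and a handler signature matching recent versions of the `websockets` library; the pipeline hand-off is left as a stub:

```python
# Minimal sketch: a bot server receiving call audio over WebSocket.
# Assumes Twilio Media Streams framing (JSON events, base64 mu-law audio).
import asyncio
import base64
import json

import websockets

async def handle_call(ws):
    async for message in ws:
        event = json.loads(message)
        if event.get("event") == "media":
            # Twilio sends 8 kHz mu-law audio, base64-encoded.
            chunk = base64.b64decode(event["media"]["payload"])
            # ...feed `chunk` into the STT / agent pipeline (stub)...

async def main():
    async with websockets.serve(handle_call, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
```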
Telephony: if you want to connect to a phone number, you need to reach that number through the PSTN. For more sophisticated scenarios (call hand-offs, multiple callers, …), you need SIP.
2. Achieving conversational latency
True conversational latency requires sub-500ms end-to-end processing:
Every component adds latency - transport (network routing), STT processing, LLM inference (especially time-to-first-token), TTS generation, and the return trip.
Geographic proximity matters: having local edge connections can save tens of milliseconds compared to long-haul internet routing.
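As a back-of-the-envelope check on the sub-500ms target, here is an illustrative budget for one voice-to-voice turn. The per-component numbers are rough assumptions for orientation, not measurements:

```python
# Illustrative per-component budget for one voice-to-voice turn.
# Numbers are rough assumptions for orientation, not measurements.
budget_ms = {
    "client -> server transport": 30,
    "STT (final transcript)": 150,
    "LLM time-to-first-token": 200,
    "TTS time-to-first-audio": 80,
    "server -> client transport": 30,
}
print(f"total: ~{sum(budget_ms.values())} ms")  # ~490 ms, just under 500
```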
3. Speech-to-Speech is not production-ready
S2S models can capture information that is lost in text, such as intonation and accent. But they are “not production-ready” for most use cases:
Multi-turn conversations and long contexts cause issues with generation reliability and latency.
Additionally, you lose granular control over context management (which parts of the conversation or instructions are in focus at any given time): the API handles context internally.
S2S models are not yet as mature as text-to-text models in terms of context management, tool use, and instruction following, and how to evaluate them is still an open question. But the field is just getting started.
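To make the contrast concrete, here is a minimal sketch of the explicit context control a cascaded (STT → LLM → TTS) pipeline gives you, and which current S2S APIs handle internally. The `build_context` helper and its `max_turns` parameter are hypothetical names for illustration:

```python
# In a cascaded pipeline you assemble the LLM context yourself, deciding
# exactly which instructions and turns are in focus.
def build_context(system_prompt: str, history: list[dict], max_turns: int = 6) -> list[dict]:
    # Keep only the most recent turns in focus; older ones can be
    # summarized or dropped before they degrade instruction following.
    recent = history[-max_turns:]
    return [{"role": "system", "content": system_prompt}, *recent]

history = [
    {"role": "user", "content": "Hi, I'd like to book an appointment."},
    {"role": "assistant", "content": "Sure, what day works for you?"},
]
messages = build_context("You are a scheduling assistant.", history)
# `messages` can be sent to any chat-completion-style endpoint.
```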
4. Scaffolding for long and complex conversations
There are two main approaches to architecting the conversation:
The monolithic approach uses one detailed prompt (potentially 5,000+ tokens) to handle the entire conversation flow, but risks task completion failures, tool calling issues, and degraded instruction following as context grows.
The sequential step approach breaks conversations into discrete phases using state machine patterns, allowing for context resets and targeted prompting, but risks losing context between steps and making backtracking difficult.
For simple conversations (1-2 minutes), monolithic works fine. For complex, structured workflows like patient intake, the sequential approach with proper scaffolding often proves more reliable.
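Here is a minimal sketch of the sequential-step approach as a state machine. The phase names and prompts are a hypothetical patient-intake flow:

```python
# Minimal sketch of the sequential-step approach as a state machine.
# Phase names and prompts are a hypothetical patient-intake flow.
from dataclasses import dataclass, field

PROMPTS = {
    "greeting": "Greet the caller and ask for their full name.",
    "symptoms": "Collect the caller's current symptoms, one at a time.",
    "scheduling": "Offer available appointment slots and confirm one.",
}
ORDER = ["greeting", "symptoms", "scheduling"]

@dataclass
class IntakeFlow:
    phase: str = "greeting"
    collected: dict = field(default_factory=dict)

    def system_prompt(self) -> str:
        # Only the current phase's short, targeted prompt is in focus.
        return PROMPTS[self.phase]

    def advance(self, extracted: dict) -> None:
        # Carry forward structured data, then reset conversational context.
        self.collected.update(extracted)
        i = ORDER.index(self.phase)
        if i + 1 < len(ORDER):
            self.phase = ORDER[i + 1]

flow = IntakeFlow()
flow.advance({"name": "Jane Doe"})
print(flow.phase, flow.collected)  # symptoms {'name': 'Jane Doe'}
```

The trade-off mentioned above shows up in `advance`: anything not copied into `collected` is lost when the context resets, which is exactly why backtracking gets difficult.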
Understanding the Voice AI Value Chain
End-to-End Solutions
These platforms handle the entire voice-agent pipeline: they integrate multiple models, manage the transport layer, and provide abstractions for the agent's logic and orchestration.
Companies: Vapi, LiveKit, Layercode, Pipecat Cloud
Inference Providers
These companies provide the infrastructure to run models:
Serverless infrastructure providers (Modal, Cerebrium, Baseten): they run arbitrary computation on their GPUs, which includes serving models.
AI inference providers (Groq, Fireworks, fal.ai): they serve open-source (and sometimes closed-source) models that they optimize, behind an API.
Google, OpenAI: they provide their own models through an API.
Voice Model Specialists
There are many more options than those mentioned during the course. The following website compares model performance on the main metrics of interest: https://artificialanalysis.ai/
Speech-to-text: Deepgram, Gladia
Text-to-speech: Cartesia, PlayAI
Speech-to-speech: OpenAI, Google Gemini