March 15, 2026 · 6 min read · Elias Teubner
#ai #python #voice-ai #case-study

How I Built an AI Phone Agent That Actually Works

What if your business never missed a customer call? Not a chatbot. Not "press 1 for sales." An actual AI agent that picks up the phone, understands the caller, verifies their identity, takes their order, and routes it — all in natural German conversation. That's what I built. Here's how.

The Architecture

The system is a pipeline of specialized components, each doing one thing well:

  1. 3CX PBX handles the phone line — receives the call, manages the SIP connection
  2. LiveKit manages the real-time audio stream — WebRTC for ultra-low latency
  3. Python agent processes the speech, runs the LLM, generates responses
  4. Azure VM (Linux) hosts the whole thing

The caller dials a normal phone number. 3CX picks up and routes the audio to LiveKit. The Python agent listens, thinks, and responds — all in real-time. The caller doesn't know they're talking to an AI (at least not immediately).
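The listen-think-respond loop can be sketched as a minimal async pipeline. This is not the actual LiveKit API — `stt`, `llm`, and `tts` here are injected stand-ins for whatever providers sit behind each stage:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

async def handle_turn(
    audio_frames: AsyncIterator[bytes],
    stt: Callable[[AsyncIterator[bytes]], Awaitable[str]],
    llm: Callable[[str], Awaitable[str]],
    tts: Callable[[str], Awaitable[bytes]],
) -> bytes:
    """One conversational turn: caller audio in, agent audio out."""
    transcript = await stt(audio_frames)   # 1. speech-to-text on the caller's audio
    reply = await llm(transcript)          # 2. LLM produces the response text
    return await tts(reply)                # 3. text-to-speech back onto the call
```

In the real system the stages overlap rather than run strictly in sequence (see the latency section below), but the data flow is the same.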

The Smart Parts: Fuzzy Customer Verification

When someone calls, they often say something like "Hi, this is Palfinger from Salzburg." The system needs to match that against a customer database. People don't say their exact company name, though — they abbreviate, use nicknames, mumble.

That's why I built a confidence-based fuzzy matching system with three thresholds:

  • 86%+ match — Auto-verified. The system is confident enough to proceed.
  • 72-85% match — Ask for clarification. "Did you mean Palfinger GmbH in Salzburg?"
  • Below 72% — Can't verify. Route to a human.

These thresholds are configurable per deployment — some clients want stricter matching, some are fine with looser bounds.
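The three-band decision logic is simple once you have a similarity score. The production matcher isn't shown in the post; as a stand-in, here is a sketch using the standard library's `difflib.SequenceMatcher` (a ratio-based scorer — swap in rapidfuzz or similar if you need better performance), with the thresholds kept configurable:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class MatchThresholds:
    auto_verify: float = 0.86  # at or above: proceed without asking
    clarify: float = 0.72      # at or above: ask "Did you mean ...?"

def classify_match(spoken: str, candidate: str,
                   t: MatchThresholds = MatchThresholds()) -> str:
    """Classify one candidate as 'verified', 'clarify', or 'escalate'."""
    score = SequenceMatcher(None, spoken.lower(), candidate.lower()).ratio()
    if score >= t.auto_verify:
        return "verified"
    if score >= t.clarify:
        return "clarify"
    return "escalate"  # below the clarify band: route to a human
```

A caller saying just "Palfinger" against the database entry "Palfinger GmbH" lands in the clarify band, which is exactly the behavior you want: confirm, don't guess.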

The Dashboard

You can't deploy a phone agent and hope for the best. You need visibility. That's why I built a full web dashboard with:

  • Real-time KPIs — calls today, verification rate, average call duration, fuzzy decisions
  • Complete call log — every call with full transcript, timestamps, and verification results
  • Configuration panel — edit the system prompt, welcome message, and fuzzy thresholds without touching code

Azure Blob Storage holds everything. Before every call, the agent pulls the latest config — changes in the dashboard take effect immediately, no redeploy needed.
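The per-call config refresh can stay very small. In this sketch, `fetch_blob` stands in for the actual Blob Storage download (e.g. a `BlobClient.download_blob()` call), and the field names are illustrative, not the real schema:

```python
import json
from typing import Callable

# Illustrative defaults; the real schema lives in the dashboard's config blob.
DEFAULTS = {
    "welcome_message": "Guten Tag!",
    "fuzzy_auto_verify": 0.86,
    "fuzzy_clarify": 0.72,
}

def load_call_config(fetch_blob: Callable[[], bytes]) -> dict:
    """Pull the latest dashboard config before a call; fall back to defaults."""
    try:
        remote = json.loads(fetch_blob().decode("utf-8"))
    except Exception:
        remote = {}  # storage hiccup: answer the call with defaults, don't drop it
    return {**DEFAULTS, **remote}
```

Because the merge happens at call time, a dashboard edit is live on the very next call with no redeploy.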

The Hard Part: Latency

Voice AI lives or dies on latency. The difference between "natural conversation" and "awkward AI pause" is a few hundred milliseconds:

  1. Speech-to-text needs to be fast and accurate
  2. LLM inference takes time, especially for complex responses
  3. Text-to-speech needs to sound natural, not robotic
  4. Every step has to land in under 500ms or the conversation feels broken

LiveKit's WebRTC pipeline is the hero here. It keeps the audio stream real-time while the agent processes in parallel. I also stream responses — the agent starts speaking before the full answer exists, the way a human starts a sentence before knowing exactly how it ends.
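The streaming trick boils down to flushing text to TTS at sentence boundaries instead of waiting for the full LLM response. A minimal sketch of that chunking (assumed simplification — real sentence splitting for German needs to handle abbreviations like "GmbH." and "z.B."):

```python
import re
from typing import Iterable, Iterator

# A sentence ends at . ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the LLM token stream contains them,
    so TTS can start speaking before the full answer exists."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        while True:
            parts = SENTENCE_END.split(buffer, maxsplit=1)
            if len(parts) < 2:
                break  # no complete sentence buffered yet
            sentence, buffer = parts
            yield sentence.strip()
    if buffer.strip():
        yield buffer.strip()  # flush the trailing fragment
```

Each yielded sentence goes straight to TTS while the LLM keeps generating, which cuts perceived latency to roughly the time-to-first-sentence rather than time-to-full-answer.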

The German Language Challenge

Most voice AI demos are in English. German is harder:

  • Compound words — "Arbeitszeiterfassungsgesetz" is one word. Good luck with speech-to-text.
  • Formal vs. informal — the agent needs to use "Sie" (formal) in business contexts
  • Dialects — Austrian German is different from standard German. "Servus" vs. "Hallo."

Getting the LLM to respond in natural, formal German without sounding robotic took a lot of prompt iteration. I use Claude for coding and the core conversation logic, Gemini for research and product testing, and Copilot for anything Microsoft/Azure-related — debugging Azure deployments, configuring services, understanding platform-specific behavior. Each tool has its lane.
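The production prompt isn't in this post, but an illustrative fragment of the kind of constraints that came out of that iteration might look like this (every rule here is a plausible example, not the deployed prompt):

```python
# Illustrative system-prompt fragment for formal, spoken German (not the
# production prompt).
SYSTEM_PROMPT = """\
Du bist der Telefonassistent der Firma. Regeln:
- Sprich den Anrufer immer mit "Sie" an, niemals mit "du".
- Antworte in kurzen, gesprochenen Sätzen; keine Aufzählungen, keine Sonderzeichen.
- Verstehe österreichische Begrüßungen ("Servus", "Grüß Gott") als normales "Hallo".
- Wenn du einen Firmennamen nicht sicher verstanden hast, frage höflich nach.
"""
```

The "short spoken sentences, no special characters" rule matters more than it looks: anything list-shaped or markdown-shaped in the LLM output sounds wrong once it hits TTS.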

Azure Infrastructure

The deployment uses several Azure services, each chosen for a reason:

  • Key Vault — secrets management (API keys, database credentials)
  • Blob Storage — config files, call transcripts, audio recordings
  • Web Apps — the dashboard (always-on, fast, scales automatically)
  • Linux VM — the agent runtime (needs more control, GPU-adjacent compute)

Where It Stands

The agent handles real calls in production. The known edge cases — heavy dialect, fast speakers — are tracked, measured, and fed back into prompt iteration and matching refinement. That feedback loop is intentional. Every call makes the system sharper.

Our clients don't miss calls anymore. The product does what it's supposed to do, it runs in production, and it's getting better with every interaction. That's not luck — that's the system working as designed.

— Elias
