Testing Voice AI Agents: Vapi, Retell, Bland, and AI Phone Calls
Voice AI agents handle phone calls, customer support, and outbound outreach using LLMs. Testing them is fundamentally different from testing chat applications: you're testing speech recognition accuracy, conversation flow correctness, tool call execution from voice context, and latency — all in a medium where failures are immediately obvious to real humans on the other end of a call.
Key Takeaways
Voice AI has three layers to test: telephony, LLM reasoning, and tool execution. The telephony layer (audio quality, latency, ASR accuracy) is separate from the LLM reasoning layer (correct responses, conversation flow) and tool execution layer (booking systems, CRMs). Each fails differently.
Latency is a user experience metric, not just a performance metric. A 3-second response feels broken in a phone call. Test time-to-first-audio, not just total response time.
Test with realistic conversation simulations, not just unit API calls. A voice agent that handles individual questions correctly can still fail when they're combined in a real conversation flow.
Webhook testing is critical. Vapi/Retell/Bland notify your server about call events (started, ended, tool calls). Your webhook handler needs dedicated tests — it's the integration point where most bugs live.
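The time-to-first-audio point above can be explored without any telephony. The sketch below times how long a streaming source takes to yield its first chunk compared with the whole stream. `measure_stream_latency` and `fake_tts_stream` are hypothetical names, and the fake stream is only a stand-in for whatever audio or token stream your provider exposes; it is not a Vapi, Retell, or Bland API.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream_latency(stream: Iterable[bytes]) -> Tuple[float, float]:
    """Return (time_to_first_chunk_ms, total_ms) for any iterable of chunks."""
    start = time.perf_counter()
    first_chunk_ms = None
    for _ in stream:
        if first_chunk_ms is None:
            # The caller starts hearing audio here, so this is the number that matters
            first_chunk_ms = (time.perf_counter() - start) * 1000
    total_ms = (time.perf_counter() - start) * 1000
    if first_chunk_ms is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_ms, total_ms


def fake_tts_stream(first_delay_s: float, chunk_delay_s: float, chunks: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider stream: a slow start, then steady chunks."""
    time.sleep(first_delay_s)
    yield b"\x00" * 160
    for _ in range(chunks - 1):
        time.sleep(chunk_delay_s)
        yield b"\x00" * 160
```

A response can have an acceptable total time and still feel slow if the first chunk arrives late, which is why the two numbers are reported separately.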
Voice AI Agent Architecture
Before testing, understand what you're actually testing:
User (phone) → Telephony (Twilio/carrier) → ASR (speech-to-text)
    → LLM (Vapi/Retell/Bland) → TTS (text-to-speech) → User (phone)
        ↓
    Tool calls (webhooks to your server)
        ↓
    Your backend (CRM, booking, database)

Each arrow is a failure point. Your testing strategy should cover:
- Configuration tests — agent prompt, voice, tools are set up correctly
- Conversation flow tests — agent handles expected scenarios
- Tool call tests — webhook calls are correct, your handler works
- Latency tests — response times meet voice UX requirements
- Edge case tests — interruptions, silence, unclear speech
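Edge cases such as unclear speech can be approximated at the text level before any audio is involved. The sketch below perturbs a clean utterance roughly the way a noisy transcript might look, with dropped words and inserted fillers. `simulate_asr_noise`, `noisy_variants`, and their parameters are hypothetical names for illustration, and real ASR errors are more varied than this.

```python
import random
from typing import List, Optional

FILLERS = ["um", "uh", "like"]


def simulate_asr_noise(
    utterance: str,
    drop_rate: float = 0.15,
    filler_rate: float = 0.15,
    seed: Optional[int] = None,
) -> str:
    """Randomly drop words and insert fillers; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    words = utterance.split()
    noisy: List[str] = []
    for word in words:
        if rng.random() < filler_rate:
            noisy.append(rng.choice(FILLERS))
        if rng.random() < drop_rate:
            continue  # simulate a word the recognizer missed
        noisy.append(word)
    # Never return an empty utterance for non-empty input
    if not noisy and words:
        noisy.append(words[0])
    return " ".join(noisy)


def noisy_variants(utterance: str, count: int = 5) -> List[str]:
    """Generate a deterministic set of perturbed versions of one utterance."""
    return [simulate_asr_noise(utterance, seed=i) for i in range(count)]
```

Each variant can then be sent through the same conversation tests as the clean utterance, checking that the agent still recognizes the intent or asks for clarification.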
Testing with Vapi
Configuration Validation
import os

import pytest
import requests

# Read the key from the environment (as the CI workflow does); the literal is only a placeholder
VAPI_API_KEY = os.environ.get("VAPI_API_KEY", "your-vapi-api-key")
VAPI_BASE_URL = "https://api.vapi.ai"
AGENT_ID = "your-agent-id"

headers = {
    "Authorization": f"Bearer {VAPI_API_KEY}",
    "Content-Type": "application/json"
}


def get_assistant(assistant_id: str) -> dict:
    response = requests.get(
        f"{VAPI_BASE_URL}/assistant/{assistant_id}",
        headers=headers
    )
    response.raise_for_status()
    return response.json()


def test_assistant_config_is_correct():
    """Verify assistant configuration before running any calls"""
    assistant = get_assistant(AGENT_ID)
    # Voice configuration
    assert assistant.get("voice") is not None, "Voice must be configured"
    assert assistant["voice"].get("voiceId") is not None
    # Model configuration
    model = assistant.get("model", {})
    assert model.get("provider") in ["openai", "anthropic", "groq"]
    assert model.get("model") is not None
    # System prompt
    assert model.get("messages") or model.get("systemPrompt"), (
        "Assistant must have a system prompt"
    )


def test_assistant_has_required_tools():
    """Verify all expected tools are configured"""
    assistant = get_assistant(AGENT_ID)
    model = assistant.get("model", {})
    tool_names = [t.get("function", {}).get("name") for t in model.get("tools", [])]
    required_tools = ["book_appointment", "check_availability", "get_customer_info"]
    for tool in required_tools:
        assert tool in tool_names, f"Required tool '{tool}' not found in: {tool_names}"
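Checks like these can also be collected into one pure function, so a single run reports every misconfiguration instead of stopping at the first failed assert. This is a sketch: `validate_assistant_config` is a hypothetical helper, and the field names simply mirror the ones used in the tests above rather than a complete account of the Vapi assistant schema.

```python
from typing import Dict, Iterable, List


def validate_assistant_config(assistant: Dict, required_tools: Iterable[str] = ()) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passed."""
    problems: List[str] = []

    voice = assistant.get("voice") or {}
    if not voice.get("voiceId"):
        problems.append("voice.voiceId is missing")

    model = assistant.get("model") or {}
    if not model.get("provider"):
        problems.append("model.provider is missing")
    if not model.get("model"):
        problems.append("model.model is missing")
    if not (model.get("messages") or model.get("systemPrompt")):
        problems.append("system prompt is missing")

    # Collect the names of all configured tools, tolerating missing keys
    configured = {
        (tool.get("function") or {}).get("name")
        for tool in model.get("tools") or []
    }
    for name in required_tools:
        if name not in configured:
            problems.append(f"required tool '{name}' is not configured")

    return problems
```

Because it takes a plain dict, the function can be unit-tested offline and then reused in a live test, for example `problems = validate_assistant_config(get_assistant(AGENT_ID), required_tools)` followed by `assert not problems, "\n".join(problems)`.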


def test_first_message_is_set():
    """First message should greet the caller, not be empty"""
    assistant = get_assistant(AGENT_ID)
    # Treat a missing or null firstMessage as an empty string
    first_message = assistant.get("firstMessage") or ""
    assert len(first_message) > 10, "First message should be a real greeting"
    assert first_message[0].isupper(), "First message should start with capital letter"

Programmatic Call Testing via Vapi API
import time


def start_test_call(phone_number: str, assistant_id: str) -> dict:
    """Initiate a test call through Vapi"""
    payload = {
        "phoneNumberId": "your-phone-number-id",
        "assistantId": assistant_id,
        "customer": {
            "number": phone_number,
            "name": "Test User"
        }
    }
    response = requests.post(
        f"{VAPI_BASE_URL}/call/phone",
        headers=headers,
        json=payload
    )
    response.raise_for_status()
    return response.json()


def get_call_details(call_id: str) -> dict:
    """Get call details after it completes"""
    response = requests.get(
        f"{VAPI_BASE_URL}/call/{call_id}",
        headers=headers
    )
    response.raise_for_status()
    return response.json()


def wait_for_call_completion(call_id: str, timeout_seconds: int = 120) -> dict:
    """Poll until call completes"""
    start_time = time.time()
    while time.time() - start_time < timeout_seconds:
        call = get_call_details(call_id)
        if call.get("status") in ["ended", "failed"]:
            return call
        time.sleep(3)
    raise TimeoutError(f"Call {call_id} did not complete within {timeout_seconds}s")


# Note: These tests require real telephony and incur costs
# Run with: pytest -m "voice_e2e" only in dedicated test environments
@pytest.mark.voice_e2e
def test_basic_call_completes():
    """Test that a call can be initiated and completes"""
    call = start_test_call("+15551234567", AGENT_ID)
    call_id = call["id"]
    completed_call = wait_for_call_completion(call_id)
    assert completed_call["status"] == "ended"
    assert completed_call.get("endedReason") != "error"

Testing with Vapi's Chat API (No Phone Required)
Vapi provides a chat interface to test conversation logic without making real calls:
from typing import Optional


def chat_with_assistant(
    assistant_id: str,
    messages: list[dict],
    session_id: Optional[str] = None
) -> dict:
    """Test conversation logic via Vapi chat API"""
    payload = {
        "assistantId": assistant_id,
        # Only the latest message is sent; multi-turn context relies on sessionId
        "input": messages[-1]["content"],
    }
    if session_id:
        payload["sessionId"] = session_id
    response = requests.post(
        f"{VAPI_BASE_URL}/chat",
        headers=headers,
        json=payload
    )
    response.raise_for_status()
    return response.json()


def test_greeting_response():
    """Agent should greet caller appropriately"""
    response = chat_with_assistant(AGENT_ID, [
        {"role": "user", "content": "Hello"}
    ])
    output = response.get("output", "").lower()
    assert any(word in output for word in ["hello", "hi", "welcome", "good morning"])
    assert len(output) > 20, "Greeting should be substantive"


def test_appointment_booking_intent():
    """Agent should route to booking tool when asked to book"""
    response = chat_with_assistant(AGENT_ID, [
        {"role": "user", "content": "I'd like to book an appointment for next Tuesday"}
    ])
    # Either the agent responds with available slots or calls the booking tool
    output = response.get("output", "").lower()
    tool_calls = response.get("toolCalls", [])
    booking_handled = (
        "available" in output or
        "appointment" in output or
        any(tc.get("name") == "check_availability" for tc in tool_calls)
    )
    assert booking_handled, (
        f"Agent should handle booking intent. Output: {output[:200]}, Tools: {tool_calls}"
    )


def test_out_of_scope_redirect():
    """Agent should redirect off-topic questions"""
    response = chat_with_assistant(AGENT_ID, [
        {"role": "user", "content": "Can you write me a poem about cats?"}
    ])
    output = response.get("output", "").lower()
    # Should politely redirect, not write a poem
    assert "poem" not in output or "sorry" in output or "help you with" in output
    # Should mention what it CAN help with
    assert any(word in output for word in ["appointment", "schedule", "book", "help you"])

Testing Retell AI
Retell provides a similar programmatic testing API:
RETELL_API_KEY = "key_xxxxxxxx"
RETELL_BASE_URL = "https://api.retellai.com"
RETELL_AGENT_ID = "your-retell-agent-id"

retell_headers = {
    "Authorization": f"Bearer {RETELL_API_KEY}",
    "Content-Type": "application/json"
}


def test_retell_agent_exists_and_configured():
    """Verify Retell agent is properly configured"""
    response = requests.get(
        f"{RETELL_BASE_URL}/get-agent/{RETELL_AGENT_ID}",
        headers=retell_headers
    )
    assert response.status_code == 200
    agent = response.json()
    assert agent.get("llm_websocket_url") or agent.get("response_engine"), (
        "Agent must have an LLM endpoint configured"
    )
    assert agent.get("voice_id"), "Agent must have a voice configured"


def create_retell_web_call(agent_id: str) -> dict:
    """Create a web call for testing (no phone needed)"""
    response = requests.post(
        f"{RETELL_BASE_URL}/create-web-call",
        headers=retell_headers,
        json={"agent_id": agent_id}
    )
    response.raise_for_status()
    return response.json()


def test_retell_web_call_creation():
    """Web call should be created with access token"""
    call_data = create_retell_web_call(RETELL_AGENT_ID)
    assert "call_id" in call_data
    assert "access_token" in call_data
    assert call_data["call_status"] in ["created", "registered"]

Testing Webhooks (Critical Path)
Your webhook handler receives all call events. This is where most production bugs live:
from fastapi import FastAPI, HTTPException, Request
from fastapi.testclient import TestClient
import json

app = FastAPI()

# Store received events for testing
received_events = []


@app.post("/webhook/vapi")
async def vapi_webhook(request: Request):
    try:
        payload = await request.json()
    except ValueError:
        # Malformed JSON body: answer 400 instead of crashing with a 500
        raise HTTPException(status_code=400, detail="Invalid JSON payload")
    received_events.append(payload)
    message_type = payload.get("message", {}).get("type")
    if message_type == "tool-calls":
        # Handle tool calls
        tool_calls = payload["message"]["toolCallList"]
        results = []
        for tc in tool_calls:
            result = await handle_tool_call(tc["function"]["name"], tc["function"]["arguments"])
            results.append({
                "toolCallId": tc["id"],
                "result": result
            })
        return {"results": results}
    return {"status": "ok"}


async def handle_tool_call(name: str, arguments: dict) -> str:
    if name == "check_availability":
        return json.dumps({"available_slots": ["Tuesday 2pm", "Wednesday 10am"]})
    elif name == "book_appointment":
        return json.dumps({"confirmed": True, "booking_id": "BK-12345"})
    return json.dumps({"error": "Unknown tool"})
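The handler above assumes that tool arguments arrive as a dict. Depending on the provider and SDK version, they may instead arrive serialized as a JSON string; that is an assumption worth checking against the payloads you actually receive, not a documented guarantee. A defensive sketch, with `normalize_tool_arguments` as a hypothetical helper name:

```python
import json
from typing import Any, Dict


def normalize_tool_arguments(raw: Any) -> Dict[str, Any]:
    """Accept tool arguments as a dict, a JSON string, or None and return a dict."""
    if raw is None or raw == "":
        return {}
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"tool arguments are not valid JSON: {exc}") from exc
    if isinstance(raw, dict):
        return raw
    # Lists, numbers, and other JSON values are not usable as keyword-style arguments
    raise ValueError("tool arguments must be a JSON object")
```

Calling this at the top of a tool handler keeps the rest of the handler working on one shape, and the `ValueError` can be turned into an error result for the agent rather than a failed request.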


# Tests for webhook handler
webhook_client = TestClient(app)


def test_webhook_handles_tool_call():
    """Webhook should execute tool calls and return results"""
    tool_call_event = {
        "message": {
            "type": "tool-calls",
            "toolCallList": [
                {
                    "id": "call_abc123",
                    "function": {
                        "name": "check_availability",
                        "arguments": {"date": "next Tuesday"}
                    }
                }
            ]
        }
    }
    response = webhook_client.post("/webhook/vapi", json=tool_call_event)
    assert response.status_code == 200
    data = response.json()
    assert "results" in data
    assert len(data["results"]) == 1
    assert data["results"][0]["toolCallId"] == "call_abc123"
    # Verify the result is valid JSON with available slots
    result = json.loads(data["results"][0]["result"])
    assert "available_slots" in result


def test_webhook_handles_call_ended_event():
    """Webhook should process call end without errors"""
    call_ended_event = {
        "message": {
            "type": "end-of-call-report",
            "call": {
                "id": "call-test-001",
                "status": "ended",
                "endedReason": "customer-ended-call",
                "durationSeconds": 120
            },
            "summary": "Customer booked appointment for Tuesday 2pm",
            "transcript": "User: Hello\nAssistant: Hi! How can I help?"
        }
    }
    response = webhook_client.post("/webhook/vapi", json=call_ended_event)
    assert response.status_code == 200
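The `transcript` field in the sample end-of-call payload is a newline-separated string of `Speaker: text` lines. Assuming real payloads follow the same shape (the sample above is the only evidence used here), a small parser makes it easier to assert on what each side actually said. `parse_transcript` and `assistant_said` are hypothetical helper names.

```python
from typing import List, Tuple

KNOWN_SPEAKERS = {"User", "Assistant"}


def parse_transcript(transcript: str) -> List[Tuple[str, str]]:
    """Split 'Speaker: text' lines into (speaker, text) pairs."""
    turns: List[Tuple[str, str]] = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if sep and speaker.strip() in KNOWN_SPEAKERS:
            turns.append((speaker.strip(), text.strip()))
        elif turns:
            # No recognized speaker prefix: treat the line as a continuation
            prev_speaker, prev_text = turns[-1]
            turns[-1] = (prev_speaker, f"{prev_text} {line}".strip())
    return turns


def assistant_said(turns: List[Tuple[str, str]], phrase: str) -> bool:
    """Case-insensitive check that some assistant turn contains the phrase."""
    needle = phrase.lower()
    return any(speaker == "Assistant" and needle in text.lower() for speaker, text in turns)
```

With this in place, a webhook test can go beyond the status code and check, for instance, that the stored report contains an assistant turn confirming the booking.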


def test_webhook_rejects_invalid_payload():
    """Malformed webhook payloads should return 400"""
    # content= sends the raw body as-is; the handler must reject it rather than crash
    response = webhook_client.post("/webhook/vapi", content="not json",
                                   headers={"content-type": "application/json"})
    assert response.status_code in [400, 422]


def test_webhook_handles_unknown_tool_gracefully():
    """Unknown tool calls should return error result, not 500"""
    event = {
        "message": {
            "type": "tool-calls",
            "toolCallList": [
                {
                    "id": "call_xyz",
                    "function": {
                        "name": "nonexistent_tool",
                        "arguments": {}
                    }
                }
            ]
        }
    }
    response = webhook_client.post("/webhook/vapi", json=event)
    assert response.status_code == 200  # Webhook should not crash
    data = response.json()
    result = json.loads(data["results"][0]["result"])
    assert "error" in result

Testing Conversation Flows End-to-End
Test complete conversation scenarios using Vapi's chat API:
def run_conversation_flow(turns: list[str], agent_id: str) -> list[dict]:
    """Run a multi-turn conversation and return all responses"""
    responses = []
    session_id = f"test-session-{int(time.time())}"
    for user_message in turns:
        response = chat_with_assistant(agent_id, [
            {"role": "user", "content": user_message}
        ], session_id=session_id)
        responses.append(response)
        time.sleep(0.5)  # Small delay between turns
    return responses


def test_complete_booking_flow():
    """Test the full appointment booking conversation"""
    turns = [
        "Hi, I need to book an appointment",
        "I'm available next Tuesday afternoon",
        "2pm works great for me",
        "My name is John Smith, phone 555-0100"
    ]
    responses = run_conversation_flow(turns, AGENT_ID)
    assert len(responses) == 4
    # Turn 1: Should ask what kind of appointment
    turn1_output = responses[0].get("output", "").lower()
    assert any(word in turn1_output for word in ["what", "type", "appointment", "help"])
    # Turn 4 (confirmation): Should confirm booking
    turn4_output = responses[3].get("output", "").lower()
    assert any(word in turn4_output for word in ["confirmed", "booked", "scheduled", "appointment"])
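The per-turn keyword checks in a flow test like this one can also be written as data, with the transport injected, so the same scenario runs against a canned agent offline and against the chat API in CI. The following is a sketch under that assumption; `Turn`, `run_scenario`, and `scripted_agent` are hypothetical names, not part of any provider SDK.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Turn:
    user_says: str
    expect_any: Sequence[str]  # at least one keyword must appear in the reply


def run_scenario(turns: Sequence[Turn], send: Callable[[str], str]) -> List[str]:
    """Send each user message via `send` and return descriptions of failed turns."""
    failures: List[str] = []
    for index, turn in enumerate(turns, start=1):
        reply = send(turn.user_says).lower()
        if not any(keyword.lower() in reply for keyword in turn.expect_any):
            failures.append(
                f"turn {index}: expected one of {list(turn.expect_any)} in reply {reply[:80]!r}"
            )
    return failures


def scripted_agent(replies: Sequence[str]) -> Callable[[str], str]:
    """Offline stand-in for the chat API: returns canned replies in order."""
    remaining = iter(replies)
    return lambda _message: next(remaining)
```

In a live test, `send` would be a small wrapper that calls the chat helper with a shared session ID and returns the `output` field; reporting all failed turns at once makes a broken flow easier to diagnose than a single failed assert.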


def test_cancellation_flow():
    """Test that the agent can handle a cancellation request"""
    turns = [
        "I need to cancel my appointment",
        "My booking ID is BK-99123",
        "Yes, please cancel it"
    ]
    responses = run_conversation_flow(turns, AGENT_ID)
    # Final response should confirm cancellation
    final_output = responses[-1].get("output", "").lower()
    assert any(word in final_output for word in ["cancelled", "canceled", "confirmed", "done"])

Latency Testing
Voice AI has strict latency requirements. Response delays above 1.5-2 seconds feel broken:
import statistics


def measure_chat_latency(message: str, agent_id: str) -> float:
    """Measure response time for a single chat message in ms"""
    start = time.perf_counter()
    chat_with_assistant(agent_id, [{"role": "user", "content": message}])
    return (time.perf_counter() - start) * 1000
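When only a handful of requests are timed, a percentile depends heavily on how it is computed: with five samples, any reasonable P95 is simply the slowest response. One common definition is the nearest-rank method, sketched below; `percentile_nearest_rank` is a hypothetical helper name, and a meaningful P95 needs considerably more than five samples.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; multiplying before dividing avoids float error for whole-number pct
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

Collecting 20 or more latencies per run, or aggregating across runs, gives the P95 assertion something to distinguish from the maximum.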


@pytest.mark.performance
def test_chat_response_latency():
    """Chat round-trip latency should meet voice UX requirements"""
    LATENCY_SLO_MS = 2000  # 2 seconds max for voice
    test_messages = [
        "Hello",
        "I need to book an appointment",
        "What times are available tomorrow?",
        "Can I reschedule?",
        "Thank you, goodbye"
    ]
    latencies = []
    for msg in test_messages:
        latency = measure_chat_latency(msg, AGENT_ID)
        latencies.append(latency)
        print(f"'{msg[:30]}' → {latency:.0f}ms")
    avg = statistics.mean(latencies)
    # With only 5 samples this index is 4, so "P95" here is simply the slowest response
    p95 = sorted(latencies)[int(len(latencies) * 0.95)]
    print(f"\nAverage: {avg:.0f}ms, P95: {p95:.0f}ms")
    assert avg < LATENCY_SLO_MS, f"Average latency {avg:.0f}ms exceeds {LATENCY_SLO_MS}ms SLO"
    assert p95 < LATENCY_SLO_MS * 1.5, f"P95 latency {p95:.0f}ms too high"

CI Pipeline for Voice Agent Tests
# .github/workflows/voice-agent-tests.yml
name: Voice Agent Tests

on:
  push:
    branches: [main]
    paths:
      - 'voice-agent/**'
      - 'webhooks/**'
  pull_request:
    paths:
      - 'voice-agent/**'
      - 'webhooks/**'

jobs:
  webhook-tests:
    # Fast, no external calls, run on every PR
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest fastapi httpx
      - run: pytest tests/webhooks/ -v

  agent-config-tests:
    # Validate agent config, read-only API calls
    runs-on: ubuntu-latest
    env:
      VAPI_API_KEY: ${{ secrets.VAPI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest requests
      - run: pytest tests/agent_config/ -v

  conversation-flow-tests:
    # Full conversation tests, run on main only
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    env:
      VAPI_API_KEY: ${{ secrets.VAPI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest requests
      - run: pytest tests/conversation_flows/ -v -m "not voice_e2e"

End-to-End Monitoring
After deployment, monitor your voice agent continuously. Key metrics:
- Call completion rate: % of calls that end cleanly vs. dropped
- Tool call success rate: % of tool calls that return valid results
- First response latency: Time to first audio byte
- Conversation completion rate: % of calls that achieve the goal
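These four metrics can be computed from whatever per-call records your webhook handler stores. The sketch below assumes a hypothetical record shape (`ended_cleanly`, `goal_achieved`, `tool_calls`, `tool_calls_ok`, `first_audio_ms`); none of these field names come from a provider API, so map them from your own end-of-call data.

```python
import statistics
from typing import Dict, Iterable, Optional


def percentage(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there is nothing to divide by."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def summarize_calls(calls: Iterable[Dict]) -> Dict[str, Optional[float]]:
    """Aggregate per-call records into the monitoring metrics listed above."""
    records = list(calls)
    tool_total = sum(r.get("tool_calls", 0) for r in records)
    tool_ok = sum(r.get("tool_calls_ok", 0) for r in records)
    latencies = [r["first_audio_ms"] for r in records if r.get("first_audio_ms") is not None]
    return {
        "call_completion_rate": percentage(
            sum(1 for r in records if r.get("ended_cleanly")), len(records)
        ),
        "tool_call_success_rate": percentage(tool_ok, tool_total),
        "median_first_audio_ms": statistics.median(latencies) if latencies else None,
        "conversation_completion_rate": percentage(
            sum(1 for r in records if r.get("goal_achieved")), len(records)
        ),
    }
```

Running this over a sliding window of recent calls and alerting when a rate drops below a threshold is one simple way to turn the list into a health check.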
For automated end-to-end voice agent monitoring — testing that your full call flow still works after code changes — HelpMeTest can run scheduled health checks that simulate the webhook interactions your voice agent depends on, alerting you when tool call handlers break before users call in.
Key Takeaways
Voice AI testing in this guide comes down to three kinds of tests: configuration validation (is the agent set up correctly?), webhook testing (does your server handle call events?), and conversation flow testing (does the agent handle real scenarios?). Always cover webhooks with unit tests; they are the highest-ROI testing investment for voice agents. Use the chat API to test conversation flows without incurring real call costs. Add latency tests, since delays of 2 seconds or more are immediately noticeable on a phone call. Run webhook tests on every PR and conversation flow tests on main.