Testing Voice AI Agents: Vapi, Retell, Bland, and AI Phone Calls

Voice AI agents handle phone calls, customer support, and outbound outreach using LLMs. Testing them is fundamentally different from testing chat applications: you're testing speech recognition accuracy, conversation flow correctness, tool call execution from voice context, and latency — all in a medium where failures are immediately obvious to real humans on the other end of a call.

Key Takeaways

Voice AI has three layers to test: telephony, LLM reasoning, and tool execution. The telephony layer (audio quality, latency, ASR accuracy) is separate from the LLM reasoning layer (correct responses, conversation flow) and tool execution layer (booking systems, CRMs). Each fails differently.

Latency is a user experience metric, not just a performance metric. A 3-second response feels broken in a phone call. Test time-to-first-audio, not just total response time.

Test with realistic conversation simulations, not just unit API calls. A voice agent that handles individual questions correctly can still fail when they're combined in a real conversation flow.

Webhook testing is critical. Vapi/Retell/Bland notify your server about call events (started, ended, tool calls). Your webhook handler needs dedicated tests — it's the integration point where most bugs live.

Voice AI Agent Architecture

Before testing, understand what you're actually testing:

User (phone) → Telephony (Twilio/carrier) → ASR (speech-to-text) 
→ LLM (Vapi/Retell/Bland) → TTS (text-to-speech) → User (phone)
                                ↓
                    Tool calls (webhooks to your server)
                                ↓
                    Your backend (CRM, booking, database)

Each arrow is a failure point. Your testing strategy should cover:

  1. Configuration tests — agent prompt, voice, tools are set up correctly
  2. Conversation flow tests — agent handles expected scenarios
  3. Tool call tests — webhook calls are correct, your handler works
  4. Latency tests — response times meet voice UX requirements
  5. Edge case tests — interruptions, silence, unclear speech

Testing with Vapi

Configuration Validation

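import os

import requests
import pytest

# Read the key from the environment (the CI pipeline sets VAPI_API_KEY);
# the placeholder is only a fallback for local experimentation.
VAPI_API_KEY = os.environ.get("VAPI_API_KEY", "your-vapi-api-key")
VAPI_BASE_URL = "https://api.vapi.ai"
AGENT_ID = "your-agent-id"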

headers = {
    "Authorization": f"Bearer {VAPI_API_KEY}",
    "Content-Type": "application/json"
}

def get_assistant(assistant_id: str) -> dict:
    response = requests.get(
        f"{VAPI_BASE_URL}/assistant/{assistant_id}",
        headers=headers
    )
    response.raise_for_status()
    return response.json()

def test_assistant_config_is_correct():
    """Verify assistant configuration before running any calls"""
    assistant = get_assistant(AGENT_ID)
    
    # Voice configuration
    assert assistant.get("voice") is not None, "Voice must be configured"
    assert assistant["voice"].get("voiceId") is not None
    
    # Model configuration
    model = assistant.get("model", {})
    assert model.get("provider") in ["openai", "anthropic", "groq"]
    assert model.get("model") is not None
    
    # System prompt
    assert model.get("messages") or model.get("systemPrompt"), (
        "Assistant must have a system prompt"
    )

def test_assistant_has_required_tools():
    """Verify all expected tools are configured"""
    assistant = get_assistant(AGENT_ID)
    model = assistant.get("model", {})
    
    tool_names = [t.get("function", {}).get("name") for t in model.get("tools", [])]
    
    required_tools = ["book_appointment", "check_availability", "get_customer_info"]
    for tool in required_tools:
        assert tool in tool_names, f"Required tool '{tool}' not found in: {tool_names}"

def test_first_message_is_set():
    """First message should greet the caller, not be empty"""
    assistant = get_assistant(AGENT_ID)
    
    first_message = assistant.get("firstMessage", "")
    assert len(first_message) > 10, "First message should be a real greeting"
    assert first_message[0].isupper(), "First message should start with capital letter"
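
The checks above all need a live API call. One way to keep the rules themselves testable offline is to move them into a pure function that takes the assistant dictionary and returns a list of problems. The sketch below is one possible shape, not part of Vapi's SDK: the function name `validate_assistant_config` and the `ALLOWED_PROVIDERS` set are illustrative, and the field names are the ones already used in the tests above.

```python
# Hypothetical helper: validates an assistant dict without touching the network.
ALLOWED_PROVIDERS = {"openai", "anthropic", "groq"}


def validate_assistant_config(assistant: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config passed."""
    problems = []

    # Voice configuration
    voice = assistant.get("voice") or {}
    if not voice.get("voiceId"):
        problems.append("voice.voiceId is missing")

    # Model configuration
    model = assistant.get("model") or {}
    if model.get("provider") not in ALLOWED_PROVIDERS:
        problems.append(f"unexpected model provider: {model.get('provider')!r}")
    if not model.get("model"):
        problems.append("model.model is missing")

    # System prompt (either field name is accepted, as in the tests above)
    if not (model.get("messages") or model.get("systemPrompt")):
        problems.append("no system prompt found")

    # Greeting
    first_message = assistant.get("firstMessage") or ""
    if len(first_message) <= 10:
        problems.append("firstMessage is empty or too short")

    return problems
```

A live test can then assert `validate_assistant_config(get_assistant(AGENT_ID)) == []`, while the rules are exercised separately against hand-written dictionaries.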

Programmatic Call Testing via Vapi API

import time

def start_test_call(phone_number: str, assistant_id: str) -> dict:
    """Initiate a test call through Vapi"""
    payload = {
        "phoneNumberId": "your-phone-number-id",
        "assistantId": assistant_id,
        "customer": {
            "number": phone_number,
            "name": "Test User"
        }
    }
    
    response = requests.post(
        f"{VAPI_BASE_URL}/call/phone",
        headers=headers,
        json=payload
    )
    response.raise_for_status()
    return response.json()

def get_call_details(call_id: str) -> dict:
    """Get call details after it completes"""
    response = requests.get(
        f"{VAPI_BASE_URL}/call/{call_id}",
        headers=headers
    )
    response.raise_for_status()
    return response.json()

def wait_for_call_completion(call_id: str, timeout_seconds: int = 120) -> dict:
    """Poll until call completes"""
    start_time = time.time()
    while time.time() - start_time < timeout_seconds:
        call = get_call_details(call_id)
        if call.get("status") in ["ended", "failed"]:
            return call
        time.sleep(3)
    raise TimeoutError(f"Call {call_id} did not complete within {timeout_seconds}s")

# Note: These tests require real telephony and incur costs
# Run with: pytest -m "voice_e2e" only in dedicated test environments
@pytest.mark.voice_e2e
def test_basic_call_completes():
    """Test that a call can be initiated and completes"""
    call = start_test_call("+15551234567", AGENT_ID)
    call_id = call["id"]
    
    completed_call = wait_for_call_completion(call_id)
    
    assert completed_call["status"] == "ended"
    assert completed_call.get("endedReason") != "error"
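
Because the end-to-end test costs money, it can help to verify the polling logic on its own. The following is a minimal sketch, assuming the fetch, sleep, and clock functions are injected so the loop runs against canned responses; `poll_until_terminal` is a hypothetical name, and the non-terminal status strings in the example are placeholders rather than a list of Vapi's actual statuses.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"ended", "failed"}


def poll_until_terminal(
    fetch: Callable[[], dict],
    timeout_seconds: float = 120,
    interval_seconds: float = 3,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch() until it reports a terminal status or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        call = fetch()
        if call.get("status") in TERMINAL_STATUSES:
            return call
        if clock() >= deadline:
            raise TimeoutError(f"call did not complete within {timeout_seconds}s")
        sleep(interval_seconds)


# Example: simulate a call that ends on the third poll, without real sleeping.
canned = iter([
    {"status": "queued"},        # placeholder status
    {"status": "in-progress"},   # placeholder status
    {"status": "ended", "endedReason": "customer-ended-call"},
])
result = poll_until_terminal(lambda: next(canned), sleep=lambda _seconds: None)
```

In production code the same function could wrap the real lookup, for example `poll_until_terminal(lambda: get_call_details(call_id))`.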

Testing with Vapi's Chat API (No Phone Required)

Vapi provides a chat interface to test conversation logic without making real calls:

from typing import Optional

def chat_with_assistant(
    assistant_id: str,
    messages: list[dict],
    session_id: Optional[str] = None
) -> dict:
    """Test conversation logic via Vapi chat API"""
    payload = {
        "assistantId": assistant_id,
        "input": messages[-1]["content"],
    }
    if session_id:
        payload["sessionId"] = session_id
    
    response = requests.post(
        f"{VAPI_BASE_URL}/chat",
        headers=headers,
        json=payload
    )
    response.raise_for_status()
    return response.json()

def test_greeting_response():
    """Agent should greet caller appropriately"""
    response = chat_with_assistant(AGENT_ID, [
        {"role": "user", "content": "Hello"}
    ])
    
    output = response.get("output", "").lower()
    assert any(word in output for word in ["hello", "hi", "welcome", "good morning"])
    assert len(output) > 20, "Greeting should be substantive"

def test_appointment_booking_intent():
    """Agent should route to booking tool when asked to book"""
    response = chat_with_assistant(AGENT_ID, [
        {"role": "user", "content": "I'd like to book an appointment for next Tuesday"}
    ])
    
    # Either the agent responds with available slots or calls the booking tool
    output = response.get("output", "").lower()
    tool_calls = response.get("toolCalls", [])
    
    booking_handled = (
        "available" in output or
        "appointment" in output or
        any(tc.get("name") == "check_availability" for tc in tool_calls)
    )
    
    assert booking_handled, (
        f"Agent should handle booking intent. Output: {output[:200]}, Tools: {tool_calls}"
    )

def test_out_of_scope_redirect():
    """Agent should redirect off-topic questions"""
    response = chat_with_assistant(AGENT_ID, [
        {"role": "user", "content": "Can you write me a poem about cats?"}
    ])
    
    output = response.get("output", "").lower()
    
    # Should politely redirect, not write a poem
    assert "poem" not in output or "sorry" in output or "help you with" in output
    # Should mention what it CAN help with
    assert any(word in output for word in ["appointment", "schedule", "book", "help you"])

Testing Retell AI

Retell provides a similar programmatic testing API:

RETELL_API_KEY = "key_xxxxxxxx"
RETELL_BASE_URL = "https://api.retellai.com"
RETELL_AGENT_ID = "your-retell-agent-id"

retell_headers = {
    "Authorization": f"Bearer {RETELL_API_KEY}",
    "Content-Type": "application/json"
}

def test_retell_agent_exists_and_configured():
    """Verify Retell agent is properly configured"""
    response = requests.get(
        f"{RETELL_BASE_URL}/get-agent/{RETELL_AGENT_ID}",
        headers=retell_headers
    )
    assert response.status_code == 200
    
    agent = response.json()
    assert agent.get("llm_websocket_url") or agent.get("response_engine"), (
        "Agent must have an LLM endpoint configured"
    )
    assert agent.get("voice_id"), "Agent must have a voice configured"

def create_retell_web_call(agent_id: str) -> dict:
    """Create a web call for testing (no phone needed)"""
    response = requests.post(
        f"{RETELL_BASE_URL}/create-web-call",
        headers=retell_headers,
        json={"agent_id": agent_id}
    )
    response.raise_for_status()
    return response.json()

def test_retell_web_call_creation():
    """Web call should be created with access token"""
    call_data = create_retell_web_call(RETELL_AGENT_ID)
    
    assert "call_id" in call_data
    assert "access_token" in call_data
    assert call_data["call_status"] in ["created", "registered"]

Testing Webhooks (Critical Path)

Your webhook handler receives all call events. This is where most production bugs live:

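from fastapi import FastAPI, HTTPException, Request
from fastapi.testclient import TestClient
import json

app = FastAPI()

# Store received events for testing
received_events = []

@app.post("/webhook/vapi")
async def vapi_webhook(request: Request):
    # Reject bodies that are not JSON objects with a 400 instead of crashing with a 500
    try:
        payload = await request.json()
    except ValueError:
        raise HTTPException(status_code=400, detail="Invalid JSON payload")
    if not isinstance(payload, dict):
        raise HTTPException(status_code=400, detail="Payload must be a JSON object")
    received_events.append(payload)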
    
    message_type = payload.get("message", {}).get("type")
    
    if message_type == "tool-calls":
        # Handle tool calls
        tool_calls = payload["message"]["toolCallList"]
        results = []
        for tc in tool_calls:
            result = await handle_tool_call(tc["function"]["name"], tc["function"]["arguments"])
            results.append({
                "toolCallId": tc["id"],
                "result": result
            })
        return {"results": results}
    
    return {"status": "ok"}

async def handle_tool_call(name: str, arguments: dict) -> str:
    if name == "check_availability":
        return json.dumps({"available_slots": ["Tuesday 2pm", "Wednesday 10am"]})
    elif name == "book_appointment":
        return json.dumps({"confirmed": True, "booking_id": "BK-12345"})
    return json.dumps({"error": "Unknown tool"})

# Tests for webhook handler
webhook_client = TestClient(app)

def test_webhook_handles_tool_call():
    """Webhook should execute tool calls and return results"""
    tool_call_event = {
        "message": {
            "type": "tool-calls",
            "toolCallList": [
                {
                    "id": "call_abc123",
                    "function": {
                        "name": "check_availability",
                        "arguments": {"date": "next Tuesday"}
                    }
                }
            ]
        }
    }
    
    response = webhook_client.post("/webhook/vapi", json=tool_call_event)
    
    assert response.status_code == 200
    data = response.json()
    assert "results" in data
    assert len(data["results"]) == 1
    assert data["results"][0]["toolCallId"] == "call_abc123"
    
    # Verify the result is valid JSON with available slots
    result = json.loads(data["results"][0]["result"])
    assert "available_slots" in result

def test_webhook_handles_call_ended_event():
    """Webhook should process call end without errors"""
    call_ended_event = {
        "message": {
            "type": "end-of-call-report",
            "call": {
                "id": "call-test-001",
                "status": "ended",
                "endedReason": "customer-ended-call",
                "durationSeconds": 120
            },
            "summary": "Customer booked appointment for Tuesday 2pm",
            "transcript": "User: Hello\nAssistant: Hi! How can I help?"
        }
    }
    
    response = webhook_client.post("/webhook/vapi", json=call_ended_event)
    assert response.status_code == 200

def test_webhook_rejects_invalid_payload():
    """Malformed webhook payloads should return 400"""
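    # Send a raw, non-JSON body; `content=` is the parameter for raw bodies
    # in the httpx-based TestClient
    response = webhook_client.post(
        "/webhook/vapi",
        content="not json",
        headers={"content-type": "application/json"},
    )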
    assert response.status_code in [400, 422]

def test_webhook_handles_unknown_tool_gracefully():
    """Unknown tool calls should return error result, not 500"""
    event = {
        "message": {
            "type": "tool-calls",
            "toolCallList": [
                {
                    "id": "call_xyz",
                    "function": {
                        "name": "nonexistent_tool",
                        "arguments": {}
                    }
                }
            ]
        }
    }
    
    response = webhook_client.post("/webhook/vapi", json=event)
    assert response.status_code == 200  # Webhook should not crash
    data = response.json()
    result = json.loads(data["results"][0]["result"])
    assert "error" in result
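
The tests above pass `arguments` as a dictionary. OpenAI-style tool calls carry arguments as a JSON-encoded string, so a handler may see either shape depending on provider and configuration; which one your Vapi setup delivers is an assumption worth checking against a real payload. A small normalizer, sketched below under the hypothetical name `normalize_tool_arguments`, lets the handler accept both and fail loudly on anything else.

```python
import json
from typing import Any


def normalize_tool_arguments(raw: Any) -> dict:
    """Accept a dict or a JSON-encoded object string and return a dict."""
    if raw is None or raw == "":
        return {}
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"tool arguments are not valid JSON: {exc}") from exc
        if not isinstance(parsed, dict):
            raise ValueError("tool arguments must decode to a JSON object")
        return parsed
    raise ValueError(f"unsupported arguments type: {type(raw).__name__}")


# Both shapes produce the same dictionary.
from_dict = normalize_tool_arguments({"date": "next Tuesday"})
from_string = normalize_tool_arguments('{"date": "next Tuesday"}')
```

Calling this at the top of `handle_tool_call` and returning an error result on `ValueError` keeps malformed arguments from surfacing as a 500 in the middle of a call.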

Testing Conversation Flows End-to-End

Test complete conversation scenarios using Vapi's chat API:

def run_conversation_flow(turns: list[str], agent_id: str) -> list[dict]:
    """Run a multi-turn conversation and return all responses"""
    responses = []
    session_id = f"test-session-{int(time.time())}"
    
    for user_message in turns:
        response = chat_with_assistant(agent_id, [
            {"role": "user", "content": user_message}
        ], session_id=session_id)
        responses.append(response)
        time.sleep(0.5)  # Small delay between turns
    
    return responses

def test_complete_booking_flow():
    """Test the full appointment booking conversation"""
    turns = [
        "Hi, I need to book an appointment",
        "I'm available next Tuesday afternoon",
        "2pm works great for me",
        "My name is John Smith, phone 555-0100"
    ]
    
    responses = run_conversation_flow(turns, AGENT_ID)
    
    assert len(responses) == 4
    
    # Turn 1: Should ask what kind of appointment
    turn1_output = responses[0].get("output", "").lower()
    assert any(word in turn1_output for word in ["what", "type", "appointment", "help"])
    
    # Turn 4 (confirmation): Should confirm booking
    turn4_output = responses[3].get("output", "").lower()
    assert any(word in turn4_output for word in ["confirmed", "booked", "scheduled", "appointment"])

def test_cancellation_flow():
    """Test that the agent can handle a cancellation request"""
    turns = [
        "I need to cancel my appointment",
        "My booking ID is BK-99123",
        "Yes, please cancel it"
    ]
    
    responses = run_conversation_flow(turns, AGENT_ID)
    
    # Final response should confirm cancellation
    final_output = responses[-1].get("output", "").lower()
    assert any(word in final_output for word in ["cancelled", "canceled", "confirmed", "done"])
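
As more flows are added, the per-turn keyword assertions tend to repeat. One option is to express the expectations as data and report every failing turn at once. This is only a sketch: `find_flow_failures` is a hypothetical helper, and it assumes each response is a dictionary with a text `output` field, as in the tests above.

```python
def find_flow_failures(
    responses: list[dict],
    expectations: dict[int, list[str]],
) -> list[str]:
    """Check that each listed turn (0-based) contains at least one keyword.

    Returns one message per failing turn; an empty list means the flow passed.
    """
    failures = []
    for turn_index, keywords in sorted(expectations.items()):
        if turn_index >= len(responses):
            failures.append(f"turn {turn_index + 1}: no response recorded")
            continue
        output = str(responses[turn_index].get("output", "")).lower()
        if not any(keyword.lower() in output for keyword in keywords):
            failures.append(
                f"turn {turn_index + 1}: expected one of {keywords}, got: {output[:80]!r}"
            )
    return failures


# Example with canned responses standing in for run_conversation_flow() output.
sample_responses = [
    {"output": "Sure, what type of appointment do you need?"},
    {"output": "I have 2pm or 4pm on Tuesday."},
    {"output": "Great, you're booked for Tuesday at 2pm."},
]
sample_failures = find_flow_failures(
    sample_responses,
    {0: ["what", "type"], 2: ["confirmed", "booked", "scheduled"]},
)
```

A test then reduces to `assert not failures, failures`, which prints every failing turn rather than stopping at the first.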

Latency Testing

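Voice AI has strict latency requirements: response delays above 1.5-2 seconds feel broken. The chat API round trip measured below excludes speech recognition and speech synthesis, so treat it as a lower bound on what a caller actually experiences: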

import statistics

def measure_chat_latency(message: str, agent_id: str) -> float:
    """Measure response time for a single chat message in ms"""
    start = time.perf_counter()
    chat_with_assistant(agent_id, [{"role": "user", "content": message}])
    return (time.perf_counter() - start) * 1000

@pytest.mark.performance
def test_chat_response_latency():
    """Response latency should meet voice UX requirements"""
    LATENCY_SLO_MS = 2000  # 2 seconds max for voice
    
    test_messages = [
        "Hello",
        "I need to book an appointment",
        "What times are available tomorrow?",
        "Can I reschedule?",
        "Thank you, goodbye"
    ]
    
    latencies = []
    for msg in test_messages:
        latency = measure_chat_latency(msg, AGENT_ID)
        latencies.append(latency)
        print(f"'{msg[:30]}' → {latency:.0f}ms")
    
    avg = statistics.mean(latencies)
    p95 = sorted(latencies)[int(len(latencies) * 0.95)]
    
    print(f"\nAverage: {avg:.0f}ms, P95: {p95:.0f}ms")
    
    assert avg < LATENCY_SLO_MS, f"Average latency {avg:.0f}ms exceeds {LATENCY_SLO_MS}ms SLO"
    assert p95 < LATENCY_SLO_MS * 1.5, f"P95 latency {p95:.0f}ms too high"
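
The test above measures total response time. The takeaways recommend time-to-first-audio instead, which needs a streaming source. The sketch below shows the measurement on any iterable of byte chunks; the `fake_audio_stream` generator is a stand-in with made-up delays, and wiring this to a real streaming endpoint (for example, iterating over a streamed HTTP response) is left as an assumption about your setup.

```python
import time
from typing import Iterable, Iterator


def time_to_first_chunk_ms(stream: Iterable[bytes]) -> tuple[float, float]:
    """Return (ms until the first non-empty chunk, ms until the stream ends)."""
    start = time.perf_counter()
    first_ms = None
    for chunk in stream:
        if chunk and first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000
    total_ms = (time.perf_counter() - start) * 1000
    if first_ms is None:
        raise ValueError("stream produced no data")
    return first_ms, total_ms


def fake_audio_stream(
    first_delay_s: float, chunk_delay_s: float, chunks: int = 3
) -> Iterator[bytes]:
    """Stand-in for a streaming response: waits, then yields silent audio frames."""
    time.sleep(first_delay_s)
    yield b"\x00" * 160
    for _ in range(chunks - 1):
        time.sleep(chunk_delay_s)
        yield b"\x00" * 160


first_ms, total_ms = time_to_first_chunk_ms(fake_audio_stream(0.05, 0.02))
```

Asserting on `first_ms` rather than `total_ms` matches what a caller perceives: the silence before the agent starts speaking, not the length of the reply.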

CI Pipeline for Voice Agent Tests

# .github/workflows/voice-agent-tests.yml
name: Voice Agent Tests

on:
  push:
    branches: [main]
    paths:
      - 'voice-agent/**'
      - 'webhooks/**'
  pull_request:
    paths:
      - 'voice-agent/**'
      - 'webhooks/**'

jobs:
  webhook-tests:
    # Fast, no external calls, run on every PR
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest fastapi httpx
      - run: pytest tests/webhooks/ -v

  agent-config-tests:
    # Validate agent config, read-only API calls
    runs-on: ubuntu-latest
    env:
      VAPI_API_KEY: ${{ secrets.VAPI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest requests
      - run: pytest tests/agent_config/ -v

  conversation-flow-tests:
    # Full conversation tests, run on main only
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    env:
      VAPI_API_KEY: ${{ secrets.VAPI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest requests
      - run: pytest tests/conversation_flows/ -v -m "not voice_e2e"
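
The `-m` filters rely on the `voice_e2e` and `performance` markers used earlier. pytest warns about unregistered markers, so it is worth declaring them once. A minimal sketch using a `conftest.py` hook follows; the marker descriptions are illustrative wording.

```python
# conftest.py
# Registers the custom markers used by the voice agent tests so that
# `pytest -m voice_e2e` and `-m "not voice_e2e"` select tests without warnings.

CUSTOM_MARKERS = [
    "voice_e2e: places real phone calls and incurs telephony cost",
    "performance: measures latency against live APIs",
]


def pytest_configure(config):
    """pytest hook: called once at startup with the active config object."""
    for marker in CUSTOM_MARKERS:
        config.addinivalue_line("markers", marker)
```

Running pytest with `--strict-markers` then turns a misspelled marker into an error instead of a silently unselected test.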

End-to-End Monitoring

After deployment, monitor your voice agent continuously. Key metrics:

  • Call completion rate: % of calls that end cleanly vs. dropped
  • Tool call success rate: % of tool calls that return valid results
  • First response latency: Time to first audio byte
  • Conversation completion rate: % of calls that achieve the goal
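
The rate metrics above can be computed from whatever call records you store. The sketch below assumes a hypothetical record shape (`ended_cleanly`, `goal_achieved`, and a `tool_calls` list with an `ok` flag); none of these field names come from a provider API, so map them from your own end-of-call data.

```python
from typing import Optional


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Fraction in [0, 1], or None when there is nothing to measure."""
    return None if denominator == 0 else numerator / denominator


def summarize_calls(calls: list[dict]) -> dict:
    """Aggregate per-call records (hypothetical shape) into monitoring rates."""
    tool_calls = [tc for call in calls for tc in call.get("tool_calls", [])]
    return {
        "call_completion_rate": rate(
            sum(1 for c in calls if c.get("ended_cleanly")), len(calls)
        ),
        "tool_call_success_rate": rate(
            sum(1 for tc in tool_calls if tc.get("ok")), len(tool_calls)
        ),
        "conversation_completion_rate": rate(
            sum(1 for c in calls if c.get("goal_achieved")), len(calls)
        ),
    }


# Example with four made-up call records.
sample_calls = [
    {"ended_cleanly": True, "goal_achieved": True, "tool_calls": [{"ok": True}, {"ok": True}]},
    {"ended_cleanly": True, "goal_achieved": False, "tool_calls": [{"ok": False}]},
    {"ended_cleanly": True, "goal_achieved": True, "tool_calls": [{"ok": True}, {"ok": True}]},
    {"ended_cleanly": False, "goal_achieved": False, "tool_calls": []},
]
summary = summarize_calls(sample_calls)
```

Returning `None` for an empty window keeps a quiet hour from being reported as a 0% success rate.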

For automated end-to-end voice agent monitoring — testing that your full call flow still works after code changes — HelpMeTest can run scheduled health checks that simulate the webhook interactions your voice agent depends on, alerting you when tool call handlers break before users call in.

Key Takeaways

Voice AI testing breaks down into three kinds of tests: configuration validation (is the agent set up correctly?), webhook testing (does your server handle call events?), and conversation flow testing (does the agent handle real scenarios?). Always cover webhooks with unit tests; they give the highest return of any testing investment for voice agents. Use the chat API to test conversation flows without paying for real calls. Add latency tests, because delays of 2 seconds or more are immediately noticeable on a phone call. Run webhook tests on every PR and conversation flow tests on main.
