LLM API Testing Best Practices: How to Test AI Outputs Reliably

Testing LLM-powered applications is one of the most misunderstood areas of software quality. Traditional testing relies on determinism — the same input always produces the same output. LLMs don't work that way. This guide covers the patterns that actually work.

The Core Challenge: Testing Non-Determinism

When you test 2 + 2, you expect 4. When you ask GPT-4 "What's the capital of France?", you might get "Paris", "The capital of France is Paris.", "Paris is the capital of France.", or — rarely — something else entirely.

This creates a fundamental mismatch: most testing frameworks assume determinism.

The solution is to shift from exact matching to behavioral assertions. You're not testing what the LLM said; you're testing that your application did the right thing given the LLM's output.
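
As a minimal sketch of that shift, the snippet below compares an exact-match check with a behavioral one over the three phrasings mentioned earlier. The helper name `answer_mentions_paris` is made up for illustration and is not part of any library.

```python
def answer_mentions_paris(llm_output: str) -> bool:
    """Behavioral check: accept any phrasing that names Paris."""
    return "paris" in llm_output.lower()


# Three phrasings a model might plausibly return for the same question.
candidate_outputs = [
    "Paris",
    "The capital of France is Paris.",
    "Paris is the capital of France.",
]

# An exact-match assertion accepts only the first phrasing.
exact_matches = [out == "Paris" for out in candidate_outputs]

# The behavioral check accepts all three.
behavioral_matches = [answer_mentions_paris(out) for out in candidate_outputs]
```

A substring check is the crudest possible behavioral assertion; the later sections use stricter ones such as parsed categories and length bounds.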

Testing Layers

Good LLM application testing has three distinct layers:

Layer 1: Unit Tests (no LLM calls)
  └── Test your logic, prompt structure, parsing

Layer 2: Contract Tests (mocked LLM, your real code)
  └── Test how your code handles various LLM outputs

Layer 3: Integration Tests (real LLM, your real code)
  └── Test end-to-end behavior with the actual model

Most teams only have Layer 3. That's why their tests are slow, expensive, and flaky.

Layer 1: Unit Tests

Everything testable without calling an LLM should be tested without calling an LLM.

What belongs in unit tests:

  • Prompt construction
  • Input validation
  • Output parsing (given mock LLM output)
  • Error handling
  • Token counting / length checks
  • Business logic that uses LLM output

# tests/test_prompt_builder.py
from src.prompt_builder import PromptBuilder

class TestPromptBuilder:
    def test_includes_user_context(self):
        builder = PromptBuilder()
        prompt = builder.build_support_prompt(
            user_name="Alice",
            issue="My order hasn't arrived",
            order_id="ORD-12345"
        )
        assert "Alice" in prompt
        assert "ORD-12345" in prompt
        assert "order" in prompt.lower()

    def test_system_prompt_has_guardrails(self):
        builder = PromptBuilder()
        system = builder.get_system_prompt("support")
        assert "refund" in system.lower()  # Support agent knows refund policy
        assert "competitor" not in system.lower()  # Shouldn't mention competitors

    def test_prompt_within_token_budget(self):
        builder = PromptBuilder()
        prompt = builder.build_rag_prompt(
            query="What is your pricing?",
            context_chunks=["chunk " * 100] * 10  # Large context
        )
        # Rough token estimate: 4 chars ≈ 1 token
        estimated_tokens = len(prompt) / 4
        assert estimated_tokens < 4000  # Under context limit
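
The token-budget test above only checks the result. One way the builder might stay under budget is sketched below, using the same rough 4-characters-per-token heuristic. The function names `estimate_tokens` and `fit_chunks_to_budget` are hypothetical, and a real implementation would likely use the model's own tokenizer.

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count, rounding up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fit_chunks_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks in order until adding the next one would exceed the budget."""
    kept = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


# Ten 600-character chunks, as in the test above (about 150 tokens each).
context_chunks = ["chunk " * 100] * 10
trimmed = fit_chunks_to_budget(context_chunks, max_tokens=400)
```

Because this is pure string logic, it belongs in Layer 1 and can be tested without any model call.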

Layer 2: Contract Tests

Contract tests verify how your application code handles different types of LLM outputs — without actually calling the LLM. You define the contract of what the LLM might return and test all variations.

# tests/test_output_parser.py
import pytest
from src.output_parser import parse_classification_response

VALID_CLASSIFICATIONS = [
    # Exact match
    ("positive", "positive"),
    ("POSITIVE", "positive"),
    ("Positive", "positive"),
    # With extra text
    ("The sentiment is positive.", "positive"),
    ("Based on the text, I'd classify this as: positive", "positive"),
    # Negative variations
    ("negative", "negative"),
    ("This text expresses a negative sentiment.", "negative"),
    # Neutral
    ("neutral", "neutral"),
]

INVALID_CLASSIFICATIONS = [
    "",
    "I'm not sure",
    "ambiguous",
    "The text could be either positive or negative",
]

@pytest.mark.parametrize("raw_output,expected", VALID_CLASSIFICATIONS)
def test_parses_valid_classification(raw_output, expected):
    result = parse_classification_response(raw_output)
    assert result == expected

@pytest.mark.parametrize("raw_output", INVALID_CLASSIFICATIONS)
def test_handles_invalid_classification(raw_output):
    with pytest.raises(ValueError):
        parse_classification_response(raw_output)

This approach gives you confidence that your parsing is robust to LLM output variations — without spending money on API calls.
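
The article does not show `parse_classification_response` itself. The following is one possible sketch that satisfies the cases listed above, assuming the parser should accept a reply only when it names exactly one of the three labels. The real implementation in `src/output_parser.py` may differ.

```python
import re

VALID_LABELS = ("positive", "negative", "neutral")


def parse_classification_response(raw_output: str) -> str:
    """Return the single label named in the output, or raise ValueError."""
    text = raw_output.lower()
    # Match whole words only, so "positively" does not count as "positive".
    found = {label for label in VALID_LABELS if re.search(rf"\b{label}\b", text)}
    if len(found) != 1:
        raise ValueError(
            f"Expected exactly one label, found {sorted(found)} in {raw_output!r}"
        )
    return found.pop()
```

This sketch does not understand negation: a reply such as "not positive" would still parse as "positive". If your model produces such replies, add them to the contract cases and tighten the parser.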

Layer 3: Integration Tests

Real LLM calls should be:

  • Slow by design — accept that they take 5-30 seconds
  • Run infrequently — not on every commit
  • Behavioral, not exact — check what matters, not word-for-word

Behavioral Assertions

# tests/test_integration.py
import pytest
import os

pytestmark = pytest.mark.skipif(
    os.environ.get("RUN_INTEGRATION_TESTS") != "true",
    reason="Gated: set RUN_INTEGRATION_TESTS=true"
)

class TestLLMBehavior:
    def test_summarizer_produces_shorter_text(self, llm_client):
        """Summary should be shorter than the original."""
        original = "Software testing is the process of evaluating software " * 50
        summary = llm_client.summarize(original)
        assert len(summary) < len(original) * 0.4

    def test_summarizer_covers_main_topic(self, llm_client):
        """Summary must mention the main topic."""
        original = "Machine learning models require large datasets for training..."
        summary = llm_client.summarize(original)
        assert any(word in summary.lower() for word in ["machine", "learning", "model", "data"])

    def test_classifier_consistency(self, llm_client):
        """Repeated calls should agree on the label (assumes llm_client uses temperature=0)."""
        text = "This is absolutely fantastic, I love it!"
        results = [llm_client.classify(text) for _ in range(3)]
        # temperature=0 makes outputs far more stable but not strictly deterministic,
        # so use a clear-cut input and compare labels rather than free text.
        assert len(set(results)) == 1

    def test_respects_max_length(self, llm_client):
        """Output must honor length constraints."""
        result = llm_client.generate_with_limit("Write a long story", max_words=50)
        word_count = len(result.split())
        assert word_count <= 60  # Allow 20% overage: models follow word limits only approximately

Property-Based Testing for LLM Outputs

Instead of testing specific outputs, test properties that should always hold:

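from hypothesis import HealthCheck, given, settings, strategies as st

# The fixtures below are assumed to be function-scoped. Hypothesis reuses one
# fixture instance across all generated examples and reports that through a
# health check, so the check is suppressed explicitly.
@settings(
    max_examples=20,
    deadline=60000,  # milliseconds; each example may be slow
    suppress_health_check=[HealthCheck.function_scoped_fixture],
)
@given(
    text=st.text(min_size=20, max_size=500, alphabet=st.characters(whitelist_categories=["L", "Z"]))
)
def test_classifier_always_returns_valid_category(text, llm_client_mock):
    """Classifier must always return one of the valid categories."""
    valid_categories = {"positive", "negative", "neutral"}
    # With mock, exercise your parsing/validation code
    result = llm_client_mock.classify(text)
    assert result in valid_categories

# This test calls the real model once per example, so keep max_examples small.
@settings(
    max_examples=10,
    deadline=120000,
    suppress_health_check=[HealthCheck.function_scoped_fixture],
)
@given(
    length=st.integers(min_value=100, max_value=1000)
)
def test_summary_always_shorter_than_input(length, llm_client):
    """For any input, summary must be shorter."""
    text = "word " * length
    summary = llm_client.summarize(text)
    assert len(summary) < len(text)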

Testing Prompt Engineering

Prompt changes are code changes. Treat them that way.

Regression Test Suite for Prompts

# tests/test_prompt_regression.py
import pytest

# Golden test cases: input + expected behavior
REGRESSION_CASES = [
    {
        "id": "support-refund-001",
        "input": "I want to return my broken product and get my money back",
        "must_contain": ["refund", "return"],
        "must_not_contain": ["competitor", "lawsuit", "illegal"],
        "category": "refund_request"
    },
    {
        "id": "support-shipping-001",
        "input": "Where is my order? It's been 2 weeks",
        "must_contain": ["order", "track"],
        "must_not_contain": ["fault", "sorry we can't"],
        "category": "shipping_inquiry"
    },
    {
        "id": "support-technical-001",
        "input": "The app crashes when I try to upload a file",
        "must_contain": ["file", "upload"],
        "must_not_contain": ["not our problem"],
        "category": "technical_support"
    }
]

@pytest.mark.parametrize("case", REGRESSION_CASES, ids=lambda c: c["id"])
def test_support_response_regression(case, mock_support_bot):
    """Regression tests for support bot responses."""
    mock_support_bot.set_response_keywords(case["must_contain"])
    
    result = mock_support_bot.handle(case["input"])
    
    for keyword in case["must_contain"]:
        assert keyword in result.text.lower(), f"Missing '{keyword}' in response"
    
    for banned in case["must_not_contain"]:
        assert banned not in result.text.lower(), f"Banned term '{banned}' found in response"
    
    assert result.category == case["category"]

Evaluating LLM Output Quality

For production LLM applications, you often need to measure output quality over time. Common approaches:

LLM-as-Judge

Use a second LLM call to evaluate the output of the first:

def evaluate_response_quality(
    question: str,
    response: str,
    judge_client
) -> dict:
    """Use LLM to judge LLM output quality."""
    prompt = f"""Rate this customer support response on a scale of 1-5.

Question: {question}
Response: {response}

Score criteria:
- 5: Accurate, helpful, professional, addresses the question completely
- 3: Partially helpful, some gaps
- 1: Inaccurate, unhelpful, or unprofessional

Reply ONLY with a JSON object: {{"score": N, "reason": "one sentence"}}"""

    result = judge_client.generate_json(prompt)
    return result

class TestEvaluation:
    # judge_mock is assumed to be a unittest.mock.Mock-style fixture, so the canned
    # verdict is set on generate_json, the method evaluate_response_quality calls.
    def test_high_quality_response_scores_well(self, judge_mock):
        judge_mock.generate_json.return_value = {"score": 4, "reason": "Clear and helpful"}
        response = "Your refund will be processed within 5-7 business days."
        score = evaluate_response_quality("When do I get my refund?", response, judge_mock)
        assert score["score"] >= 4

    def test_low_quality_response_scores_poorly(self, judge_mock):
        judge_mock.generate_json.return_value = {"score": 1, "reason": "Doesn't answer the question"}
        response = "Please contact our support team."
        score = evaluate_response_quality("When do I get my refund?", response, judge_mock)
        assert score["score"] <= 2
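
A judge is itself an LLM, so its reply deserves the same validation as any other model output. The sketch below assumes the judge client hands back raw text rather than a parsed dict; `parse_judge_verdict` is a hypothetical helper, not part of any client library.

```python
import json


def parse_judge_verdict(raw: str) -> dict:
    """Validate a judge reply of the form {"score": N, "reason": "..."}."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge reply is not valid JSON: {raw!r}") from exc
    if not isinstance(verdict, dict):
        raise ValueError("Judge reply must be a JSON object")

    score = verdict.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"Score must be an integer from 1 to 5, got {score!r}")

    reason = verdict.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("Judge reply must include a non-empty reason")

    return {"score": score, "reason": reason.strip()}
```

Like the classification parser, this function can be covered by contract tests with hand-written judge replies and no API calls.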

Semantic Similarity Scoring

For RAG and summarization, check semantic similarity instead of exact text:

import pytest
from sentence_transformers import SentenceTransformer, util

@pytest.fixture(scope="session")
def embedding_model():
    # Loaded once per test session; the model is downloaded on first use.
    return SentenceTransformer("all-MiniLM-L6-v2")

def test_summary_semantically_similar_to_source(embedding_model, llm_client_mock):
    """Summary should be semantically close to the original."""
    source = "Python is a high-level programming language known for readability."
    # Hard-coded stand-in for model output; in a real test this would come from the client.
    summary = "Python prioritizes code readability and is a popular programming language."
    
    embeddings = embedding_model.encode([source, summary])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    
    # 0.85 is an illustrative threshold; calibrate it on known-good and known-bad pairs.
    assert similarity > 0.85
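
To make the similarity score less of a black box, here is a plain-Python sketch of the cosine similarity that `util.cos_sim` computes, applied to toy vectors instead of real embeddings. The function name `cosine_similarity` is ours, for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score about 1, whatever their lengths.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors score 0.
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The score depends only on direction, not magnitude, which is why a short summary and a long source can still score high when they cover the same content.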

Handling Flaky LLM Tests

LLM integration tests can be flaky for reasons outside your control: rate limits, model updates, transient API errors. Strategies:

# Retry marker for flaky LLM tests (requires the pytest-rerunfailures plugin)
import pytest

def retry_on_api_error(max_retries=3, delay=5):
    """Return a marker that reruns a failing test up to max_retries times."""
    # Returning the marker itself, rather than a wrapper function, preserves the
    # test's signature so pytest can still inject fixtures such as llm_client.
    # Note that the rerun triggers on any failure, not only on API errors.
    return pytest.mark.flaky(reruns=max_retries, reruns_delay=delay)

@retry_on_api_error(max_retries=3)
def test_llm_api_handles_load(llm_client):
    results = []
    for _ in range(5):
        results.append(llm_client.classify("Positive review text"))
    assert all(r == "positive" for r in results)

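The flaky marker comes from the pytest-rerunfailures plugin. You can also apply reruns to a whole test file from the command line: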

pip install pytest-rerunfailures
pytest tests/test_integration.py --reruns 3 --reruns-delay 5
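
Rerunning a whole test also reruns genuine failures, which can hide a real regression. An alternative is to retry only the transient error inside the client call. The sketch below assumes a hypothetical `TransientAPIError`; in practice you would catch your provider SDK's rate-limit and timeout exceptions instead.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or timeout error."""


def call_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on TransientAPIError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the test
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists so tests can pass a recorder instead of waiting. A wrong answer from the model is not retried here, so an assertion failure still fails the test on the first run.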

Key Testing Principles

1. Don't test what the LLM says — test what your app does. Your job is to verify that given a plausible LLM output, your application behaves correctly. The LLM vendor tests the model; you test the integration.

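2. Use temperature=0 for more repeatable integration tests. Classification, extraction, and structured output tests should use temperature=0 to reduce flakiness. Hosted models are not guaranteed to be fully deterministic even at temperature=0, so keep the assertions behavioral.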

3. Separate prompt tests from business logic tests. Prompts can change without changing business logic, and vice versa. Keep them in separate test files.

4. Always test the error paths. The LLM will return unexpected output in production. Test what happens when it does: malformed JSON, wrong category names, empty responses, refused requests.

5. Build a golden dataset, expand it continuously. Start with 10 representative test cases. Add a new case every time you find a bug or edge case in production. Over time, you'll have comprehensive regression coverage.
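
The error paths named in principle 4 can be exercised against a small defensive parser. The sketch below assumes the model is asked to reply with a JSON object such as {"category": "refund_request"}. The function `categorize_safely` and the fallback label are hypothetical; the category names are taken from the regression cases earlier in this guide.

```python
import json

VALID_CATEGORIES = {"refund_request", "shipping_inquiry", "technical_support"}
FALLBACK_CATEGORY = "needs_human_review"  # hypothetical fallback label


def categorize_safely(raw_output: str) -> str:
    """Map a raw model reply to a known category, or to the fallback."""
    if not raw_output or not raw_output.strip():
        return FALLBACK_CATEGORY  # empty response
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return FALLBACK_CATEGORY  # malformed JSON or a plain-text refusal
    if not isinstance(payload, dict):
        return FALLBACK_CATEGORY  # valid JSON, wrong shape
    category = payload.get("category")
    if not isinstance(category, str) or category not in VALID_CATEGORIES:
        return FALLBACK_CATEGORY  # missing, non-string, or unknown category
    return category
```

Each branch corresponds to one parametrized contract test: feed in the bad reply and assert that the fallback comes out. Whether to fall back or raise is a product decision; the point is that every path is deliberate and tested.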

End-to-End Coverage with HelpMeTest

Unit and integration tests cover your API layer. But users interact with your LLM features through a UI — a chat interface, a document upload screen, a results page. Those paths need testing too.

HelpMeTest runs automated browser tests against your production application, catching issues that only appear in the full stack — UI state bugs, timeout mismatches, or prompt changes that break the user experience without breaking any unit test.

The combination of unit tests (fast, free) + integration tests (weekly, real model) + end-to-end tests (full stack, continuous) gives you comprehensive coverage of an LLM-powered application at every layer.
