DeepEval Tutorial: Unit Testing for LLMs

DeepEval brings unit testing discipline to LLM applications. Write assertions on model outputs the same way you write pytest assertions on function return values, using G-Eval metrics, faithfulness checks, and hallucination detection. This tutorial walks you from installation to a CI-integrated test suite.


Why LLM Applications Need Unit Tests

You'd never ship a REST API without testing that it returns the right status codes. Yet teams ship LLM features tested only by "I tried it a few times and it seemed fine."

The problem: LLM outputs are stochastic. A prompt that works today may silently degrade after a model update, a system prompt change, or a context window expansion. You need reproducible, automated assertions.

DeepEval is the pytest of LLM testing. It gives you:

  • Metric-based assertions — define what "correct" means in measurable terms
  • Pytest integration — run LLM tests with pytest, get familiar output
  • Dataset-driven evaluation — test against golden datasets, not just one-off examples
  • CI-ready — fail your pipeline when output quality drops

Installation

pip install deepeval

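Optionally, log in to Confident AI, DeepEval's hosted platform for storing test results (credentials are saved locally):

deepeval login

By default DeepEval uses an OpenAI model as the judge (which model depends on your DeepEval version), so set OPENAI_API_KEY in your environment before running tests. You can swap the judge out; more on that below.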


Core Concepts

Before writing tests, understand the building blocks.

LLMTestCase

The fundamental unit. Contains the input, actual output, and optional context:

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris",          # optional
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."]  # for RAG
)

Metrics

Metrics evaluate an LLMTestCase and return a score plus a reason. Built-in metrics:

Metric                      What it measures
AnswerRelevancyMetric       Does the answer address the question?
FaithfulnessMetric          Does the answer stick to the retrieved context?
ContextualPrecisionMetric   Are relevant retrieved chunks ranked above irrelevant ones?
ContextualRecallMetric      Did retrieval capture what was needed?
HallucinationMetric         Does the output contradict the context?
GEval                       Custom criteria, LLM-judged
ToxicityMetric              Harmful content detection
BiasMetric                  Bias in outputs
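
To make the score-plus-reason idea concrete, here is a minimal sketch of how a thresholded result turns into a pass or fail. MetricResult and passes are hypothetical names for illustration, not DeepEval classes; in DeepEval the score and reason come from the LLM judge, and this sketch assumes a "higher is better" metric.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for a metric's output; not a DeepEval class."""
    name: str
    score: float      # 0.0 to 1.0, produced by the judge
    threshold: float  # minimum acceptable score
    reason: str       # the judge's explanation

    def passes(self) -> bool:
        # Assumes higher is better: pass at or above the threshold.
        return self.score >= self.threshold


result = MetricResult(
    name="AnswerRelevancy",
    score=0.92,
    threshold=0.7,
    reason="The answer directly names the capital.",
)
verdict = "pass" if result.passes() else "fail"
print(f"{result.name}: {result.score:.2f} (threshold {result.threshold:.2f}) {verdict}")
```

The reason string is what makes a failing score actionable: it says which part of the output the judge objected to.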

Your First DeepEval Test

Create test_qa.py:

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# The function under test — your LLM call
def ask_llm(question: str, context: list[str]) -> str:
    # Replace with your actual LLM call
    import openai
    client = openai.OpenAI()
    context_str = "\n".join(context)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based only on this context:\n{context_str}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content


def test_answer_relevancy():
    context = [
        "HelpMeTest is a cloud-hosted SaaS platform for automated testing.",
        "It supports Robot Framework with Playwright for browser automation.",
        "Pricing starts at $100/month for the Pro plan."
    ]
    
    test_case = LLMTestCase(
        input="What testing frameworks does HelpMeTest support?",
        actual_output=ask_llm("What testing frameworks does HelpMeTest support?", context),
        retrieval_context=context
    )
    
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8)
    ])

Run it:

pytest test_qa.py -v

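DeepEval hooks into pytest through a plugin, evaluates each metric with the LLM judge, and reports scores with reasons. DeepEval also ships its own runner, deepeval test run test_qa.py, which wraps pytest and is the invocation its documentation recommends. The output looks roughly like this: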

PASSED test_qa.py::test_answer_relevancy
  AnswerRelevancyMetric: 0.92 (threshold: 0.70) ✓
  FaithfulnessMetric: 0.88 (threshold: 0.80) ✓

G-Eval: Custom Criteria

For domain-specific quality criteria, use GEval. Define your evaluation in plain English:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Evaluate whether the output is appropriately concise
conciseness_metric = GEval(
    name="Conciseness",
    criteria="The output should be direct and avoid unnecessary filler words. A one-sentence answer for a simple factual question is ideal.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.7
)

# Evaluate tone
professional_tone_metric = GEval(
    name="Professional Tone",
    criteria="The output should be professional, neutral, and appropriate for a business context. No casual language, slang, or excessive enthusiasm.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.8
)

Use in tests:

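def test_response_quality():
    # Context for the model under test (same pricing fact as in test_qa.py)
    context = ["Pricing starts at $100/month for the Pro plan."]
    question = "What's HelpMeTest's pricing?"

    test_case = LLMTestCase(
        input=question,
        actual_output=ask_llm(question, context)
    )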
    
    assert_test(test_case, [conciseness_metric, professional_tone_metric])

RAG Evaluation: The Full Pipeline

RAG applications have two failure modes: bad retrieval and bad generation. DeepEval covers both.

from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
    AnswerRelevancyMetric
)

def test_rag_pipeline():
    question = "How does HelpMeTest handle visual testing?"
    
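    # Placeholder: replace with your retriever; it should return a list[str] of chunks
    retrieved_chunks = retrieve_from_vector_db(question)
    
    # Placeholder: replace with your generator; it should return the answer string
    answer = generate_answer(question, retrieved_chunks)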
    
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        expected_output="HelpMeTest uses AI-powered visual flaw detection with baseline comparison across mobile, tablet, and desktop viewports.",
        retrieval_context=retrieved_chunks
    )
    
    assert_test(test_case, [
        # Retrieval quality
        ContextualPrecisionMetric(threshold=0.7),   # relevant chunks rank above irrelevant ones
        ContextualRecallMetric(threshold=0.7),       # retrieved chunks cover the expected output
        ContextualRelevancyMetric(threshold=0.7),    # retrieved chunks relate to the question
        
        # Generation quality
        FaithfulnessMetric(threshold=0.8),           # answer grounded in context
        AnswerRelevancyMetric(threshold=0.8)         # answer addresses the question
    ])

This gives you a complete picture: if the contextual metrics are low, your retriever is returning junk or ranking it above the useful chunks. If faithfulness is low, your generator hallucinates.
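
As a rough illustration of that diagnosis, the sketch below maps failing metric scores to the pipeline stage most likely at fault. The diagnose helper, the metric name strings, and the sample scores are all hypothetical; DeepEval does not provide this function.

```python
RETRIEVAL_METRICS = {"ContextualPrecision", "ContextualRecall", "ContextualRelevancy"}
GENERATION_METRICS = {"Faithfulness", "AnswerRelevancy"}


def diagnose(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the pipeline stages that have at least one metric below its threshold."""
    failing = {name for name, score in scores.items() if score < thresholds[name]}
    stages = []
    if failing & RETRIEVAL_METRICS:
        stages.append("retrieval")
    if failing & GENERATION_METRICS:
        stages.append("generation")
    return stages


# Hypothetical scores: retrieval ranked poorly, generation was fine.
scores = {
    "ContextualPrecision": 0.45,
    "ContextualRecall": 0.80,
    "Faithfulness": 0.91,
    "AnswerRelevancy": 0.88,
}
thresholds = {
    "ContextualPrecision": 0.7,
    "ContextualRecall": 0.7,
    "Faithfulness": 0.8,
    "AnswerRelevancy": 0.8,
}
print(diagnose(scores, thresholds))
```

When both stages fail, fix retrieval first: a generator given poor context will often score badly on generation metrics too.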


Hallucination Detection

from deepeval.metrics import HallucinationMetric

def test_no_hallucination():
    context = [
        "HelpMeTest Pro costs $100/month.",
        "The free plan allows up to 10 tests."
    ]
    
    test_case = LLMTestCase(
        input="How much does HelpMeTest cost?",
        actual_output="HelpMeTest offers a free plan with up to 10 tests, and a Pro plan at $100/month with unlimited tests.",
        context=context  # Note: HallucinationMetric uses 'context', not 'retrieval_context'
    )
    
    metric = HallucinationMetric(threshold=0.5)  # lower is better: passes when score <= 0.5
    assert_test(test_case, [metric])
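
Unlike the other metrics, the hallucination threshold is a ceiling rather than a floor. The sketch below assumes the score is the fraction of context entries that the output contradicts, which is how DeepEval's documentation describes it; hallucination_score and the verdict list are hypothetical stand-ins for what the judge produces.

```python
def hallucination_score(contradicted: list[bool]) -> float:
    """Fraction of context entries the output contradicts (lower is better)."""
    if not contradicted:
        raise ValueError("at least one context entry is required")
    return sum(contradicted) / len(contradicted)


# Hypothetical judge verdicts for the two context entries in the test above:
# the output agrees with both the $100/month price and the 10-test free plan.
score = hallucination_score([False, False])
threshold = 0.5
passed = score <= threshold  # pass when at or below the threshold
print(score, passed)
```

Under this definition, a claim that the context neither supports nor contradicts (such as "unlimited tests" in the example output) does not raise the score. Pair the metric with FaithfulnessMetric if unsupported claims should also fail the test.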

Dataset-Driven Evaluation

One-off tests aren't enough. Build golden datasets and evaluate against them:

from deepeval.dataset import EvaluationDataset, Golden

# Define golden examples
goldens = [
    Golden(
        input="What is HelpMeTest?",
        expected_output="HelpMeTest is a cloud-hosted SaaS platform for automated testing using Robot Framework and Playwright."
    ),
    Golden(
        input="Does HelpMeTest support self-hosting?",
        expected_output="No, HelpMeTest is a cloud-hosted SaaS. Self-hosting is not available."
    ),
    Golden(
        input="What monitoring intervals does HelpMeTest offer?",
        expected_output="HelpMeTest monitors every 5 minutes on the free plan and every 10 seconds on the Enterprise plan."
    ),
]

dataset = EvaluationDataset(goldens=goldens)

# Evaluate the dataset
@pytest.mark.parametrize("golden", dataset.goldens)
def test_golden_dataset(golden):
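    # Empty context keeps the example short; pass your real retrieved context here,
    # since ask_llm tells the model to answer only from the context it is given
    actual = ask_llm(golden.input, context=[])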
    
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual,
        expected_output=golden.expected_output
    )
    
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
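
Golden datasets tend to outgrow the test file. One option, sketched here with only the standard library, is to keep the records in a JSON file and validate them on load. The file name goldens.json and the load_goldens helper are assumptions for illustration, not part of DeepEval.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = {"input", "expected_output"}


def load_goldens(path: Path) -> list[dict[str, str]]:
    """Load golden records and check that each has the fields the tests rely on."""
    records = json.loads(path.read_text(encoding="utf-8"))
    for index, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"golden #{index} is missing {sorted(missing)}")
    return records


# Demo with a temporary file; a real suite would load a checked-in file instead.
sample = [
    {
        "input": "Does HelpMeTest support self-hosting?",
        "expected_output": "No, HelpMeTest is a cloud-hosted SaaS. Self-hosting is not available.",
    }
]
with tempfile.TemporaryDirectory() as tmp:
    golden_path = Path(tmp) / "goldens.json"
    golden_path.write_text(json.dumps(sample), encoding="utf-8")
    records = load_goldens(golden_path)

print(len(records), records[0]["input"])
```

Each validated record can then be turned into a Golden with Golden(**record) and passed to EvaluationDataset as in the example above.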

Using a Local Model as Judge

By default, DeepEval uses an OpenAI model to judge outputs. For cost control or privacy, use a local model:

from deepeval.models import DeepEvalBaseLLM
from ollama import Client

class OllamaJudge(DeepEvalBaseLLM):
    def __init__(self, model="llama3.1:8b"):
        self.model = model
        self.client = Client()
    
    def load_model(self):
        return self.client
    
    def generate(self, prompt: str) -> str:
        response = self.client.generate(
            model=self.model,
            prompt=prompt
        )
        return response["response"]
    
    async def a_generate(self, prompt: str) -> str:
        # Simple fallback: reuses the synchronous call, which blocks the event loop
        return self.generate(prompt)
    
    def get_model_name(self):
        return f"ollama/{self.model}"

# Use it in metrics
judge = OllamaJudge()
metric = AnswerRelevancyMetric(threshold=0.7, model=judge)
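
The a_generate method above reuses the synchronous generate call, so concurrent evaluations still run one at a time. A minimal sketch of one way around that, using asyncio.to_thread from the standard library (Python 3.9+), is shown below. BlockingJudge is a hypothetical stand-in with a simulated delay, not the Ollama client.

```python
import asyncio
import time


class BlockingJudge:
    """Hypothetical judge whose generate() blocks, like a synchronous HTTP client."""

    def generate(self, prompt: str) -> str:
        time.sleep(0.1)  # stands in for a slow model call
        return f"verdict for: {prompt}"

    async def a_generate(self, prompt: str) -> str:
        # Run the blocking call in a worker thread so other coroutines keep running.
        return await asyncio.to_thread(self.generate, prompt)


async def main() -> list[str]:
    judge = BlockingJudge()
    # The three calls overlap instead of running back to back.
    results = await asyncio.gather(*(judge.a_generate(f"case {i}") for i in range(3)))
    return list(results)


results = asyncio.run(main())
print(results)
```

The same pattern would drop into OllamaJudge.a_generate unchanged. The ollama package also provides an AsyncClient, which may be a cleaner fit if you prefer native async calls.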

Synthesizing Test Cases

Don't have a golden dataset? Generate one from your documentation:

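from deepeval.dataset import EvaluationDataset
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# Generate goldens from your documents.
# Note: the name of the per-document limit parameter has changed across
# DeepEval releases; check the signature in your installed version.
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/helpmetest-guide.pdf", "docs/api-reference.md"],
    max_goldens_per_document=10
)

# Wrap the generated goldens in a dataset for use with pytest
dataset = EvaluationDataset(goldens=goldens)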
print(f"Generated {len(dataset.goldens)} test cases")

The synthesizer uses your documents to produce realistic questions and expected answers — accelerating coverage on new features.


CI Integration

Add DeepEval to your CI pipeline so quality regressions fail the build.

GitHub Actions

name: LLM Quality Tests

on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      
      - name: Install dependencies
        run: pip install deepeval pytest
      
      - name: Run LLM tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: pytest tests/llm/ -v --tb=short

Handling Flakiness

LLM tests can flake due to model non-determinism. Use temperature=0 in your model under test, and set conservative thresholds with a buffer:

# Instead of threshold=0.8 (fails at 0.79)
# Use threshold=0.65 with a comment explaining the headroom
metric = FaithfulnessMetric(
    threshold=0.65,  # conservative — model consistently scores 0.78-0.92 in baseline
    include_reason=True
)
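
One way to pick that headroom deliberately rather than by feel is to derive the threshold from recorded baseline scores. This is a sketch under assumptions: threshold_from_baseline is a hypothetical helper, the scores are made-up values within the 0.78-0.92 range mentioned above, and the 0.13 margin is chosen to reproduce the 0.65 threshold.

```python
def threshold_from_baseline(scores: list[float], margin: float) -> float:
    """Lowest observed baseline score minus a safety margin, floored at 0."""
    if not scores:
        raise ValueError("need at least one baseline score")
    return round(max(0.0, min(scores) - margin), 2)


# Hypothetical faithfulness scores from five baseline runs of the same test case.
baseline = [0.78, 0.85, 0.92, 0.81, 0.88]
threshold = threshold_from_baseline(baseline, margin=0.13)
print(threshold)  # 0.65
```

Re-run the baseline after any model or prompt change. A threshold derived from stale scores either hides regressions or flags noise.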

Interpreting Failures

When a test fails, DeepEval explains why:

FAILED test_qa.py::test_faithfulness
  FaithfulnessMetric: 0.41 (threshold: 0.80) ✗
  
  Reason: The actual output claims HelpMeTest supports self-hosted deployments, 
  but the retrieval context only mentions cloud-hosted SaaS. This claim is not 
  supported by the provided context and constitutes a hallucination.
  
  Statements that are not faithful:
  - "You can deploy HelpMeTest on your own infrastructure"

This tells you exactly what failed and why — far more useful than a generic assertion error.


Putting It All Together

A production-grade test file for a RAG chatbot:

# tests/llm/test_chatbot.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval
)

from myapp.chatbot import answer_question, retrieve_context

ANSWER_RELEVANCY = AnswerRelevancyMetric(threshold=0.7, include_reason=True)
FAITHFULNESS = FaithfulnessMetric(threshold=0.75, include_reason=True)
HALLUCINATION = HallucinationMetric(threshold=0.4)

CONCISE = GEval(
    name="Conciseness",
    criteria="The answer is direct and does not pad with unnecessary text.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7
)

@pytest.fixture
def qa_case():
    def _build(question):
        context = retrieve_context(question)
        answer = answer_question(question, context)
        return LLMTestCase(
            input=question,
            actual_output=answer,
            retrieval_context=context,
            context=context  # HallucinationMetric reads 'context', not 'retrieval_context'
        )
    return _build

def test_pricing_question(qa_case):
    assert_test(qa_case("What does HelpMeTest Pro cost?"), [
        ANSWER_RELEVANCY, FAITHFULNESS, CONCISE
    ])

def test_feature_question(qa_case):
    assert_test(qa_case("Does HelpMeTest support visual testing?"), [
        ANSWER_RELEVANCY, FAITHFULNESS, HALLUCINATION
    ])

def test_negative_question(qa_case):
    """Model should answer 'no' without inventing features."""
    assert_test(qa_case("Can I self-host HelpMeTest?"), [
        FAITHFULNESS, HALLUCINATION
    ])

DeepEval vs Manual Evaluation

Aspect                 Manual review        DeepEval
Speed                  Hours per batch      Seconds
Consistency            Varies by reviewer   Deterministic thresholds
Coverage               Sample only          Every test, every build
Regression detection   None                 Automatic
Cost                   Engineer time        ~$0.01–0.05 per test case

For teams shipping LLM features, DeepEval's cost-per-test is trivially small compared to the cost of a quality regression reaching users.
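
To put the per-test figure in context, here is a back-of-the-envelope estimate of monthly judge cost. The suite size and build frequency are hypothetical; only the $0.01-0.05 range comes from the table above.

```python
def monthly_cost(test_cases: int, builds_per_day: int, cost_per_case: float, days: int = 30) -> float:
    """Estimated judge spend: every test case is evaluated on every build."""
    return round(test_cases * builds_per_day * days * cost_per_case, 2)


# Hypothetical project: 20 test cases, 5 CI builds per day.
low = monthly_cost(20, 5, 0.01)
high = monthly_cost(20, 5, 0.05)
print(f"${low:.2f} to ${high:.2f} per month")  # $30.00 to $150.00 per month
```

Cost scales linearly with suite size and build frequency, so larger suites may want to run the full set on merges only and a smaller subset on every push.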


Next Steps

  • Add to CI now — even three test cases with conservative thresholds will catch regressions
  • Build your golden dataset — start with 10 representative questions, grow from there
  • Try the synthesizer — generate coverage from your existing docs
  • Explore Ragas for deeper RAG pipeline metrics including context precision at rank k

For continuous monitoring beyond unit tests — including scheduled runs and alerting — HelpMeTest runs your evaluation suites on a schedule and alerts when scores drop below threshold.
