DeepEval Tutorial: Unit Testing for LLMs
DeepEval brings unit testing discipline to LLM applications. Write assertions on model outputs the same way you write pytest assertions on function return values — with G-Eval metrics, faithfulness checks, and hallucination detection. This tutorial walks you from installation to CI-integrated test suite.
Why LLM Applications Need Unit Tests
You'd never ship a REST API without testing that it returns the right status codes. Yet teams ship LLM features tested only by "I tried it a few times and it seemed fine."
The problem: LLM outputs are stochastic. A prompt that works today may silently degrade after a model update, a system prompt change, or a context window expansion. You need reproducible, automated assertions.
DeepEval is the pytest of LLM testing. It gives you:
- Metric-based assertions: define what "correct" means in measurable terms
- Pytest integration: run LLM tests with pytest, get familiar output
- Dataset-driven evaluation: test against golden datasets, not just one-off examples
- CI-ready: fail your pipeline when output quality drops
Installation
pip install deepeval
Authenticate (stores credentials locally):
deepeval login
DeepEval uses GPT-4 as a judge by default, so the OPENAI_API_KEY environment variable must be set. You can swap the judge out; more on that below.
Core Concepts
Before writing tests, understand the building blocks.
LLMTestCase
The fundamental unit. Contains the input, actual output, and optional context:
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris",  # optional
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."],  # for RAG
)
Metrics
Metrics evaluate an LLMTestCase and return a score plus a reason. Built-in metrics:
| Metric | What it measures |
|---|---|
| AnswerRelevancyMetric | Does the answer address the question? |
| FaithfulnessMetric | Does the answer stick to the retrieved context? |
| ContextualPrecisionMetric | Are retrieved chunks relevant to the question? |
| ContextualRecallMetric | Did retrieval capture what was needed? |
| HallucinationMetric | Does the output contradict the context? |
| GEval | Custom criteria, LLM-judged |
| ToxicityMetric | Harmful content detection |
| BiasMetric | Bias in outputs |
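To make the "score plus a reason" contract concrete, here is a minimal sketch of a metric-shaped object in plain Python. `ToyTestCase` and `KeywordCoverageMetric` are hypothetical, deterministic stand-ins written for illustration; they are not part of DeepEval and do not call an LLM judge. They only mirror the general shape: measure a test case, store a score and a reason, and compare the score to a threshold.

```python
from dataclasses import dataclass


@dataclass
class ToyTestCase:
    """Hypothetical stand-in for a test case: just an input and an output."""
    input: str
    actual_output: str


class KeywordCoverageMetric:
    """Illustrative metric: fraction of required keywords found in the output."""

    def __init__(self, keywords: list[str], threshold: float = 0.7):
        self.keywords = [k.lower() for k in keywords]
        self.threshold = threshold
        self.score = 0.0
        self.reason = ""

    def measure(self, test_case: ToyTestCase) -> float:
        text = test_case.actual_output.lower()
        missing = [k for k in self.keywords if k not in text]
        found = len(self.keywords) - len(missing)
        # An empty keyword list is treated as trivially satisfied.
        self.score = found / len(self.keywords) if self.keywords else 1.0
        if missing:
            self.reason = f"Missing keywords: {', '.join(missing)}"
        else:
            self.reason = "All keywords present."
        return self.score

    def is_successful(self) -> bool:
        # Higher is better: pass when the score reaches the threshold.
        return self.score >= self.threshold


case = ToyTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)
metric = KeywordCoverageMetric(keywords=["Paris", "France"], threshold=0.7)
metric.measure(case)
print(metric.score, metric.is_successful(), metric.reason)
```

A real LLM-judged metric replaces the keyword check with a judge call, which is why its scores can vary between runs while this toy version cannot.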
Your First DeepEval Test
Create test_qa.py:
import openai
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# The function under test: your LLM call
def ask_llm(question: str, context: list[str]) -> str:
    # Replace with your actual LLM call
    client = openai.OpenAI()
    context_str = "\n".join(context)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based only on this context:\n{context_str}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

def test_answer_relevancy():
    context = [
        "HelpMeTest is a cloud-hosted SaaS platform for automated testing.",
        "It supports Robot Framework with Playwright for browser automation.",
        "Pricing starts at $100/month for the Pro plan.",
    ]
    test_case = LLMTestCase(
        input="What testing frameworks does HelpMeTest support?",
        actual_output=ask_llm("What testing frameworks does HelpMeTest support?", context),
        retrieval_context=context,
    )
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ])
Run it:
pytest test_qa.py -v
DeepEval intercepts pytest, evaluates metrics using an LLM judge, and reports scores with reasons:
PASSED test_qa.py::test_answer_relevancy
AnswerRelevancyMetric: 0.92 (threshold: 0.70) ✓
FaithfulnessMetric: 0.88 (threshold: 0.80) ✓
G-Eval: Custom Criteria
For domain-specific quality criteria, use GEval. Define your evaluation in plain English:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Evaluate whether the output is appropriately concise
conciseness_metric = GEval(
    name="Conciseness",
    criteria="The output should be direct and avoid unnecessary filler words. A one-sentence answer for a simple factual question is ideal.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.7,
)

# Evaluate tone
professional_tone_metric = GEval(
    name="Professional Tone",
    criteria="The output should be professional, neutral, and appropriate for a business context. No casual language, slang, or excessive enthusiasm.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.8,
)
Use in tests:
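def test_response_quality():
    # Context passed to ask_llm; defined here so the test is self-contained
    context = ["Pricing starts at $100/month for the Pro plan."]
    question = "What's HelpMeTest's pricing?"
    test_case = LLMTestCase(
        input=question,
        actual_output=ask_llm(question, context),
    )
    assert_test(test_case, [conciseness_metric, professional_tone_metric])
RAG Evaluation: The Full Pipeline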
RAG applications have two failure modes: bad retrieval and bad generation. DeepEval covers both.
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
    AnswerRelevancyMetric,
)

def test_rag_pipeline():
    question = "How does HelpMeTest handle visual testing?"
    # Your retriever (placeholder: supply your own function)
    retrieved_chunks = retrieve_from_vector_db(question)
    # Your generator (placeholder: supply your own function)
    answer = generate_answer(question, retrieved_chunks)
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        expected_output="HelpMeTest uses AI-powered visual flaw detection with baseline comparison across mobile, tablet, and desktop viewports.",
        retrieval_context=retrieved_chunks,
    )
    assert_test(test_case, [
        # Retrieval quality
        ContextualPrecisionMetric(threshold=0.7),  # retrieved chunks are relevant
        ContextualRecallMetric(threshold=0.7),  # relevant chunks were retrieved
        ContextualRelevancyMetric(threshold=0.7),  # chunks relate to the question
        # Generation quality
        FaithfulnessMetric(threshold=0.8),  # answer grounded in context
        AnswerRelevancyMetric(threshold=0.8),  # answer addresses the question
    ])
This gives you a complete picture: if precision is low, your retriever returns junk. If faithfulness is low, your generator hallucinates.
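That diagnostic reading can be sketched as a small triage helper. This is a plain-Python illustration, not a DeepEval feature: the `triage` function, the metric name strings, and the example scores are all hypothetical, and the thresholds simply reuse the values from the test above.

```python
# Hypothetical triage helper: maps low metric scores to the pipeline stage
# they usually implicate. Names and scores are illustrative only.
RETRIEVAL_METRICS = {"contextual_precision", "contextual_recall", "contextual_relevancy"}
GENERATION_METRICS = {"faithfulness", "answer_relevancy"}


def triage(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return a hint for every metric whose score is below its threshold."""
    hints = []
    for name, score in scores.items():
        # Metrics without an explicit threshold fall back to 0.7.
        if score >= thresholds.get(name, 0.7):
            continue
        if name in RETRIEVAL_METRICS:
            stage = "retriever"
        elif name in GENERATION_METRICS:
            stage = "generator"
        else:
            stage = "unknown stage"
        hints.append(f"{name}={score:.2f} is below threshold: inspect the {stage}")
    return hints


scores = {
    "contextual_precision": 0.45,
    "contextual_recall": 0.82,
    "faithfulness": 0.91,
    "answer_relevancy": 0.88,
}
thresholds = {
    "contextual_precision": 0.7,
    "contextual_recall": 0.7,
    "faithfulness": 0.8,
    "answer_relevancy": 0.8,
}
for hint in triage(scores, thresholds):
    print(hint)
```

With these made-up scores only contextual precision is flagged, which points at the retriever rather than the prompt or the model.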
Hallucination Detection
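One detail to keep in mind when reading the test in this section: as the comment on its threshold indicates, a lower hallucination score means less hallucination, so the threshold acts as a ceiling rather than a floor. The plain-Python sketch below contrasts the two directions. The `passes` helper is hypothetical and exists only to illustrate the comparison; it is not a DeepEval function.

```python
def passes(score: float, threshold: float, lower_is_better: bool = False) -> bool:
    """Illustrative pass rule: a floor for most metrics, a ceiling for hallucination-style ones."""
    if lower_is_better:
        return score <= threshold
    return score >= threshold


# Faithfulness-style metric: higher is better, threshold is a minimum.
print(passes(0.88, threshold=0.8))
# Hallucination-style metric: lower is better, threshold is a maximum.
print(passes(0.25, threshold=0.5, lower_is_better=True))
print(passes(0.75, threshold=0.5, lower_is_better=True))
```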
from deepeval.metrics import HallucinationMetric

def test_no_hallucination():
    context = [
        "HelpMeTest Pro costs $100/month.",
        "The free plan allows up to 10 tests.",
    ]
    test_case = LLMTestCase(
        input="How much does HelpMeTest cost?",
        actual_output="HelpMeTest offers a free plan with up to 10 tests, and a Pro plan at $100/month with unlimited tests.",
        context=context,  # Note: HallucinationMetric uses 'context', not 'retrieval_context'
    )
    # Lower is better here: the test passes when the score is at or below 0.5
    metric = HallucinationMetric(threshold=0.5)
    assert_test(test_case, [metric])
Dataset-Driven Evaluation
One-off tests aren't enough. Build golden datasets and evaluate against them:
# Assumes the imports and ask_llm from test_qa.py above
from deepeval.dataset import EvaluationDataset, Golden

# Define golden examples
goldens = [
    Golden(
        input="What is HelpMeTest?",
        expected_output="HelpMeTest is a cloud-hosted SaaS platform for automated testing using Robot Framework and Playwright.",
    ),
    Golden(
        input="Does HelpMeTest support self-hosting?",
        expected_output="No, HelpMeTest is a cloud-hosted SaaS. Self-hosting is not available.",
    ),
    Golden(
        input="What monitoring intervals does HelpMeTest offer?",
        expected_output="HelpMeTest monitors every 5 minutes on the free plan and every 10 seconds on the Enterprise plan.",
    ),
]
dataset = EvaluationDataset(goldens=goldens)

# Evaluate the dataset: one pytest case per golden
@pytest.mark.parametrize("golden", dataset.goldens)
def test_golden_dataset(golden):
    actual = ask_llm(golden.input, context=[])
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual,
        expected_output=golden.expected_output,
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
Using a Local Model as Judge
By default, DeepEval uses GPT-4 to judge outputs. For cost control or privacy, use a local model:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import DeepEvalBaseLLM
from ollama import Client

class OllamaJudge(DeepEvalBaseLLM):
    def __init__(self, model="llama3.1:8b"):
        self.model = model
        self.client = Client()

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.generate(
            model=self.model,
            prompt=prompt,
        )
        return response["response"]

    async def a_generate(self, prompt: str) -> str:
        # Simple fallback: reuses the blocking call inside the async method
        return self.generate(prompt)

    def get_model_name(self):
        return f"ollama/{self.model}"

# Use it in metrics
judge = OllamaJudge()
metric = AnswerRelevancyMetric(threshold=0.7, model=judge)
Synthesizing Test Cases
Don't have a golden dataset? Generate one from your documentation:
from deepeval.dataset import EvaluationDataset
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
# Generate goldens from your documents.
# The name of the per-document limit argument may differ between DeepEval
# releases; check the version you have installed.
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/helpmetest-guide.pdf", "docs/api-reference.md"],
    max_goldens_per_document=10,
)
dataset = EvaluationDataset(goldens=goldens)
print(f"Generated {len(dataset.goldens)} test cases")
The synthesizer uses your documents to produce realistic questions and expected answers, accelerating coverage on new features.
CI Integration
Add DeepEval to your CI pipeline so quality regressions fail the build.
GitHub Actions
name: LLM Quality Tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install deepeval pytest
      - name: Run LLM tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: pytest tests/llm/ -v --tb=short
Handling Flakiness
LLM tests can flake due to model non-determinism. Use temperature=0 in your model under test, and set conservative thresholds with a buffer:
# Instead of threshold=0.8 (fails at 0.79)
# Use threshold=0.65 with a comment explaining the headroom
metric = FaithfulnessMetric(
    threshold=0.65,  # conservative: model consistently scores 0.78-0.92 in baseline
    include_reason=True,
)
Interpreting Failures
When a test fails, DeepEval explains why:
FAILED test_qa.py::test_faithfulness
FaithfulnessMetric: 0.41 (threshold: 0.80) ✗
Reason: The actual output claims HelpMeTest supports self-hosted deployments,
but the retrieval context only mentions cloud-hosted SaaS. This claim is not
supported by the provided context and constitutes a hallucination.
Statements that are not faithful:
- "You can deploy HelpMeTest on your own infrastructure"
This tells you exactly what failed and why, which is far more useful than a generic assertion error.
Putting It All Together
A production-grade test file for a RAG chatbot:
# tests/llm/test_chatbot.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)
from myapp.chatbot import answer_question, retrieve_context

ANSWER_RELEVANCY = AnswerRelevancyMetric(threshold=0.7, include_reason=True)
FAITHFULNESS = FaithfulnessMetric(threshold=0.75, include_reason=True)
HALLUCINATION = HallucinationMetric(threshold=0.4)
CONCISE = GEval(
    name="Conciseness",
    criteria="The answer is direct and does not pad with unnecessary text.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

@pytest.fixture
def qa_case():
    # Factory fixture: builds a test case by running retrieval and generation
    def _build(question):
        context = retrieve_context(question)
        answer = answer_question(question, context)
        return LLMTestCase(
            input=question,
            actual_output=answer,
            retrieval_context=context,
            context=context,  # HallucinationMetric reads 'context'
        )
    return _build

def test_pricing_question(qa_case):
    assert_test(qa_case("What does HelpMeTest Pro cost?"), [
        ANSWER_RELEVANCY, FAITHFULNESS, CONCISE
    ])

def test_feature_question(qa_case):
    assert_test(qa_case("Does HelpMeTest support visual testing?"), [
        ANSWER_RELEVANCY, FAITHFULNESS, HALLUCINATION
    ])

def test_negative_question(qa_case):
    """Model should answer 'no' without inventing features."""
    assert_test(qa_case("Can I self-host HelpMeTest?"), [
        FAITHFULNESS, HALLUCINATION
    ])
DeepEval vs Manual Evaluation
| | Manual review | DeepEval |
|---|---|---|
| Speed | Hours per batch | Seconds |
| Consistency | Varies by reviewer | Deterministic thresholds |
| Coverage | Sample only | Every test, every build |
| Regression detection | None | Automatic |
| Cost | Engineer time | ~$0.01–0.05 per test case |
For teams shipping LLM features, DeepEval's cost-per-test is trivially small compared to the cost of a quality regression reaching users.
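As a rough worked example of that cost claim, the sketch below multiplies the per-test-case range from the table by a suite size and a build count. The suite size (30 test cases) and the build frequency (100 builds per month) are assumptions chosen for illustration, and `monthly_cost_range` is a hypothetical helper, not part of DeepEval.

```python
def monthly_cost_range(
    test_cases: int,
    builds_per_month: int,
    low: float = 0.01,
    high: float = 0.05,
) -> tuple[float, float]:
    """Estimate monthly judge cost in USD from a per-test-case price range."""
    runs = test_cases * builds_per_month
    return round(runs * low, 2), round(runs * high, 2)


# Assumed workload: 30 test cases evaluated on each of 100 builds per month.
low_usd, high_usd = monthly_cost_range(test_cases=30, builds_per_month=100)
print(f"Estimated judge cost: ${low_usd:.2f} to ${high_usd:.2f} per month")
```

Each metric attached to a test case can trigger its own judge calls, so treat the result as an order-of-magnitude estimate rather than a bill.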
Next Steps
- Add to CI now — even three test cases with conservative thresholds will catch regressions
- Build your golden dataset — start with 10 representative questions, grow from there
- Try the synthesizer — generate coverage from your existing docs
- Explore Ragas for deeper RAG pipeline metrics including context precision at rank k
For continuous monitoring beyond unit tests — including scheduled runs and alerting — HelpMeTest runs your evaluation suites on a schedule and alerts when scores drop below threshold.