Building an LLM Evaluation Framework from Scratch: Metrics, Datasets, and CI
Off-the-shelf eval frameworks (DeepEval, Ragas, TruLens) cover 80% of use cases. For the remaining 20% — specialized domains, proprietary metrics, internal tooling constraints — you need to build your own. This guide walks you through designing metrics, curating datasets, implementing an LLM judge, and wiring it all into CI.
When to Build Your Own
Use DeepEval, Ragas, or TruLens if their built-in metrics cover your needs. Build your own when:
- Domain-specific correctness — "correct" in your domain can't be captured by generic faithfulness or relevancy scores
- Proprietary constraints — you can't send data to third-party eval APIs (medical records, legal documents)
- Custom scoring logic — combining multiple signals (rule-based + LLM-judged + human feedback) in ways frameworks don't support
- Internal tool integration — you need scores flowing into your own dashboards, alerting, or reporting infrastructure
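The "custom scoring logic" case above can be sketched as a weighted combination of signals. This is a minimal illustration under stated assumptions, not something any framework provides: `combine_signals`, the signal names, and the weights are all hypothetical, and skipping missing signals with renormalized weights is only one possible policy.

```python
from typing import Optional


def combine_signals(
    signals: dict[str, Optional[float]],
    weights: dict[str, float],
) -> float:
    """Weighted mean over the signals that are present (not None).

    Missing signals (for example, no human feedback yet) are skipped and
    the remaining weights are renormalized.
    """
    present = {k: v for k, v in signals.items() if v is not None}
    total_weight = sum(weights[k] for k in present)
    if total_weight == 0:
        raise ValueError("no weighted signals available")
    return sum(weights[k] * v for k, v in present.items()) / total_weight


# Hypothetical weights; tune them for your domain.
WEIGHTS = {"rule_based": 0.2, "llm_judged": 0.5, "human_feedback": 0.3}

score = combine_signals(
    {"rule_based": 1.0, "llm_judged": 0.8, "human_feedback": None},
    WEIGHTS,
)
print(f"{score:.3f}")
```

Renormalizing keeps the combined score on the same 0.0 to 1.0 scale when a signal is unavailable, so one threshold can still be applied to it.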
Framework Architecture
A minimal eval framework has four components:
┌─────────────────┐
│  Test Dataset   │  Golden QA pairs, edge cases, adversarial examples
└────────┬────────┘
         │
┌────────▼────────┐
│  System Under   │  Your LLM app — retriever + generator + tools
│      Test       │
└────────┬────────┘
         │
┌────────▼────────┐
│   Evaluators    │  Rule-based, embedding-based, LLM-judged
└────────┬────────┘
         │
┌────────▼────────┐
│  Results Store  │  Scores, history, trend data
└─────────────────┘

Step 1: Define Your Metrics
Before writing code, write down what "correct" means for each metric. Vague metrics produce noisy scores.
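One way to make that definition concrete is to record it next to a threshold and one passing and one failing example before any scoring code exists. The sketch below is only an illustration: `MetricSpec`, its fields, and the eight-word minimum are hypothetical choices, not part of the framework built in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Written definition of a metric, agreed on before implementation."""
    name: str
    definition: str        # what "correct" means, in one or two sentences
    threshold: float       # minimum passing score, 0.0 to 1.0
    passing_example: str   # an answer that should clearly pass
    failing_example: str   # an answer that should clearly fail

    def problems(self) -> list[str]:
        """Return the reasons this spec is still too vague to use."""
        issues = []
        # Arbitrary heuristic: very short definitions tend to be ambiguous.
        if len(self.definition.split()) < 8:
            issues.append("definition is too short to be unambiguous")
        if not 0.0 <= self.threshold <= 1.0:
            issues.append("threshold must be between 0.0 and 1.0")
        if not self.passing_example or not self.failing_example:
            issues.append("add one passing and one failing example")
        return issues


faithfulness = MetricSpec(
    name="faithfulness",
    definition="Every claim in the answer is directly supported by the retrieved context.",
    threshold=0.75,
    passing_example="HelpMeTest is cloud-hosted SaaS.",
    failing_example="HelpMeTest can be installed on your own servers.",
)
print(faithfulness.problems())  # prints []
```

The passing and failing examples also work as a quick sanity check for any scoring function written against the spec later.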
# metrics.py
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    metric_name: str
    score: float  # 0.0 to 1.0
    passed: bool
    reason: str
    threshold: float


@dataclass
class TestCase:
    id: str
    question: str
    context: list[str]  # retrieved documents
    answer: str  # model output
    ground_truth: Optional[str] = None
    metadata: Optional[dict] = None

Metric Types
1. Rule-Based Metrics — fast, deterministic, no LLM cost:
def check_response_length(test_case: TestCase, min_words=10, max_words=200) -> EvalResult:
    words = len(test_case.answer.split())
    passed = min_words <= words <= max_words
    return EvalResult(
        metric_name="response_length",
        score=1.0 if passed else 0.0,
        passed=passed,
        reason=f"Response is {words} words (expected {min_words}-{max_words})",
        threshold=1.0
    )


def check_no_forbidden_phrases(test_case: TestCase) -> EvalResult:
    # Case-insensitive substring match, so "self-host" also matches "Self-hosting"
    FORBIDDEN = ["I cannot", "As an AI", "self-host", "deploy yourself"]
    violations = [p for p in FORBIDDEN if p.lower() in test_case.answer.lower()]
    return EvalResult(
        metric_name="no_forbidden_phrases",
        score=0.0 if violations else 1.0,
        passed=len(violations) == 0,
        reason=f"Found forbidden phrases: {violations}" if violations else "No forbidden phrases",
        threshold=1.0
    )

2. Embedding-Based Metrics — semantic similarity without LLM judge cost:
from openai import OpenAI
import numpy as np

client = OpenAI()


def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_similarity(test_case: TestCase, threshold=0.75) -> EvalResult:
    if not test_case.ground_truth:
        return EvalResult(
            metric_name="semantic_similarity",
            score=0.0,
            passed=False,
            reason="No ground truth provided",
            threshold=threshold
        )
    answer_emb = get_embedding(test_case.answer)
    truth_emb = get_embedding(test_case.ground_truth)
    score = cosine_similarity(answer_emb, truth_emb)
    return EvalResult(
        metric_name="semantic_similarity",
        score=score,
        passed=score >= threshold,
        reason=f"Semantic similarity: {score:.2f}",
        threshold=threshold
    )

3. LLM-as-Judge — flexible, high-quality, higher cost:
import json


def llm_judge(
    test_case: TestCase,
    criterion: str,
    metric_name: str,
    threshold: float = 0.7
) -> EvalResult:
    """Generic LLM judge with structured output."""
    prompt = f"""You are an evaluator for an LLM application. Score the following response.
CRITERION: {criterion}
QUESTION: {test_case.question}
CONTEXT PROVIDED TO MODEL:
{chr(10).join(f'- {c}' for c in test_case.context)}
MODEL RESPONSE:
{test_case.answer}
Score the response on a scale of 0.0 to 1.0 based on the criterion above.
Return valid JSON only:
{{"score": <float 0.0-1.0>, "reason": "<one sentence explanation>"}}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    # Clamp in case the judge returns a value outside the requested range
    score = max(0.0, min(1.0, float(result["score"])))
    return EvalResult(
        metric_name=metric_name,
        score=score,
        passed=score >= threshold,
        reason=result.get("reason", ""),
        threshold=threshold
    )

Step 2: Curate a Golden Dataset
Your dataset is the foundation. Invest in quality over quantity — 50 well-crafted examples beat 500 sloppy ones.
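Part of that quality work can be automated. The sketch below runs basic hygiene checks over examples represented as plain dicts with `id`, `question`, `ground_truth`, and `tags` keys. The `validate_examples` helper and its `required_tags` parameter are hypothetical, and the checks are a starting point rather than a complete definition of a well-crafted example.

```python
from collections import Counter


def validate_examples(examples: list[dict], required_tags: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []

    # IDs must be unique so results can be tracked across runs.
    id_counts = Counter(e["id"] for e in examples)
    for example_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id: {example_id}")

    # Every example needs a non-empty question and ground truth.
    for e in examples:
        for field in ("question", "ground_truth"):
            if not str(e.get(field) or "").strip():
                problems.append(f"{e['id']}: empty {field}")

    # Each tag you care about should be covered by at least one example.
    seen_tags = {t for e in examples for t in (e.get("tags") or [])}
    for tag in sorted(required_tags - seen_tags):
        problems.append(f"no example tagged '{tag}'")

    return problems


# Deliberately flawed sample data to show what gets flagged.
examples = [
    {"id": "pricing-001", "question": "What does HelpMeTest Pro cost?",
     "ground_truth": "HelpMeTest Pro costs $100 per month.", "tags": ["pricing"]},
    {"id": "pricing-001", "question": "Can I self-host HelpMeTest?",
     "ground_truth": "", "tags": ["deployment"]},
]
for problem in validate_examples(examples, required_tags={"pricing", "edge-case"}):
    print(problem)
```

Running a check like this before every evaluation keeps a small dataset trustworthy as people add examples over time.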
# dataset.py
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class GoldenExample:
    id: str
    question: str
    ground_truth: str
    context_hint: Optional[str] = None  # what context should be retrieved
    tags: Optional[list[str]] = None  # "pricing", "features", "edge-case"

    def to_dict(self):
        return asdict(self)


GOLDEN_EXAMPLES = [
    GoldenExample(
        id="pricing-001",
        question="What does HelpMeTest Pro cost?",
        ground_truth="HelpMeTest Pro costs $100 per month and includes unlimited tests with parallel execution.",
        tags=["pricing", "core"]
    ),
    GoldenExample(
        id="self-host-001",
        question="Can I self-host HelpMeTest?",
        ground_truth="No, HelpMeTest is cloud-hosted SaaS. Self-hosting is not available.",
        tags=["deployment", "negative-answer", "edge-case"]
    ),
    GoldenExample(
        id="monitoring-001",
        question="What monitoring interval does the free plan use?",
        ground_truth="The free plan monitors every 5 minutes.",
        tags=["monitoring", "pricing", "core"]
    ),
    GoldenExample(
        id="visual-testing-001",
        question="What viewports does visual testing support?",
        ground_truth="HelpMeTest visual testing supports mobile, tablet, and desktop viewports.",
        tags=["features", "visual-testing"]
    ),
    GoldenExample(
        id="framework-001",
        question="What testing framework does HelpMeTest use?",
        ground_truth="HelpMeTest uses Robot Framework with Playwright for browser automation.",
        tags=["features", "technical"]
    ),
    GoldenExample(
        id="health-check-001",
        question="How do I set up health monitoring in HelpMeTest?",
        ground_truth="Run: helpmetest health <name> <grace_period>. Grace period options: 30s, 5m, 2h, 1d.",
        tags=["features", "how-to"]
    ),
    GoldenExample(
        id="enterprise-001",
        question="What does the Enterprise plan include?",
        ground_truth="Enterprise includes 10-second monitoring intervals, SSO, priority support, QA team outsourcing, and custom features.",
        tags=["pricing", "enterprise"]
    ),
]


def save_dataset(examples: list[GoldenExample], path: str = "eval-dataset.json"):
    with open(path, "w") as f:
        json.dump([e.to_dict() for e in examples], f, indent=2)


def load_dataset(path: str = "eval-dataset.json") -> list[GoldenExample]:
    with open(path) as f:
        return [GoldenExample(**item) for item in json.load(f)]

Step 3: Build the Evaluation Runner
# evaluator.py
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

from dataset import GoldenExample
from metrics import EvalResult, TestCase


@dataclass
class EvalRun:
    example_id: str
    question: str
    answer: str
    context: list[str]
    results: list[EvalResult]

    @property
    def passed(self) -> bool:
        return all(r.passed for r in self.results)

    @property
    def scores(self) -> dict[str, float]:
        return {r.metric_name: r.score for r in self.results}


class Evaluator:
    def __init__(self, app_fn: Callable, metrics: list[Callable]):
        """
        app_fn: callable that takes question -> (answer, retrieved_context)
        metrics: list of metric functions that take TestCase -> EvalResult
        """
        self.app_fn = app_fn
        self.metrics = metrics

    def evaluate_example(self, example: GoldenExample) -> EvalRun:
        # Run the app
        answer, context = self.app_fn(example.question)
        # Build test case
        test_case = TestCase(
            id=example.id,
            question=example.question,
            context=context,
            answer=answer,
            ground_truth=example.ground_truth
        )
        # Run all metrics; a metric that raises is recorded as a failure
        results = []
        for metric_fn in self.metrics:
            try:
                result = metric_fn(test_case)
                results.append(result)
            except Exception as e:
                results.append(EvalResult(
                    # functools.partial objects have no __name__ unless one is set
                    metric_name=getattr(metric_fn, "__name__", repr(metric_fn)),
                    score=0.0,
                    passed=False,
                    reason=f"Metric error: {e}",
                    threshold=0.0
                ))
        return EvalRun(
            example_id=example.id,
            question=example.question,
            answer=answer,
            context=context,
            results=results
        )

    def evaluate_dataset(
        self,
        dataset: list[GoldenExample],
        max_workers: int = 4
    ) -> list[EvalRun]:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            runs = list(executor.map(self.evaluate_example, dataset))
        return runs

Step 4: Results Storage and Trending
# results.py
import sqlite3
from datetime import datetime, timezone

from evaluator import EvalRun


class ResultsStore:
    def __init__(self, db_path: str = "eval-results.db"):
        self.conn = sqlite3.connect(db_path)
        self._create_tables()

    def _create_tables(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS runs (
                run_id TEXT PRIMARY KEY,
                timestamp TEXT,
                app_version TEXT,
                example_id TEXT,
                question TEXT,
                answer TEXT,
                passed INTEGER
            )
        """)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS scores (
                run_id TEXT,
                metric_name TEXT,
                score REAL,
                passed INTEGER,
                reason TEXT,
                FOREIGN KEY (run_id) REFERENCES runs(run_id)
            )
        """)
        self.conn.commit()

    def save_run(self, run: EvalRun, app_version: str, run_prefix: str = ""):
        # Take one timestamp so run_id and the stored timestamp agree
        timestamp = datetime.now(timezone.utc).isoformat()
        run_id = f"{run_prefix}{run.example_id}-{timestamp}"
        self.conn.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?)",
            (run_id, timestamp, app_version, run.example_id,
             run.question, run.answer, int(run.passed))
        )
        for result in run.results:
            self.conn.execute(
                "INSERT INTO scores VALUES (?, ?, ?, ?, ?)",
                (run_id, result.metric_name, result.score,
                 int(result.passed), result.reason)
            )
        self.conn.commit()

    def get_metric_trend(self, metric_name: str, limit: int = 20) -> list[dict]:
        """Get the average score for a metric per app version, newest first."""
        # Each run_id covers a single example, so aggregate per app_version
        cursor = self.conn.execute("""
            SELECT MAX(r.timestamp) AS last_run, r.app_version, AVG(s.score) AS avg_score
            FROM scores s
            JOIN runs r ON s.run_id = r.run_id
            WHERE s.metric_name = ?
            GROUP BY r.app_version
            ORDER BY last_run DESC
            LIMIT ?
        """, (metric_name, limit))
        return [{"timestamp": row[0], "version": row[1], "score": row[2]}
                for row in cursor.fetchall()]

    def get_failing_examples(self, metric_name: str, threshold: float) -> list[dict]:
        cursor = self.conn.execute("""
            SELECT r.example_id, r.question, r.answer, s.score, s.reason
            FROM scores s
            JOIN runs r ON s.run_id = r.run_id
            WHERE s.metric_name = ? AND s.score < ?
            ORDER BY s.score ASC
            LIMIT 20
        """, (metric_name, threshold))
        cols = ["example_id", "question", "answer", "score", "reason"]
        return [dict(zip(cols, row)) for row in cursor.fetchall()]

Step 5: Reporting
# report.py
import statistics

from tabulate import tabulate

from evaluator import EvalRun


def print_report(runs: list[EvalRun]) -> bool:
    """Print a human-readable evaluation report. Returns True if every example passed."""
    print(f"\n{'='*60}")
    print(f"EVALUATION REPORT — {len(runs)} examples")
    print(f"{'='*60}\n")

    # Aggregate by metric
    metric_results = {}
    for run in runs:
        for result in run.results:
            metric_results.setdefault(result.metric_name, []).append(result)

    table_data = []
    for metric, results in metric_results.items():
        scores = [r.score for r in results]
        avg = statistics.mean(scores)
        std = statistics.stdev(scores) if len(scores) > 1 else 0
        # Count passes with each metric's own threshold, not a fixed 0.7 cutoff
        pass_rate = sum(1 for r in results if r.passed) / len(results)
        table_data.append([metric, f"{avg:.3f}", f"±{std:.3f}", f"{pass_rate:.0%}"])
    print(tabulate(
        table_data,
        headers=["Metric", "Avg Score", "Std Dev", "Pass Rate"],
        tablefmt="rounded_outline"
    ))

    # Per-example failures
    failed_runs = [r for r in runs if not r.passed]
    if failed_runs:
        print(f"\n{len(failed_runs)} FAILING EXAMPLES:\n")
        for run in failed_runs:
            print(f" [{run.example_id}] {run.question[:60]}")
            for result in run.results:
                if not result.passed:
                    print(f" ✗ {result.metric_name}: {result.score:.2f} — {result.reason}")
    else:
        print("\nAll examples passed.")
    print()
    return len(failed_runs) == 0

Step 6: Wire It Together
# eval.py — the entry point
import sys
from functools import partial

from metrics import (
    check_response_length,
    check_no_forbidden_phrases,
    semantic_similarity,
    llm_judge,
)
from dataset import GOLDEN_EXAMPLES
from evaluator import Evaluator
from results import ResultsStore
from report import print_report

# Import your app
from myapp.chatbot import answer_question, retrieve_context


def run_app(question: str) -> tuple[str, list[str]]:
    context = retrieve_context(question)
    answer = answer_question(question, context)
    return answer, context


# Compose metrics
# Rule-based (fast, free)
rule_metrics = [
    check_response_length,
    check_no_forbidden_phrases,
]

# Semantic (cheap)
similarity_metric = partial(semantic_similarity, threshold=0.70)
# partial objects have no __name__; set one so metric errors can be labeled
similarity_metric.__name__ = "semantic_similarity"

# LLM-judged (accurate but costs money)
faithfulness_metric = partial(
    llm_judge,
    criterion="The answer only makes claims that are directly supported by the provided context. It does not invent facts.",
    metric_name="faithfulness",
    threshold=0.75
)
faithfulness_metric.__name__ = "faithfulness"

relevancy_metric = partial(
    llm_judge,
    criterion="The answer directly addresses the user's question. It is not off-topic or evasive.",
    metric_name="answer_relevancy",
    threshold=0.75
)
relevancy_metric.__name__ = "answer_relevancy"

# Run evaluation
evaluator = Evaluator(
    app_fn=run_app,
    metrics=[*rule_metrics, similarity_metric, faithfulness_metric, relevancy_metric]
)
store = ResultsStore()
app_version = sys.argv[1] if len(sys.argv) > 1 else "dev"
print(f"Evaluating {len(GOLDEN_EXAMPLES)} examples (version: {app_version})...")
runs = evaluator.evaluate_dataset(GOLDEN_EXAMPLES, max_workers=4)

# Save results
for run in runs:
    store.save_run(run, app_version)

# Print report and exit non-zero on any failure so CI marks the job as failed
all_passed = print_report(runs)
sys.exit(0 if all_passed else 1)

Step 7: CI Integration
# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  push:
    branches: [main]
  pull_request:
    paths:
      - 'src/**'
      - 'prompts/**'
      - 'eval/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval.py ${{ github.sha }}
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results.db

Cost Management
LLM-judged metrics cost money, so choose which metrics run on each trigger:
# Run cheap metrics on every PR, expensive ones nightly
import os

ENV = os.environ.get("EVAL_MODE", "fast")
if ENV == "fast":
    # PR checks: rule-based + semantic only
    metrics = [*rule_metrics, similarity_metric]
elif ENV == "full":
    # Nightly: full suite including LLM judges
    metrics = [*rule_metrics, similarity_metric, faithfulness_metric, relevancy_metric]
else:
    # Fail loudly instead of leaving `metrics` undefined
    raise ValueError(f"Unknown EVAL_MODE: {ENV!r}")

Nightly cron in GitHub Actions:
on:
  schedule:
    - cron: "0 3 * * *"  # 3am UTC daily
  push:
    branches: [main]

jobs:
  evaluate:
    env:
      EVAL_MODE: ${{ github.event_name == 'schedule' && 'full' || 'fast' }}

Estimated Costs
For 50 golden examples with full evaluation:
| Metric type | Cost per run |
|---|---|
| Rule-based | $0 |
| Embedding similarity | ~$0.001 |
| LLM judge (GPT-4o-mini) | ~$0.05 |
| LLM judge (GPT-4o) | ~$0.30 |
Full nightly runs on GPT-4o: ~$9/month. Cheap compared to the cost of a quality regression.
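The monthly figure follows from the per-run estimates in the table, assuming one full run per night and a 30-day month. The sketch below reproduces that arithmetic so you can plug in your own numbers. The per-run costs are this article's rough estimates, and `monthly_cost` is a hypothetical helper.

```python
# Per-run cost estimates from the table above, in US dollars (50 examples).
COST_PER_RUN = {
    "rule_based": 0.0,
    "embedding_similarity": 0.001,
    "llm_judge_gpt_4o_mini": 0.05,
    "llm_judge_gpt_4o": 0.30,
}


def monthly_cost(metric_types: list[str], runs_per_day: float = 1, days: int = 30) -> float:
    """Estimated monthly spend for running the given metric types."""
    per_run = sum(COST_PER_RUN[m] for m in metric_types)
    return per_run * runs_per_day * days


nightly_full = monthly_cost(["rule_based", "embedding_similarity", "llm_judge_gpt_4o"])
print(f"${nightly_full:.2f}/month")  # prints $9.03/month
```

The GPT-4o judge accounts for $9.00 of that total, so swapping it for the GPT-4o-mini estimate would bring the same schedule to roughly $1.50 per month for the judge.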
What the Frameworks Give You for Free
Before committing to a custom build, re-evaluate what you'd miss:
| Feature | Build time | Frameworks provide |
|---|---|---|
| Core RAG metrics | ~2 days | DeepEval, Ragas — instant |
| Dataset management | ~1 day | LangSmith — full-featured |
| Dashboard | ~3 days | TruLens — built-in |
| Red-teaming | ~5 days | Promptfoo — comprehensive |
| Custom metrics | 0 | All support custom evaluators |
The right answer is often: use a framework for 80%, extend with custom metrics for the remaining 20%.
Next Steps
- Start with rule-based metrics — they're free and catch obvious failures
- Add one LLM-judged metric for faithfulness or relevancy
- Curate 20 golden examples covering your key use cases
- Wire into CI with fast mode — run expensive judges only nightly
- Check DeepEval before rolling your own — its G-Eval covers most custom criteria without code
For teams who want to run these evaluation pipelines on a schedule with alerting and trend dashboards — without building and maintaining the infrastructure — HelpMeTest runs your eval scripts as scheduled health checks and alerts when scores drop below threshold.