Building an LLM Evaluation Framework from Scratch: Metrics, Datasets, and CI

Off-the-shelf eval frameworks (DeepEval, Ragas, TruLens) cover 80% of use cases. For the remaining 20% — specialized domains, proprietary metrics, internal tooling constraints — you need to build your own. This guide walks you through designing metrics, curating datasets, implementing an LLM judge, and wiring it all into CI.


When to Build Your Own

Use DeepEval, Ragas, or TruLens if their built-in metrics cover your needs. Build your own when:

  • Domain-specific correctness — "correct" in your domain can't be captured by generic faithfulness or relevancy scores
  • Proprietary constraints — you can't send data to third-party eval APIs (medical records, legal documents)
  • Custom scoring logic — combining multiple signals (rule-based + LLM-judged + human feedback) in ways frameworks don't support
  • Internal tool integration — you need scores flowing into your own dashboards, alerting, or reporting infrastructure

Framework Architecture

A minimal eval framework has four components:

┌─────────────────┐
│   Test Dataset  │  Golden QA pairs, edge cases, adversarial examples
└────────┬────────┘
         │
┌────────▼────────┐
│  System Under   │  Your LLM app — retriever + generator + tools
│     Test        │
└────────┬────────┘
         │
┌────────▼────────┐
│    Evaluators   │  Rule-based, embedding-based, LLM-judged
└────────┬────────┘
         │
┌────────▼────────┐
│  Results Store  │  Scores, history, trend data
└─────────────────┘
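
As a rough illustration of how data moves through the four boxes, the sketch below wires them together with stand-in functions. The names `run_pipeline`, `fake_app`, and `exact_match` are hypothetical and exist only for this illustration; the real components are built step by step in the rest of the guide.

```python
from typing import Callable

# Hypothetical stand-ins for the four components; not part of the guide's code.
Dataset = list[dict]                      # Test Dataset
AppFn = Callable[[str], str]              # System Under Test
MetricFn = Callable[[str, str], float]    # Evaluator: (answer, ground_truth) -> score


def run_pipeline(dataset: Dataset, app: AppFn, metrics: dict[str, MetricFn]) -> list[dict]:
    """Push every example through the app, score it, and collect rows for the store."""
    results_store: list[dict] = []        # Results Store (in memory for this sketch)
    for example in dataset:
        answer = app(example["question"])
        scores = {name: fn(answer, example["ground_truth"]) for name, fn in metrics.items()}
        results_store.append({"id": example["id"], "answer": answer, "scores": scores})
    return results_store


def fake_app(question: str) -> str:
    """Toy system under test that always gives the same answer."""
    return "The free plan monitors every 5 minutes."


def exact_match(answer: str, ground_truth: str) -> float:
    """Toy rule-based evaluator: 1.0 on an exact string match, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


dataset = [{
    "id": "monitoring-001",
    "question": "What monitoring interval does the free plan use?",
    "ground_truth": "The free plan monitors every 5 minutes.",
}]
rows = run_pipeline(dataset, fake_app, {"exact_match": exact_match})
print(rows[0]["scores"])
```

Each stage only depends on the output of the previous one, which is what lets you swap the app, the metrics, or the store independently later on.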

Step 1: Define Your Metrics

Before writing code, write down what "correct" means for each metric. Vague metrics produce noisy scores.

# metrics.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    metric_name: str
    score: float           # 0.0 to 1.0
    passed: bool
    reason: str
    threshold: float

@dataclass  
class TestCase:
    id: str
    question: str
    context: list[str]    # retrieved documents
    answer: str           # model output
    ground_truth: Optional[str] = None
    metadata: Optional[dict] = None
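
One way to keep scores honest is to let the result type enforce its own contract. The sketch below is a hypothetical variant of `EvalResult` (the name `CheckedEvalResult` is an assumption, not part of the guide's code) that rejects out-of-range scores and derives `passed` from the score and threshold instead of storing it separately.

```python
from dataclasses import dataclass


@dataclass
class CheckedEvalResult:
    """Hypothetical variant of EvalResult that validates its own fields."""
    metric_name: str
    score: float
    reason: str
    threshold: float

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0.0 to 1.0 range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"{self.metric_name}: score {self.score} outside [0.0, 1.0]")

    @property
    def passed(self) -> bool:
        # Derived from score and threshold, so it cannot drift out of sync.
        return self.score >= self.threshold


ok = CheckedEvalResult("faithfulness", 0.82, "Claims match the context", threshold=0.75)
print(ok.passed)
```

The trade-off is that a derived `passed` cannot express metrics where passing is not a simple threshold comparison, which may be why the guide keeps it as an explicit field.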

Metric Types

1. Rule-Based Metrics — fast, deterministic, no LLM cost:

def check_response_length(test_case: TestCase, min_words=10, max_words=200) -> EvalResult:
    words = len(test_case.answer.split())
    passed = min_words <= words <= max_words
    return EvalResult(
        metric_name="response_length",
        score=1.0 if passed else 0.0,
        passed=passed,
        reason=f"Response is {words} words (expected {min_words}-{max_words})",
        threshold=1.0
    )

def check_no_forbidden_phrases(test_case: TestCase) -> EvalResult:
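    # Caution: this is a substring match, so "self-host" also matches
    # "self-hosting". A correct "self-hosting is not available" answer would
    # fail this check; tune the list to your own product.
    FORBIDDEN = ["I cannot", "As an AI", "self-host", "deploy yourself"]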
    violations = [p for p in FORBIDDEN if p.lower() in test_case.answer.lower()]
    
    return EvalResult(
        metric_name="no_forbidden_phrases",
        score=0.0 if violations else 1.0,
        passed=len(violations) == 0,
        reason=f"Found forbidden phrases: {violations}" if violations else "No forbidden phrases",
        threshold=1.0
    )
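
Plain substring matching, as used in `check_no_forbidden_phrases`, can flag text that merely contains a phrase inside a longer word. A possible refinement, sketched here with an illustrative phrase list and a hypothetical helper name, is to match on word boundaries with `re`:

```python
import re

# Illustrative phrase list; tune it for your own product.
FORBIDDEN = ["I cannot", "As an AI", "deploy yourself"]

# One case-insensitive pattern per phrase, anchored on word boundaries.
_PATTERNS = {p: re.compile(rf"\b{re.escape(p)}\b", re.IGNORECASE) for p in FORBIDDEN}


def find_forbidden_phrases(answer: str) -> list[str]:
    """Return the forbidden phrases that appear as whole words in the answer."""
    return [phrase for phrase, pattern in _PATTERNS.items() if pattern.search(answer)]


print(find_forbidden_phrases("As an AI, I cannot answer that."))
print(find_forbidden_phrases("Hawaii cannot be selected as a region."))
```

The second call returns an empty list, whereas a lowercase substring check would have matched "i cannot" inside "Hawaii cannot".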

2. Embedding-Based Metrics — semantic similarity without LLM judge cost:

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_similarity(test_case: TestCase, threshold=0.75) -> EvalResult:
    if not test_case.ground_truth:
        return EvalResult(
            metric_name="semantic_similarity",
            score=0.0,
            passed=False,
            reason="No ground truth provided",
            threshold=threshold
        )
    
    answer_emb = get_embedding(test_case.answer)
    truth_emb = get_embedding(test_case.ground_truth)
    score = cosine_similarity(answer_emb, truth_emb)
    
    return EvalResult(
        metric_name="semantic_similarity",
        score=score,
        passed=score >= threshold,
        reason=f"Semantic similarity: {score:.2f}",
        threshold=threshold
    )
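
To see what the cosine score means without calling an embedding API, the following dependency-free sketch computes the same quantity on toy vectors. It also guards against zero-length vectors, where the formula above would divide by zero. The function name `safe_cosine_similarity` is hypothetical.

```python
import math


def safe_cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = safe_cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = safe_cosine_similarity([1.0, 0.0], [0.0, 1.0])                # unrelated vectors
zero_vector = safe_cosine_similarity([0.0, 0.0], [1.0, 1.0])               # degenerate input
```

Parallel vectors score close to 1.0 and orthogonal ones score 0.0, which is why a threshold such as 0.75 only makes sense after you have looked at real score distributions for your own answers.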

3. LLM-as-Judge — flexible, high-quality, higher cost:

import json

def llm_judge(
    test_case: TestCase,
    criterion: str,
    metric_name: str,
    threshold: float = 0.7
) -> EvalResult:
    """Generic LLM judge with structured output."""
    
    prompt = f"""You are an evaluator for an LLM application. Score the following response.

CRITERION: {criterion}

QUESTION: {test_case.question}

CONTEXT PROVIDED TO MODEL:
{chr(10).join(f'- {c}' for c in test_case.context)}

MODEL RESPONSE:
{test_case.answer}

Score the response on a scale of 0.0 to 1.0 based on the criterion above.
Return valid JSON only:
{{"score": <float 0.0-1.0>, "reason": "<one sentence explanation>"}}"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    
    result = json.loads(response.choices[0].message.content)
    score = float(result["score"])
    
    return EvalResult(
        metric_name=metric_name,
        score=score,
        passed=score >= threshold,
        reason=result["reason"],
        threshold=threshold
    )
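
The judge above trusts the model to return a well-formed score. A more defensive parsing step might look like the sketch below, which fails closed (score 0.0) on malformed output and clamps out-of-range values. The helper name `parse_judge_output` and the fail-closed policy are assumptions for illustration, not part of the original code.

```python
import json
import math


def parse_judge_output(raw: str) -> tuple[float, str]:
    """Parse a judge reply into (score, reason), failing closed on bad output."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return 0.0, f"Unparseable judge output: {exc}"
    if math.isnan(score):
        return 0.0, "Judge returned NaN"
    # Clamp to the documented range in case the judge returns e.g. 1.2 or -0.1.
    score = max(0.0, min(1.0, score))
    reason = str(data.get("reason", "No reason given"))
    return score, reason


print(parse_judge_output('{"score": 0.8, "reason": "Supported by context"}'))
print(parse_judge_output("Sure! Here is my evaluation..."))
```

In `llm_judge`, this would replace the direct `json.loads` and `float(result["score"])` calls, so a single bad reply shows up as a failing result with a readable reason.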

Step 2: Curate a Golden Dataset

Your dataset is the foundation. Invest in quality over quantity — 50 well-crafted examples beat 500 sloppy ones.

# dataset.py
import json
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class GoldenExample:
    id: str
    question: str
    ground_truth: str
    context_hint: Optional[str] = None  # what context should be retrieved
    tags: Optional[list[str]] = None    # "pricing", "features", "edge-case"
    
    def to_dict(self):
        return asdict(self)

GOLDEN_EXAMPLES = [
    GoldenExample(
        id="pricing-001",
        question="What does HelpMeTest Pro cost?",
        ground_truth="HelpMeTest Pro costs $100 per month and includes unlimited tests with parallel execution.",
        tags=["pricing", "core"]
    ),
    GoldenExample(
        id="self-host-001",
        question="Can I self-host HelpMeTest?",
        ground_truth="No, HelpMeTest is cloud-hosted SaaS. Self-hosting is not available.",
        tags=["deployment", "negative-answer", "edge-case"]
    ),
    GoldenExample(
        id="monitoring-001",
        question="What monitoring interval does the free plan use?",
        ground_truth="The free plan monitors every 5 minutes.",
        tags=["monitoring", "pricing", "core"]
    ),
    GoldenExample(
        id="visual-testing-001",
        question="What viewports does visual testing support?",
        ground_truth="HelpMeTest visual testing supports mobile, tablet, and desktop viewports.",
        tags=["features", "visual-testing"]
    ),
    GoldenExample(
        id="framework-001",
        question="What testing framework does HelpMeTest use?",
        ground_truth="HelpMeTest uses Robot Framework with Playwright for browser automation.",
        tags=["features", "technical"]
    ),
    GoldenExample(
        id="health-check-001",
        question="How do I set up health monitoring in HelpMeTest?",
        ground_truth="Run: helpmetest health <name> <grace_period>. Grace period options: 30s, 5m, 2h, 1d.",
        tags=["features", "how-to"]
    ),
    GoldenExample(
        id="enterprise-001",
        question="What does the Enterprise plan include?",
        ground_truth="Enterprise includes 10-second monitoring intervals, SSO, priority support, QA team outsourcing, and custom features.",
        tags=["pricing", "enterprise"]
    ),
]

def save_dataset(examples: list[GoldenExample], path: str = "eval-dataset.json"):
    with open(path, "w") as f:
        json.dump([e.to_dict() for e in examples], f, indent=2)

def load_dataset(path: str = "eval-dataset.json") -> list[GoldenExample]:
    with open(path) as f:
        return [GoldenExample(**item) for item in json.load(f)]
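
Since quality matters more than quantity here, it can help to lint the dataset before each run. The sketch below works on plain dicts in the shape produced by `to_dict()`; the helper names `validate_examples` and `filter_by_tag` are hypothetical.

```python
from collections import Counter


def validate_examples(examples: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means no issues were found."""
    problems = []
    id_counts = Counter(e.get("id") for e in examples)
    for example_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id: {example_id}")
    for e in examples:
        if not (e.get("question") or "").strip():
            problems.append(f"{e.get('id')}: empty question")
        if not (e.get("ground_truth") or "").strip():
            problems.append(f"{e.get('id')}: empty ground_truth")
    return problems


def filter_by_tag(examples: list[dict], tag: str) -> list[dict]:
    """Select the subset of examples carrying a given tag."""
    return [e for e in examples if tag in (e.get("tags") or [])]


sample = [
    {"id": "a", "question": "Q?", "ground_truth": "A.", "tags": ["core"]},
    {"id": "a", "question": "Q2?", "ground_truth": "", "tags": None},
]
print(validate_examples(sample))
```

With the dataclasses above, you could pass `[e.to_dict() for e in GOLDEN_EXAMPLES]`. Tag filtering is handy for running only the `core` examples on pull requests.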

Step 3: Build the Evaluation Runner

# evaluator.py
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

from dataset import GoldenExample
from metrics import EvalResult, TestCase

@dataclass
class EvalRun:
    example_id: str
    question: str
    answer: str
    context: list[str]
    results: list[EvalResult]
    
    @property
    def passed(self) -> bool:
        return all(r.passed for r in self.results)
    
    @property
    def scores(self) -> dict[str, float]:
        return {r.metric_name: r.score for r in self.results}


class Evaluator:
    def __init__(self, app_fn: Callable, metrics: list[Callable]):
        """
        app_fn: callable that takes a question -> (answer, retrieved_context)
        metrics: list of metric functions that take TestCase -> EvalResult
        """
        self.app_fn = app_fn
        self.metrics = metrics
    
    def evaluate_example(self, example: GoldenExample) -> EvalRun:
        # Run the app
        answer, context = self.app_fn(example.question)
        
        # Build test case
        test_case = TestCase(
            id=example.id,
            question=example.question,
            context=context,
            answer=answer,
            ground_truth=example.ground_truth
        )
        
        # Run all metrics
        results = []
        for metric_fn in self.metrics:
            try:
                result = metric_fn(test_case)
                results.append(result)
            except Exception as e:
                results.append(EvalResult(
                    metric_name=metric_fn.__name__,
                    score=0.0,
                    passed=False,
                    reason=f"Metric error: {e}",
                    threshold=0.0
                ))
        
        return EvalRun(
            example_id=example.id,
            question=example.question,
            answer=answer,
            context=context,
            results=results
        )
    
    def evaluate_dataset(
        self, 
        dataset: list[GoldenExample],
        max_workers: int = 4
    ) -> list[EvalRun]:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            runs = list(executor.map(self.evaluate_example, dataset))
        return runs
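
In the runner above, metric errors are caught per metric, but an exception raised by `app_fn` propagates out of `executor.map` and ends the whole run. If you would rather record the failure and continue, one option is a guarded map like the sketch below. `map_with_errors` and `flaky_app` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def map_with_errors(fn: Callable[[Any], Any], items: list, max_workers: int = 4) -> list[tuple]:
    """Run fn over items in parallel; return (item, result, error) triples in input order."""
    def guarded(item):
        try:
            return item, fn(item), None
        except Exception as exc:  # record the failure for this item only
            return item, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(guarded, items))


def flaky_app(question: str) -> str:
    """Hypothetical app that fails on one input, standing in for a timeout or API error."""
    if "crash" in question:
        raise RuntimeError("upstream timeout")
    return f"answer to: {question}"


outcomes = map_with_errors(flaky_app, ["q1", "please crash", "q3"])
for item, result, error in outcomes:
    print(item, "->", result if error is None else f"ERROR: {error}")
```

`executor.map` preserves input order, so each triple lines up with the example that produced it even though the calls run concurrently.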

Step 4: Store the Results

# results.py
import sqlite3
from datetime import datetime

from evaluator import EvalRun

class ResultsStore:
    def __init__(self, db_path: str = "eval-results.db"):
        self.conn = sqlite3.connect(db_path)
        self._create_tables()
    
    def _create_tables(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS runs (
                run_id TEXT PRIMARY KEY,
                timestamp TEXT,
                app_version TEXT,
                example_id TEXT,
                question TEXT,
                answer TEXT,
                passed INTEGER
            )
        """)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS scores (
                run_id TEXT,
                metric_name TEXT,
                score REAL,
                passed INTEGER,
                reason TEXT,
                FOREIGN KEY (run_id) REFERENCES runs(run_id)
            )
        """)
        self.conn.commit()
    
    def save_run(self, run: EvalRun, app_version: str, run_prefix: str = ""):
        # Take one timestamp so the run_id and the timestamp column agree.
        timestamp = datetime.utcnow().isoformat()
        run_id = f"{run_prefix}{run.example_id}-{timestamp}"
        
        self.conn.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?)",
            (run_id, timestamp, app_version, run.example_id, 
             run.question, run.answer, int(run.passed))
        )
        
        for result in run.results:
            self.conn.execute(
                "INSERT INTO scores VALUES (?, ?, ?, ?, ?)",
                (run_id, result.metric_name, result.score, 
                 int(result.passed), result.reason)
            )
        
        self.conn.commit()
    
    def get_metric_trend(self, metric_name: str, limit: int = 20) -> list[dict]:
        """Average score per app version for a metric, most recent versions first."""
        # Group by app_version: each run_id covers a single example, so grouping
        # by run_id would average over one row and hide the trend.
        cursor = self.conn.execute("""
            SELECT MAX(r.timestamp) AS last_seen, r.app_version, AVG(s.score) AS avg_score
            FROM scores s
            JOIN runs r ON s.run_id = r.run_id
            WHERE s.metric_name = ?
            GROUP BY r.app_version
            ORDER BY last_seen DESC
            LIMIT ?
        """, (metric_name, limit))
        
        return [{"timestamp": row[0], "version": row[1], "score": row[2]} 
                for row in cursor.fetchall()]
    
    def get_failing_examples(self, metric_name: str, threshold: float) -> list[dict]:
        cursor = self.conn.execute("""
            SELECT r.example_id, r.question, r.answer, s.score, s.reason
            FROM scores s
            JOIN runs r ON s.run_id = r.run_id
            WHERE s.metric_name = ? AND s.score < ?
            ORDER BY s.score ASC
            LIMIT 20
        """, (metric_name, threshold))
        
        cols = ["example_id", "question", "answer", "score", "reason"]
        return [dict(zip(cols, row)) for row in cursor.fetchall()]
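
Once scores are in SQLite, comparing two app versions is a single query. The sketch below recreates the same two tables in memory, inserts made-up rows for two hypothetical versions (`v1`, `v2`), and lists the examples whose score dropped. `find_regressions` and its `min_drop` parameter are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runs (run_id TEXT PRIMARY KEY, timestamp TEXT, app_version TEXT,
                       example_id TEXT, question TEXT, answer TEXT, passed INTEGER);
    CREATE TABLE scores (run_id TEXT, metric_name TEXT, score REAL,
                         passed INTEGER, reason TEXT);
""")

# Made-up rows for two app versions, purely for illustration.
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("r1", "2024-01-01T00:00:00", "v1", "pricing-001", "q", "a", 1),
    ("r2", "2024-01-02T00:00:00", "v2", "pricing-001", "q", "a", 0),
])
conn.executemany("INSERT INTO scores VALUES (?, ?, ?, ?, ?)", [
    ("r1", "faithfulness", 0.9, 1, "ok"),
    ("r2", "faithfulness", 0.6, 0, "invented a price"),
])


def find_regressions(conn, metric_name, old_version, new_version, min_drop=0.1):
    """Return (example_id, old_score, new_score) where the score fell by at least min_drop."""
    cursor = conn.execute("""
        SELECT o.example_id, o.score, n.score
        FROM (SELECT r.example_id, AVG(s.score) AS score
              FROM scores s JOIN runs r ON s.run_id = r.run_id
              WHERE s.metric_name = ? AND r.app_version = ?
              GROUP BY r.example_id) o
        JOIN (SELECT r.example_id, AVG(s.score) AS score
              FROM scores s JOIN runs r ON s.run_id = r.run_id
              WHERE s.metric_name = ? AND r.app_version = ?
              GROUP BY r.example_id) n
          ON o.example_id = n.example_id
        WHERE o.score - n.score >= ?
        ORDER BY o.score - n.score DESC
    """, (metric_name, old_version, metric_name, new_version, min_drop))
    return cursor.fetchall()


regressions = find_regressions(conn, "faithfulness", "v1", "v2")
print(regressions)
```

Per-example comparison tends to be more actionable than comparing averages, because a large drop on one example can hide behind small gains elsewhere.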

Step 5: Reporting

# report.py
import statistics

from tabulate import tabulate

from evaluator import EvalRun

def print_report(runs: list[EvalRun]):
    """Print a human-readable evaluation report."""
    
    print(f"\n{'='*60}")
    print(f"EVALUATION REPORT — {len(runs)} examples")
    print(f"{'='*60}\n")
    
    # Aggregate by metric
    metric_results = {}
    for run in runs:
        for result in run.results:
            metric_results.setdefault(result.metric_name, []).append(result)
    
    table_data = []
    for metric, results in metric_results.items():
        scores = [r.score for r in results]
        avg = statistics.mean(scores)
        std = statistics.stdev(scores) if len(scores) > 1 else 0
        # Use each result's own pass flag; thresholds differ per metric,
        # so a hard-coded 0.7 cutoff would misreport the pass rate.
        pass_rate = sum(1 for r in results if r.passed) / len(results)
        table_data.append([metric, f"{avg:.3f}", f"±{std:.3f}", f"{pass_rate:.0%}"])
    
    print(tabulate(
        table_data,
        headers=["Metric", "Avg Score", "Std Dev", "Pass Rate"],
        tablefmt="rounded_outline"
    ))
    
    # Per-example failures
    failed_runs = [r for r in runs if not r.passed]
    if failed_runs:
        print(f"\n{len(failed_runs)} FAILING EXAMPLES:\n")
        for run in failed_runs:
            print(f"  [{run.example_id}] {run.question[:60]}")
            for result in run.results:
                if not result.passed:
                    print(f"    ✗ {result.metric_name}: {result.score:.2f} - {result.reason}")
    else:
        print("\nAll examples passed.")
    
    print()
    return len(failed_runs) == 0

Step 6: Wire It Together

# eval.py — the entry point
import sys
from metrics import (
    check_response_length,
    check_no_forbidden_phrases,
    semantic_similarity,
)
from dataset import GOLDEN_EXAMPLES
from evaluator import Evaluator
from results import ResultsStore
from report import print_report

# Import your app
from myapp.chatbot import answer_question, retrieve_context

def run_app(question: str) -> tuple[str, list[str]]:
    context = retrieve_context(question)
    answer = answer_question(question, context)
    return answer, context

# Compose metrics
# Rule-based (fast, free)
rule_metrics = [
    check_response_length,
    check_no_forbidden_phrases,
]

# Semantic (cheap)
from functools import partial
similarity_metric = partial(semantic_similarity, threshold=0.70)
similarity_metric.__name__ = "semantic_similarity"

# LLM-judged (accurate but costs money)
from metrics import llm_judge

faithfulness_metric = partial(
    llm_judge,
    criterion="The answer only makes claims that are directly supported by the provided context. It does not invent facts.",
    metric_name="faithfulness",
    threshold=0.75
)
faithfulness_metric.__name__ = "faithfulness"

relevancy_metric = partial(
    llm_judge,
    criterion="The answer directly addresses the user's question. It is not off-topic or evasive.",
    metric_name="answer_relevancy",
    threshold=0.75
)
relevancy_metric.__name__ = "answer_relevancy"

# Run evaluation
evaluator = Evaluator(
    app_fn=run_app,
    metrics=[*rule_metrics, similarity_metric, faithfulness_metric, relevancy_metric]
)

store = ResultsStore()
app_version = sys.argv[1] if len(sys.argv) > 1 else "dev"

print(f"Evaluating {len(GOLDEN_EXAMPLES)} examples (version: {app_version})...")
runs = evaluator.evaluate_dataset(GOLDEN_EXAMPLES, max_workers=4)

# Save results
for run in runs:
    store.save_run(run, app_version)

# Print report and exit
all_passed = print_report(runs)
sys.exit(0 if all_passed else 1)

Step 7: CI Integration

# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  push:
    branches: [main]
  pull_request:
    paths:
      - 'src/**'
      - 'prompts/**'
      - 'eval/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      
      - run: pip install -r requirements.txt
      
      - name: Run evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval.py ${{ github.sha }}
      
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results.db

Cost Management

LLM-judged metrics cost money on every run. Keep the bill down by running cheap metrics on each PR and the LLM judges only on a schedule:

# Run cheap metrics on every PR, expensive ones nightly
import os

ENV = os.environ.get("EVAL_MODE", "fast")

if ENV == "fast":
    # PR checks: rule-based + semantic only
    metrics = [*rule_metrics, similarity_metric]
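elif ENV == "full":
    # Nightly: full suite including LLM judges
    metrics = [*rule_metrics, similarity_metric, faithfulness_metric, relevancy_metric]
else:
    # Fail loudly on a typo instead of leaving `metrics` undefined
    raise ValueError(f"Unknown EVAL_MODE: {ENV!r} (expected 'fast' or 'full')")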

Nightly cron in GitHub Actions:

on:
  schedule:
    - cron: "0 3 * * *"  # 3am UTC daily
  push:
    branches: [main]

jobs:
  evaluate:
    env:
      EVAL_MODE: ${{ github.event_name == 'schedule' && 'full' || 'fast' }}

Estimated Costs

For 50 golden examples with full evaluation:

Metric type               Cost per run
Rule-based                $0
Embedding similarity      ~$0.001
LLM judge (GPT-4o-mini)   ~$0.05
LLM judge (GPT-4o)        ~$0.30

Full nightly runs on GPT-4o: ~$0.30 per run × 30 runs ≈ $9/month. Cheap compared to the cost of a quality regression.
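
The monthly figure is simple arithmetic on the per-run estimates. The sketch below reproduces it, assuming one full run per night over a 30-day month; the per-run costs are the rough estimates quoted above, not measured prices, and `monthly_cost` is a hypothetical helper.

```python
# Per-run cost estimates quoted above (50 golden examples per run), in USD.
COST_PER_RUN = {
    "rule_based": 0.0,
    "embedding_similarity": 0.001,
    "llm_judge_gpt4o_mini": 0.05,
    "llm_judge_gpt4o": 0.30,
}


def monthly_cost(metric_types: list[str], runs_per_month: int) -> float:
    """Sum the per-run cost of the chosen metric types and scale by run count."""
    return sum(COST_PER_RUN[m] for m in metric_types) * runs_per_month


# Assumption: one full run per night, 30 nights per month.
nightly_full = monthly_cost(["rule_based", "embedding_similarity", "llm_judge_gpt4o"], 30)
print(f"${nightly_full:.2f}/month")
```

Swapping in your own run frequency and dataset size gives a quick budget check before you enable the full suite on every PR.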


What the Frameworks Give You for Free

Before committing to a custom build, re-evaluate what you'd miss:

Feature              Build time   Frameworks provide
Core RAG metrics     ~2 days      DeepEval, Ragas (instant)
Dataset management   ~1 day       LangSmith (full-featured)
Dashboard            ~3 days      TruLens (built-in)
Red-teaming          ~5 days      Promptfoo (comprehensive)
Custom metrics       0            All support custom evaluators

The right answer is often: use a framework for 80%, extend with custom metrics for the remaining 20%.


Next Steps

  • Start with rule-based metrics — they're free and catch obvious failures
  • Add one LLM-judged metric for faithfulness or relevancy
  • Curate 20 golden examples covering your key use cases
  • Wire into CI with fast mode — run expensive judges only nightly
  • Check DeepEval before rolling your own — its G-Eval covers most custom criteria without code

For teams who want to run these evaluation pipelines on a schedule with alerting and trend dashboards — without building and maintaining the infrastructure — HelpMeTest runs your eval scripts as scheduled health checks and alerts when scores drop below threshold.
