AgentBench and LLM Agent Evaluation: Setting Up Benchmarks & Custom Harnesses

Evaluating LLM agents is one of the hardest problems in AI engineering. Unlike traditional software, agents are non-deterministic — the same input can produce different outputs on different runs. AgentBench and related evaluation frameworks give you systematic approaches to measure agent capability and track improvements over time. This guide covers the practical side of setting up and running agent evaluations.

Why Standard Unit Tests Aren't Enough for LLM Agents

When you test a traditional software function, you expect deterministic outputs. When you test an LLM agent:

  • The same prompt may yield different answers across runs
  • "Correct" is often subjective or requires human judgment
  • Performance degrades as context windows fill up
  • Models change over time (updates, fine-tuning, API changes)
  • Agent behavior can shift with temperature, system prompts, or tool availability

You need evaluation harnesses — frameworks designed specifically for measuring LLM agent performance at scale, with statistical aggregation across many runs.
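
As a minimal sketch of what "statistical aggregation across many runs" can look like, the snippet below runs one task repeatedly and reports the mean and standard deviation of the pass/fail outcomes. The `flaky_agent` function is a hypothetical stand-in for a real agent call, and its 70% success probability is invented purely for illustration.

```python
import random
import statistics

def flaky_agent(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM agent: answers correctly about 70% of the time."""
    return "4" if rng.random() < 0.7 else "5"

def evaluate_repeated(prompt: str, expected: str, runs: int, seed: int = 0) -> dict:
    """Run the same task several times and aggregate the pass/fail outcomes."""
    rng = random.Random(seed)  # Seeded so the simulation itself is repeatable
    outcomes = [
        1.0 if flaky_agent(prompt, rng) == expected else 0.0
        for _ in range(runs)
    ]
    return {
        "runs": runs,
        "mean": statistics.mean(outcomes),
        # stdev needs at least two samples
        "stdev": statistics.stdev(outcomes) if runs > 1 else 0.0,
    }

print(evaluate_repeated("What is 2 + 2?", "4", runs=50))
```

A single run of this task would report either 0 or 1; the aggregated mean is what tells you how reliable the agent is on it.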

What Is AgentBench?

AgentBench is an open-source benchmark framework designed to evaluate LLM agents on a diverse set of real-world tasks. It provides:

  • Task environments — OS interactions, web browsing, database queries, knowledge graph navigation
  • Standardized evaluation metrics — success rate, partial credit, efficiency scores
  • Reproducible evaluation — Docker-based environments ensure consistent conditions
  • Multi-model comparison — run the same tasks against different models or agent implementations

AgentBench covers 8 distinct environments:

  1. Operating system (bash commands)
  2. Database (SQL queries)
  3. Knowledge graph (SPARQL, entity traversal)
  4. Digital card game
  5. Lateral thinking puzzles
  6. House-holding (simulated household tasks)
  7. Web shopping
  8. Web browser interaction

Setting Up AgentBench

Installation

# Clone AgentBench
git clone https://github.com/THUDM/AgentBench.git
cd AgentBench

# Install dependencies
pip install -r requirements.txt

# Pull Docker images (used by the OS and database environments)
docker pull ubuntu:22.04
docker pull mysql:8.0

Configuration

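AgentBench is configured through YAML files. The layout below is a simplified illustration; the exact schema and file locations differ between AgentBench releases, so compare it against the configs directory in your checkout: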

# configs/eval_my_agent.yaml
agent:
  type: "openai"
  config:
    model: "gpt-4-turbo"
    temperature: 0.0  # Use 0 for reproducibility
    max_tokens: 2048
    api_key: "${OPENAI_API_KEY}"

tasks:
  - name: "os"
    config_path: "configs/tasks/os.yaml"
    max_workers: 4
    
  - name: "db"
    config_path: "configs/tasks/db.yaml"
    max_workers: 2

evaluation:
  output_dir: "results/my_agent_eval"
  num_samples: 100  # Tasks per environment
  repeat: 3  # Run each task 3 times for statistical stability
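
The `${OPENAI_API_KEY}` placeholder only works if something substitutes the environment variable before the value is used. Whether your AgentBench version does this for you depends on its config loader, so the following is a sketch of pre-processing the file text yourself. The function name `expand_env_placeholders` is hypothetical and not part of AgentBench.

```python
import os
import re

# Matches ${NAME} where NAME is a valid environment variable identifier
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env_placeholders(text: str, env=None) -> str:
    """Replace ${NAME} placeholders with values from the environment.

    Raises KeyError for an unset variable, so a missing API key fails
    at load time rather than in the middle of a benchmark run.
    """
    env = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, text)

raw = 'api_key: "${OPENAI_API_KEY}"'
print(expand_env_placeholders(raw, env={"OPENAI_API_KEY": "sk-example"}))
```

Keeping the key in the environment rather than in the YAML file also means the config can be committed without leaking credentials.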

Running a Benchmark

# Run full benchmark
python main.py --config configs/eval_my_agent.yaml

# Run specific environment only
python main.py --config configs/eval_my_agent.yaml --task os

# Dry run to verify setup
python main.py --config configs/eval_my_agent.yaml --dry-run

Understanding AgentBench Metrics

AgentBench reports several metrics. Understanding them matters for interpreting results correctly.

Success Rate

The fraction of tasks completed successfully. This is the primary metric but needs context:

Success Rate = Successful Tasks / Total Tasks

A 60% success rate on OS tasks means the agent completed 60% of the OS task instances. The single number hides the distribution, though: it does not tell you whether the failures are concentrated in hard tasks or spread across easy ones.

Partial Credit Scoring

Some tasks allow partial credit. An agent that nearly solves a database query gets more credit than one that produces a syntax error:

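# Illustrative sketch of partial-credit scoring for DB tasks.
# Assumes result objects expose `.columns` and `.rows` (rows as hashable tuples).
def calculate_db_score(predicted_result, expected_result):
    if predicted_result == expected_result:
        return 1.0  # Full credit

    # Partial credit requires matching column names
    if set(predicted_result.columns) == set(expected_result.columns):
        expected_rows = set(expected_result.rows)
        if not expected_rows:
            return 0.5  # Columns match, but there are no rows to compare
        # Fraction of expected rows that the prediction returned
        overlap = len(set(predicted_result.rows) & expected_rows)
        return 0.5 + (0.5 * overlap / len(expected_rows))

    return 0.0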

Efficiency Score

For tasks with multiple valid solutions, efficiency measures whether the agent solved it in a reasonable number of steps:

Efficiency = (Optimal Steps / Actual Steps) * Success Rate
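
The formula mixes a per-task quantity (the step ratio) with an aggregate one (the success rate). One way to read it, sketched below, is to average the step ratio over successful episodes and scale that by the success rate. This is one interpretation, not a reference implementation; the `Episode` type is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool
    optimal_steps: int  # Fewest steps known to solve the task
    actual_steps: int   # Steps the agent actually took (assumed >= 1)

def efficiency_score(episodes: list[Episode]) -> float:
    """Mean step ratio over successful episodes, scaled by the success rate."""
    successes = [e for e in episodes if e.success]
    if not episodes or not successes:
        return 0.0
    success_rate = len(successes) / len(episodes)
    # Cap each ratio at 1.0 in case the agent beats the recorded optimum
    mean_ratio = sum(
        min(1.0, e.optimal_steps / e.actual_steps) for e in successes
    ) / len(successes)
    return mean_ratio * success_rate

episodes = [
    Episode(success=True, optimal_steps=4, actual_steps=8),
    Episode(success=True, optimal_steps=5, actual_steps=5),
    Episode(success=False, optimal_steps=3, actual_steps=10),
    Episode(success=False, optimal_steps=2, actual_steps=2),
]
print(efficiency_score(episodes))  # 0.375
```

Here the two successful episodes have step ratios of 0.5 and 1.0 (mean 0.75), and half of all episodes succeeded, giving 0.75 × 0.5 = 0.375.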

Building a Custom Evaluation Harness

AgentBench's built-in environments may not cover your specific use case. Here's how to build a custom evaluation harness for your domain.

Defining Your Evaluation Tasks

# evals/tasks/customer_support_tasks.py
from dataclasses import dataclass
from typing import Any, Callable

from evals.scorers import score_intent_extraction  # defined in evals/scorers.py

@dataclass
class EvalTask:
    task_id: str
    input: str                          # What the agent receives
    ground_truth: str | dict            # What the correct answer is
    scorer: Callable[[str, Any], float] # Scores (response, ground_truth) in [0, 1]
    category: str                       # For grouping results
    difficulty: str                     # easy, medium, hard

# Define your evaluation dataset
CUSTOMER_SUPPORT_TASKS = [
    EvalTask(
        task_id="cs-001",
        input="I need to return my order #12345 which arrived broken",
        ground_truth={"intent": "return_request", "order_id": "12345", "reason": "broken"},
        scorer=lambda response, truth: score_intent_extraction(response, truth),
        category="intent_extraction",
        difficulty="easy"
    ),
    EvalTask(
        task_id="cs-002",
        input="Can I change my delivery address after placing the order?",
        ground_truth="address_change_inquiry",
        scorer=lambda response, truth: 1.0 if truth in response.lower() else 0.0,
        category="routing",
        difficulty="easy"
    ),
    # ... more tasks
]

Writing Scorers

Scorers are the heart of your evaluation harness. They map agent responses to numeric scores (0.0–1.0).

# evals/scorers.py
import json
from typing import Any

def score_intent_extraction(response: str, expected: dict) -> float:
    """Scores structured extraction tasks."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    
    total_fields = len(expected)
    correct_fields = sum(
        1 for key, value in expected.items()
        if parsed.get(key) == value
    )
    return correct_fields / total_fields

def score_with_llm_judge(response: str, expected: str, judge_llm) -> float:
    """Uses an LLM to judge whether a response is semantically correct.
    Useful when exact string matching isn't appropriate."""
    
    prompt = f"""Rate whether this agent response correctly answers the task.
    
Task context: {expected}
Agent response: {response}

Rate on a scale of 0 to 10, where:
- 0 = Completely wrong or unhelpful
- 5 = Partially correct
- 10 = Fully correct and well-explained

Return only a number."""
    
    rating = judge_llm.invoke(prompt)
    try:
        score = float(rating.content.strip()) / 10.0
        return max(0.0, min(1.0, score))  # Clamp to [0, 1]
    except ValueError:
        return 0.5  # Default to neutral if parse fails

def score_code_correctness(generated_code: str, test_cases: list[dict]) -> float:
    """Evaluates generated code by running test cases in a subprocess.
    Run this only inside a sandbox: the generated code is untrusted."""
    import os
    import subprocess
    import sys
    import tempfile

    passed = 0
    for test in test_cases:
        # Write the code plus one assertion to a unique temp file
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code)
            f.write(f"\n\nresult = {test['call']}")
            f.write(f"\nassert result == {repr(test['expected'])}, f'Got {{result}}'")
            path = f.name

        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=10
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # A hung test counts as a failure
        finally:
            os.remove(path)

    return passed / len(test_cases) if test_cases else 0.0

Running the Evaluation Harness

# evals/runner.py
import asyncio
import json
from datetime import datetime, timezone
from evals.tasks.customer_support_tasks import CUSTOMER_SUPPORT_TASKS

class EvaluationRunner:
    def __init__(self, agent, tasks, output_file: str):
        self.agent = agent
        self.tasks = tasks
        self.output_file = output_file
        self.results = []
    
    async def run_task(self, task) -> dict:
        # Any agent or scorer failure is recorded as a zero score, not raised
        try:
            response = await self.agent.handle(task.input)
            score = task.scorer(response, task.ground_truth)
        except Exception as e:
            response = f"ERROR: {str(e)}"
            score = 0.0
        
        return {
            "task_id": task.task_id,
            "category": task.category,
            "difficulty": task.difficulty,
            "input": task.input,
            "response": response,
            "score": score,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.now(timezone.utc).isoformat()
        }
    
    async def run_all(self, concurrency: int = 5):
        semaphore = asyncio.Semaphore(concurrency)
        
        async def run_with_limit(task):
            async with semaphore:
                return await self.run_task(task)
        
        self.results = await asyncio.gather(
            *[run_with_limit(task) for task in self.tasks]
        )
        
        self.save_results()
        return self.summarize()
    
    def save_results(self):
        with open(self.output_file, "w") as f:
            json.dump(self.results, f, indent=2)
    
    def summarize(self) -> dict:
        by_category = {}
        by_difficulty = {}
        
        for result in self.results:
            cat = result["category"]
            diff = result["difficulty"]
            
            by_category.setdefault(cat, []).append(result["score"])
            by_difficulty.setdefault(diff, []).append(result["score"])
        
        return {
            # Guard against an empty task list to avoid dividing by zero
            "overall_score": (sum(r["score"] for r in self.results) / len(self.results)) if self.results else 0.0,
            "by_category": {
                cat: sum(scores) / len(scores)
                for cat, scores in by_category.items()
            },
            "by_difficulty": {
                diff: sum(scores) / len(scores)
                for diff, scores in by_difficulty.items()
            },
            "total_tasks": len(self.results),
            "perfect_scores": sum(1 for r in self.results if r["score"] == 1.0)
        }
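
The CI workflow in the next section invokes the runner as `python -m evals.runner` with `--model`, `--tasks`, and `--output` flags, which `runner.py` as shown does not parse. A minimal sketch of the argument parsing follows. The `--concurrency` flag is an assumed addition that mirrors the `run_all` parameter, and wiring the parsed arguments into `EvaluationRunner` is left out because it depends on how your agent is constructed.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for the evaluation runner (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="evals.runner",
        description="Run an agent evaluation suite.",
    )
    parser.add_argument("--model", required=True,
                        help="Model identifier passed to the agent")
    parser.add_argument("--tasks", required=True,
                        help="Name of the task set, e.g. customer_support")
    parser.add_argument("--output", required=True,
                        help="Path of the JSON results file")
    parser.add_argument("--concurrency", type=int, default=5,
                        help="Maximum number of tasks in flight")
    return parser

# Parse an explicit argument list here; a real entry point would omit the
# list so that argparse reads sys.argv instead.
args = build_parser().parse_args([
    "--model", "gpt-4-turbo",
    "--tasks", "customer_support",
    "--output", "results/eval.json",
])
print(args.model, args.tasks, args.output, args.concurrency)
```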

Tracking Performance Over Time

Benchmarks are only useful if you run them consistently and track trends. Build a simple performance tracker:

# evals/tracker.py
import json
import os
from datetime import datetime, timezone

class PerformanceTracker:
    def __init__(self, history_file: str = "eval_history.json"):
        self.history_file = history_file
        self.history = self._load_history()
    
    def _load_history(self) -> list:
        if os.path.exists(self.history_file):
            with open(self.history_file) as f:
                return json.load(f)
        return []
    
    def record(self, summary: dict, model_version: str, commit_sha: str):
        entry = {
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "commit_sha": commit_sha,
            **summary
        }
        self.history.append(entry)
        with open(self.history_file, "w") as f:
            json.dump(self.history, f, indent=2)
    
    def check_regression(self, current_summary: dict, threshold: float = 0.05) -> list:
        """Returns list of regressions (categories that dropped by more than threshold)."""
        if not self.history:
            return []
        
        last_run = self.history[-1]
        regressions = []
        
        for category, score in current_summary["by_category"].items():
            last_score = last_run.get("by_category", {}).get(category, score)
            if last_score - score > threshold:
                regressions.append({
                    "category": category,
                    "previous": last_score,
                    "current": score,
                    "drop": last_score - score
                })
        
        return regressions

Integrating Evals Into CI/CD

Run evaluations automatically on every significant change:

# .github/workflows/eval.yml
name: Agent Evaluation

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * 1'  # Weekly on Monday morning

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      
      - name: Install dependencies
        run: pip install -r requirements.txt
      
      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m evals.runner \
            --model gpt-4-turbo \
            --tasks customer_support \
            --output results/eval_$(date +%Y%m%d).json
      
      - name: Check for regressions
        run: python -m evals.check_regressions --threshold 0.05
      
      - name: Archive results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
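
The workflow calls `python -m evals.check_regressions`, a module the earlier sections do not define. The sketch below shows one possible shape for it, assuming the history file written by `PerformanceTracker` and assuming the current run has already been recorded as the last entry. The `--history` flag is a hypothetical addition.

```python
import argparse
import json

def find_regressions(previous: dict, current: dict, threshold: float) -> list[dict]:
    """Categories whose score dropped by more than `threshold` between two runs."""
    regressions = []
    for category, score in current.get("by_category", {}).items():
        last = previous.get("by_category", {}).get(category)
        if last is not None and last - score > threshold:
            regressions.append(
                {"category": category, "previous": last, "current": score}
            )
    return regressions

def main(argv: list[str]) -> int:
    """Return 1 if the latest run regressed against the one before it, else 0."""
    parser = argparse.ArgumentParser(prog="evals.check_regressions")
    parser.add_argument("--threshold", type=float, default=0.05)
    parser.add_argument("--history", default="eval_history.json")
    args = parser.parse_args(argv)

    with open(args.history) as f:
        history = json.load(f)
    if len(history) < 2:
        print("Not enough history to compare; skipping regression check.")
        return 0

    regressions = find_regressions(history[-2], history[-1], args.threshold)
    for r in regressions:
        print(f"REGRESSION {r['category']}: {r['previous']:.2f} -> {r['current']:.2f}")
    return 1 if regressions else 0
```

To use it as a module, call `sys.exit(main(sys.argv[1:]))` under an `if __name__ == "__main__":` guard. The non-zero exit code is what makes the GitHub Actions step fail when a category regresses.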

LLM-as-Judge: When Human Evaluation Doesn't Scale

For tasks where there's no single correct answer (e.g., "write a helpful response"), use an LLM judge:

# evals/llm_judge.py
import json

from langchain_openai import ChatOpenAI

JUDGE_SYSTEM_PROMPT = """You are an expert evaluator of AI agent responses.
Your job is to assess whether an agent response correctly and helpfully addresses a task.
Be rigorous but fair. Score objectively based on accuracy, completeness, and usefulness."""

class LLMJudge:
    def __init__(self, model="gpt-4o"):
        # Use a different (typically stronger) model as judge
        self.llm = ChatOpenAI(model=model, temperature=0)
    
    async def score(self, task: str, response: str, rubric: str) -> dict:
        prompt = f"""Task: {task}

Agent Response: {response}

Evaluation Rubric: {rubric}

Provide your evaluation as JSON:
{{
  "score": <0.0-1.0>,
  "reasoning": "<brief explanation>",
  "strengths": ["<what the response did well>"],
  "weaknesses": ["<what could be improved>"]
}}"""
        
        result = await self.llm.ainvoke([
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ])
        
        # json.loads raises if the judge returns anything other than bare JSON
        return json.loads(result.content)
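
Judge models do not always return bare JSON; the object is often wrapped in a sentence or a code fence, and the score can fall outside the requested range. The helper below is a sketch of more forgiving parsing. The name `parse_judge_output` is hypothetical, and the approach (take the outermost braces, then clamp the score) is one simple option rather than a complete solution.

```python
import json
import re

def parse_judge_output(text: str) -> dict:
    """Extract the JSON object from a judge reply and clamp its score to [0, 1]."""
    # Greedy match from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge output")
    verdict = json.loads(match.group(0))
    verdict["score"] = max(0.0, min(1.0, float(verdict["score"])))
    return verdict

reply = 'Here is my evaluation:\n{"score": 0.8, "reasoning": "Accurate but terse"}'
print(parse_judge_output(reply))
```

Raising on unparseable output, rather than returning a default score, keeps judge failures visible in the results instead of blending them into the average.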

Key Principles for Agent Evaluation

1. Run evaluations at multiple temperatures. A model that scores 80% at temperature 0 may score 65% at temperature 0.7. Know your production temperature.

2. Sample size matters. 10 tasks is not a benchmark. Aim for 100+ tasks per category for statistically meaningful results.

3. Separate evaluation from unit testing. Evals are expensive (LLM API calls, time). Run them on schedule or against significant changes — not on every commit.

4. Track trends, not just snapshots. A one-time 78% score is less useful than knowing you've improved from 62% to 78% over three months.

5. Use your actual production data. The most valuable eval tasks are based on real user queries that your agent struggled with. Collect failures from production and add them to your eval set.

6. Avoid data contamination. If your training data includes AgentBench tasks, your agent's scores are inflated. Use held-out test sets.
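
To see why sample size matters (principle 2), it helps to put a confidence interval around a success rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. Treating every task as an independent pass/fail trial is a simplifying assumption.

```python
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 for about 95% confidence)."""
    if total == 0:
        return (0.0, 1.0)  # No data: the rate could be anything
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# The same 80% observed success rate at two sample sizes
for successes, total in [(8, 10), (80, 100)]:
    low, high = wilson_interval(successes, total)
    print(f"{successes}/{total}: {low:.2f} to {high:.2f}")
```

With 10 tasks the interval is wide enough that a change of several points between runs is indistinguishable from noise; with 100 tasks it is considerably narrower.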

Continuous Monitoring Beyond CI Evals

Scheduled evaluations catch regressions between releases. But production LLM agent performance can also degrade due to API changes, model updates by providers, or slow shifts in user query patterns.

Setting up end-to-end monitoring that regularly exercises your agent with known inputs and validates the responses against expected outputs, separately from your CI eval suite, gives you an ongoing signal between eval runs. Platforms like HelpMeTest can run these validation workflows on a schedule with automated alerting, so you find out about performance degradation before your users do.
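
A scheduled probe of this kind can be very small. The sketch below sends a fixed set of known inputs to the agent and collects any that fail a simple substring check. The `run_canaries` helper and the stub agent are hypothetical, and a real deployment would call the production endpoint and forward failures to an alerting system.

```python
from typing import Callable

def run_canaries(
    agent: Callable[[str], str],
    canaries: list[tuple[str, str]],
) -> list[dict]:
    """Send known prompts to the agent and return the ones that failed.

    Each canary is (prompt, text the reply must contain, case-insensitive).
    """
    failures = []
    for prompt, must_contain in canaries:
        try:
            reply = agent(prompt)
        except Exception as exc:
            # An outage or API change counts as a failure too
            failures.append({"prompt": prompt, "error": repr(exc)})
            continue
        if must_contain.lower() not in reply.lower():
            failures.append({"prompt": prompt, "reply": reply})
    return failures

def stub_agent(prompt: str) -> str:
    """Hypothetical agent used only to demonstrate the probe."""
    return "Routing to: address_change_inquiry" if "address" in prompt else "unknown"

canaries = [
    ("Can I change my delivery address?", "address_change_inquiry"),
    ("I want a refund for order #12345", "return_request"),
]
print(run_canaries(stub_agent, canaries))
```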

Summary

Evaluating LLM agents requires purpose-built tools and frameworks:

  • AgentBench for standardized benchmarks across common task types
  • Custom evaluation harnesses for domain-specific task evaluation
  • LLM-as-judge for tasks without clear-cut correct answers
  • Performance tracking to detect regressions over time
  • CI/CD integration for automated evaluation on significant changes

Start with a small but representative eval set, write robust scorers, and run evaluations on a schedule. The signal from consistent evaluation is worth more than any single benchmark score.
