AgentBench and LLM Agent Evaluation: Setting Up Benchmarks & Custom Harnesses
Evaluating LLM agents is one of the hardest problems in AI engineering. Unlike traditional software, agents are non-deterministic — the same input can produce different outputs on different runs. AgentBench and related evaluation frameworks give you systematic approaches to measure agent capability and track improvements over time. This guide covers the practical side of setting up and running agent evaluations.
Why Standard Unit Tests Aren't Enough for LLM Agents
When you test a traditional software function, you expect deterministic outputs. When you test an LLM agent:
- The same prompt may yield different answers across runs
- "Correct" is often subjective or requires human judgment
- Performance degrades as context windows fill up
- Models change over time (updates, fine-tuning, API changes)
- Agent behavior can shift with temperature, system prompts, or tool availability
You need evaluation harnesses — frameworks designed specifically for measuring LLM agent performance at scale, with statistical aggregation across many runs.
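As a minimal sketch of what "statistical aggregation across many runs" can look like, the snippet below averages the scores from several repeated runs of the same task set and reports the spread between runs. The function name `aggregate_runs` and the list-of-lists input layout are assumptions made for illustration, not part of AgentBench.

```python
import math
import statistics


def aggregate_runs(scores_per_run: list[list[float]]) -> dict:
    """Aggregate task scores from repeated runs of the same eval set.

    scores_per_run[i][j] is the score of task j on run i (hypothetical layout).
    """
    if not scores_per_run or not all(scores_per_run):
        raise ValueError("need at least one run, each with at least one task")
    # Mean score of each individual run
    run_means = [statistics.fmean(run) for run in scores_per_run]
    mean = statistics.fmean(run_means)
    # Sample standard deviation across runs; reported as 0.0 for a single run
    stdev = statistics.stdev(run_means) if len(run_means) > 1 else 0.0
    # Standard error of the mean across runs
    stderr = stdev / math.sqrt(len(run_means))
    return {"mean": mean, "stdev": stdev, "stderr": stderr, "runs": len(run_means)}


# Three runs of the same four tasks (1.0 = pass, 0.0 = fail)
example_runs = [
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
print(aggregate_runs(example_runs))
```

Reporting the spread alongside the mean makes it easier to tell a real change in agent quality from ordinary run-to-run noise.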
What Is AgentBench?
AgentBench is an open-source benchmark framework designed to evaluate LLM agents on a diverse set of real-world tasks. It provides:
- Task environments — OS interactions, web browsing, database queries, knowledge graph navigation
- Standardized evaluation metrics — success rate, partial credit, efficiency scores
- Reproducible evaluation — Docker-based environments ensure consistent conditions
- Multi-model comparison — run the same tasks against different models or agent implementations
AgentBench covers 8 distinct environments:
- Operating system (bash commands)
- Database (SQL queries)
- Knowledge graph (SPARQL, entity traversal)
- Digital card game
- Lateral thinking puzzles
- House-holding (simulated household tasks)
- Web shopping
- Web browser interaction
Setting Up AgentBench
Installation
# Clone AgentBench
git clone https://github.com/THUDM/AgentBench.git
cd AgentBench
# Install dependencies
pip install -r requirements.txt
# Set up Docker environments (required for OS and web tasks)
docker pull ubuntu:22.04
docker pull mysql:8.0

Configuration
AgentBench uses YAML configuration files:
# configs/eval_my_agent.yaml
agent:
  type: "openai"
  config:
    model: "gpt-4-turbo"
    temperature: 0.0  # Use 0 to minimize (not eliminate) run-to-run variance
    max_tokens: 2048
    api_key: "${OPENAI_API_KEY}"
tasks:
  - name: "os"
    config_path: "configs/tasks/os.yaml"
    max_workers: 4
  - name: "db"
    config_path: "configs/tasks/db.yaml"
    max_workers: 2
evaluation:
  output_dir: "results/my_agent_eval"
  num_samples: 100  # Tasks per environment
  repeat: 3  # Run each task 3 times for statistical stability

Running a Benchmark
# Run full benchmark
python main.py --config configs/eval_my_agent.yaml
# Run specific environment only
python main.py --config configs/eval_my_agent.yaml --task os
# Dry run to verify setup
python main.py --config configs/eval_my_agent.yaml --dry-run

Understanding AgentBench Metrics
AgentBench reports several metrics. Understanding them matters for interpreting results correctly.
Success Rate
The fraction of tasks completed successfully. This is the primary metric but needs context:
Success Rate = Successful Tasks / Total Tasks
A 60% success rate on OS tasks means the agent correctly executed bash commands in 60% of test cases. But this single number hides important distribution information: is it failing on hard tasks or easy ones?
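One way to see what a single success rate hides is to break it down by difficulty tier. The sketch below assumes each result is a dict with a `difficulty` label and a boolean `success` flag; those field names and the helper `success_rate_by_group` are illustrative, not AgentBench's output format.

```python
from collections import defaultdict


def success_rate_by_group(results: list[dict], key: str = "difficulty") -> dict[str, float]:
    """Success rate per group, e.g. per difficulty tier."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        wins[r[key]] += int(r["success"])
    return {group: wins[group] / totals[group] for group in totals}


# Hypothetical results: 60% overall, but almost all failures are on hard tasks
results = (
    [{"difficulty": "easy", "success": True}] * 5
    + [{"difficulty": "hard", "success": True}] * 1
    + [{"difficulty": "hard", "success": False}] * 4
)
overall = sum(r["success"] for r in results) / len(results)
print(overall, success_rate_by_group(results))
```

Here the overall rate is 0.6, while the per-tier view shows 1.0 on easy tasks and 0.2 on hard ones, which points to a very different fix than a uniform 60% would.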
Partial Credit Scoring
Some tasks allow partial credit. An agent that nearly solves a database query gets more credit than one that produces a syntax error:
# Illustrative sketch: one way a partial-credit scorer for DB tasks could work
# (this may not match AgentBench's actual implementation)
def calculate_db_score(predicted_result, expected_result):
    if predicted_result == expected_result:
        return 1.0  # Full credit
    # Check column names match
    if set(predicted_result.columns) == set(expected_result.columns):
        # Check row overlap
        overlap = len(set(predicted_result.rows) & set(expected_result.rows))
        total = len(expected_result.rows)
        if total == 0:
            return 0.5  # Columns match, but there are no expected rows to compare
        return 0.5 + (0.5 * overlap / total)
    return 0.0

Efficiency Score
For tasks with multiple valid solutions, efficiency measures whether the agent solved it in a reasonable number of steps:
Efficiency = (Optimal Steps / Actual Steps) * Success Rate

Building a Custom Evaluation Harness
AgentBench's built-in environments may not cover your specific use case. Here's how to build a custom evaluation harness for your domain.
Defining Your Evaluation Tasks
# evals/tasks/customer_support_tasks.py
from dataclasses import dataclass
from typing import Any, Callable

from evals.scorers import score_intent_extraction


@dataclass
class EvalTask:
    task_id: str
    input: str  # What the agent receives
    ground_truth: str | dict  # What the correct answer is
    scorer: Callable[[str, Any], float]  # How to score the agent's response
    category: str  # For grouping results
    difficulty: str  # easy, medium, hard


# Define your evaluation dataset
CUSTOMER_SUPPORT_TASKS = [
    EvalTask(
        task_id="cs-001",
        input="I need to return my order #12345 which arrived broken",
        ground_truth={"intent": "return_request", "order_id": "12345", "reason": "broken"},
        scorer=score_intent_extraction,
        category="intent_extraction",
        difficulty="easy"
    ),
    EvalTask(
        task_id="cs-002",
        input="Can I change my delivery address after placing the order?",
        ground_truth="address_change_inquiry",
        scorer=lambda response, truth: 1.0 if truth in response.lower() else 0.0,
        category="routing",
        difficulty="easy"
    ),
    # ... more tasks
]

Writing Scorers
Scorers are the heart of your evaluation harness. They map agent responses to numeric scores (0.0–1.0).
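As a small illustration of that contract, here is a sketch of a scorer that gives partial credit for free-text answers using token-level F1. The name `score_token_f1` and the whitespace tokenization are assumptions chosen for brevity; real harnesses often normalize punctuation as well.

```python
from collections import Counter


def score_token_f1(response: str, expected: str) -> float:
    """Token-level F1 between response and expected answer, in [0.0, 1.0]."""
    pred = response.lower().split()
    gold = expected.lower().split()
    if not pred or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss
        return 1.0 if pred == gold else 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(score_token_f1("refund order", "refund the order"))
```

Because the function takes `(response, expected)` and returns a float between 0 and 1, it can be used anywhere a string-based scorer is expected.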
# evals/scorers.py
import json
from typing import Any


def score_intent_extraction(response: str, expected: dict[str, Any]) -> float:
    """Scores structured extraction tasks as the fraction of expected fields matched."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    # Valid JSON that is not an object (e.g. a list) cannot match any field
    if not isinstance(parsed, dict) or not expected:
        return 0.0
    correct_fields = sum(
        1 for key, value in expected.items()
        if parsed.get(key) == value
    )
    return correct_fields / len(expected)


def score_with_llm_judge(response: str, expected: str, judge_llm) -> float:
    """Uses an LLM to judge whether a response is semantically correct.
    Useful when exact string matching isn't appropriate."""
    prompt = f"""Rate whether this agent response correctly answers the task.
Task context: {expected}
Agent response: {response}
Rate on a scale of 0 to 10, where:
- 0 = Completely wrong or unhelpful
- 5 = Partially correct
- 10 = Fully correct and well-explained
Return only a number."""
    rating = judge_llm.invoke(prompt)
    try:
        score = float(rating.content.strip()) / 10.0
        return max(0.0, min(1.0, score))  # Clamp to [0, 1]
    except ValueError:
        return 0.5  # Default to neutral if parse fails
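

def score_code_correctness(generated_code: str, test_cases: list[dict]) -> float:
    """Evaluates generated code by running test cases.

    This executes untrusted code: run it only inside a sandbox or container."""
    import os
    import subprocess
    import sys
    import tempfile
    passed = 0
    for test in test_cases:
        # Write code plus one assertion to a unique temp file (safe under concurrency)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code)
            f.write(f"\n\nresult = {test['call']}")
            f.write(f"\nassert result == {repr(test['expected'])}, f'Got {{result}}'")
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=10
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # A test that hangs counts as a failure
        finally:
            os.unlink(path)
    return passed / len(test_cases) if test_cases else 0.0

Running the Evaluation Harness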
# evals/runner.py
import asyncio
import json
from datetime import datetime, timezone

from evals.tasks.customer_support_tasks import CUSTOMER_SUPPORT_TASKS


class EvaluationRunner:
    def __init__(self, agent, tasks, output_file: str):
        self.agent = agent
        self.tasks = tasks
        self.output_file = output_file
        self.results = []

    async def run_task(self, task) -> dict:
        try:
            response = await self.agent.handle(task.input)
            score = task.scorer(response, task.ground_truth)
        except Exception as e:
            # An agent or scorer failure is recorded as a zero score, not a crash
            response = f"ERROR: {e}"
            score = 0.0
        return {
            "task_id": task.task_id,
            "category": task.category,
            "difficulty": task.difficulty,
            "input": task.input,
            "response": response,
            "score": score,
            # datetime.utcnow() is deprecated as of Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat()
        }

    async def run_all(self, concurrency: int = 5):
        # The semaphore caps how many agent calls are in flight at once
        semaphore = asyncio.Semaphore(concurrency)

        async def run_with_limit(task):
            async with semaphore:
                return await self.run_task(task)

        self.results = await asyncio.gather(
            *[run_with_limit(task) for task in self.tasks]
        )
        self.save_results()
        return self.summarize()

    def save_results(self):
        with open(self.output_file, "w") as f:
            json.dump(self.results, f, indent=2)

    def summarize(self) -> dict:
        by_category = {}
        by_difficulty = {}
        for result in self.results:
            cat = result["category"]
            diff = result["difficulty"]
            by_category.setdefault(cat, []).append(result["score"])
            by_difficulty.setdefault(diff, []).append(result["score"])
        total = len(self.results)
        return {
            # Guard against an empty task list
            "overall_score": sum(r["score"] for r in self.results) / total if total else 0.0,
            "by_category": {
                cat: sum(scores) / len(scores)
                for cat, scores in by_category.items()
            },
            "by_difficulty": {
                diff: sum(scores) / len(scores)
                for diff, scores in by_difficulty.items()
            },
            "total_tasks": total,
            "perfect_scores": sum(1 for r in self.results if r["score"] == 1.0)
        }

Tracking Performance Over Time
Benchmarks are only useful if you run them consistently and track trends. Build a simple performance tracker:
# evals/tracker.py
import json
import os
from datetime import datetime, timezone


class PerformanceTracker:
    def __init__(self, history_file: str = "eval_history.json"):
        self.history_file = history_file
        self.history = self._load_history()

    def _load_history(self) -> list:
        if os.path.exists(self.history_file):
            with open(self.history_file) as f:
                return json.load(f)
        return []

    def record(self, summary: dict, model_version: str, commit_sha: str):
        entry = {
            # datetime.utcnow() is deprecated as of Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "commit_sha": commit_sha,
            **summary
        }
        self.history.append(entry)
        with open(self.history_file, "w") as f:
            json.dump(self.history, f, indent=2)

    def check_regression(self, current_summary: dict, threshold: float = 0.05) -> list:
        """Returns list of regressions (categories that dropped by more than threshold).

        Call this before record(); otherwise the current run is compared with itself."""
        if not self.history:
            return []
        last_run = self.history[-1]
        regressions = []
        for category, score in current_summary["by_category"].items():
            # Categories missing from the previous run are treated as unchanged
            last_score = last_run.get("by_category", {}).get(category, score)
            if last_score - score > threshold:
                regressions.append({
                    "category": category,
                    "previous": last_score,
                    "current": score,
                    "drop": last_score - score
                })
        return regressions

Integrating Evals Into CI/CD
Run evaluations automatically on every significant change:
# .github/workflows/eval.yml
name: Agent Evaluation
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * 1'  # Weekly on Monday morning
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluation suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m evals.runner \
            --model gpt-4-turbo \
            --tasks customer_support \
            --output results/eval_$(date +%Y%m%d).json
      - name: Check for regressions
        run: python -m evals.check_regressions --threshold 0.05
      - name: Archive results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/

LLM-as-Judge: When Human Evaluation Doesn't Scale
For tasks where there's no single correct answer (e.g., "write a helpful response"), use an LLM judge:
# evals/llm_judge.py
import json

from langchain_openai import ChatOpenAI

JUDGE_SYSTEM_PROMPT = """You are an expert evaluator of AI agent responses.
Your job is to assess whether an agent response correctly and helpfully addresses a task.
Be rigorous but fair. Score objectively based on accuracy, completeness, and usefulness."""


class LLMJudge:
    def __init__(self, model="gpt-4o"):
        # Use a different (typically stronger) model as judge
        self.llm = ChatOpenAI(model=model, temperature=0)

    async def score(self, task: str, response: str, rubric: str) -> dict:
        prompt = f"""Task: {task}
Agent Response: {response}
Evaluation Rubric: {rubric}
Provide your evaluation as JSON:
{{
  "score": <0.0-1.0>,
  "reasoning": "<brief explanation>",
  "strengths": ["<what the response did well>"],
  "weaknesses": ["<what could be improved>"]
}}"""
        result = await self.llm.ainvoke([
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ])
        # Raises json.JSONDecodeError if the judge returns anything other than bare JSON
        return json.loads(result.content)

Key Principles for Agent Evaluation
1. Run evaluations at multiple temperatures. A model that scores 80% at temperature 0 may score 65% at temperature 0.7. Know your production temperature.
2. Sample size matters. 10 tasks is not a benchmark. Aim for 100+ tasks per category for statistically meaningful results.
3. Separate evaluation from unit testing. Evals are expensive (LLM API calls, time). Run them on a schedule or against significant changes, not on every commit.
4. Track trends, not just snapshots. A one-time 78% score is less useful than knowing you've improved from 62% to 78% over three months.
5. Use your actual production data. The most valuable eval tasks are based on real user queries that your agent struggled with. Collect failures from production and add them to your eval set.
6. Avoid data contamination. If your training data includes AgentBench tasks, your agent's scores are inflated. Use held-out test sets.
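To make the sample-size point concrete, the sketch below computes a Wilson score interval for a measured success rate. The helper name `wilson_interval` is an assumption for illustration; the formula is the standard Wilson interval for a binomial proportion, with z = 1.96 for roughly 95% confidence.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a success rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Same 80% success rate, very different certainty
print(wilson_interval(8, 10))
print(wilson_interval(80, 100))
```

With 10 tasks, an 80% score is compatible with a true rate anywhere from roughly 0.49 to 0.94; with 100 tasks the range narrows to roughly 0.71 to 0.87. Even at 100 tasks the interval is several points wide, so a small drop between two runs is not necessarily a real regression.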
Continuous Monitoring Beyond CI Evals
Scheduled evaluations catch regressions between releases. But production LLM agent performance can also degrade due to API changes, model updates by providers, or slow shifts in user query patterns.
Setting up end-to-end monitoring that regularly exercises your agent with known inputs and validates responses against expected outputs — separately from your CI eval suite — gives you a real-time signal. Platforms like HelpMeTest can run these validation workflows on a schedule with automated alerting, so you find out about performance degradation before your users do.
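A minimal sketch of such a check is shown below, assuming the agent can be called as a plain function from prompt to response. The names `CanaryCheck`, `run_canaries`, and the `alert` callback are hypothetical and not tied to any particular monitoring platform; scheduling (cron, a workflow runner, or a hosted service) is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CanaryCheck:
    name: str
    prompt: str
    validate: Callable[[str], bool]  # Returns True if the response is acceptable


def run_canaries(
    agent: Callable[[str], str],
    checks: list[CanaryCheck],
    alert: Callable[[str], None],
    min_pass_rate: float = 0.9,
) -> float:
    """Run known-input checks against the agent and alert if too many fail."""
    failures = []
    for check in checks:
        try:
            ok = check.validate(agent(check.prompt))
        except Exception:
            ok = False  # An agent or validator error counts as a failure
        if not ok:
            failures.append(check.name)
    pass_rate = 1 - len(failures) / len(checks) if checks else 1.0
    if pass_rate < min_pass_rate:
        alert(f"Canary pass rate {pass_rate:.0%}; failing: {', '.join(failures)}")
    return pass_rate


# Usage with a stub agent that only handles greetings
def stub_agent(prompt: str) -> str:
    return "Hello! How can I help?" if "hi" in prompt.lower() else ""


alerts: list[str] = []
checks = [
    CanaryCheck("greeting", "Hi there", lambda r: "help" in r.lower()),
    CanaryCheck("order", "Where is order #12345?", lambda r: "12345" in r),
]
rate = run_canaries(stub_agent, checks, alerts.append)
print(rate, alerts)
```

Keeping the validators simple and deterministic (substring or schema checks) keeps this signal cheap enough to run far more often than a full eval suite.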
Summary
Evaluating LLM agents requires purpose-built tools and frameworks:
- AgentBench for standardized benchmarks across common task types
- Custom evaluation harnesses for domain-specific task evaluation
- LLM-as-judge for tasks without clear-cut correct answers
- Performance tracking to detect regressions over time
- CI/CD integration for automated evaluation on significant changes
Start with a small but representative eval set, write robust scorers, and run evaluations on a schedule. The signal from consistent evaluation is worth more than any single benchmark score.