Agent Evaluation Frameworks: RAGAS, PromptFoo, and LangSmith for Agentic Pipelines
Evaluating a traditional machine learning model is a solved problem. You have a test set, a metric, and a number. Evaluating an agentic AI pipeline is not solved, and the gap between the two is larger than most teams expect when they first try to apply standard ML evaluation thinking to agents.
An agent takes multi-step actions. It calls tools. It reasons across multiple LLM invocations. It can be right in the final answer while being wrong in the reasoning path, or wrong in the final answer despite correct intermediate steps. Standard accuracy metrics tell you almost nothing useful about this.
This guide covers the evaluation frameworks that actually work for agentic pipelines: RAGAS for retrieval-augmented agents, PromptFoo for prompt-level regression testing, and LangSmith evaluators for online and offline agent evaluation. We will also cover how to build ground truth datasets and how to detect hallucinations in multi-step pipelines.
Why LLM Evaluation Differs from Agent Evaluation
For a single LLM call, you can evaluate:
- Output quality: Does the response answer the question?
- Faithfulness: Is the response grounded in provided context?
- Format compliance: Is the output structured as expected?
For an agent, you additionally need to evaluate:
- Trajectory correctness: Did the agent take the right sequence of steps?
- Tool selection accuracy: Did it use the right tools in the right order?
- Goal completion: Did the agent actually accomplish what was asked?
- Efficiency: Did it converge in a reasonable number of steps?
- Failure handling: Did it recover gracefully from tool errors?
The trajectory dimension is what makes agent evaluation fundamentally harder. An agent that produces a correct final answer via the wrong path is not correct; it got lucky. A wrong path will produce wrong answers on slightly different inputs that the correct path would handle fine.
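One way to make the trajectory dimension measurable is to score the tool-call sequence separately from the final answer. The sketch below is a minimal illustration, not a method taken from any of the frameworks in this guide: it compares the actual and expected tool sequences using the longest common subsequence, so extra or reordered steps lower the score. The tool names and the `trajectory_similarity` helper are assumptions made for the example.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two tool-call sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trajectory_similarity(actual: list[str], expected: list[str]) -> float:
    """1.0 for an identical sequence; lower when steps are missing, extra, or reordered."""
    if not actual and not expected:
        return 1.0
    return lcs_length(actual, expected) / max(len(actual), len(expected))


# Hypothetical tool names, for illustration only.
expected = ["search", "synthesize"]
print(trajectory_similarity(["search", "synthesize"], expected))
print(trajectory_similarity(["synthesize", "search"], expected))
print(trajectory_similarity(["search", "calculator", "synthesize"], expected))
```

Reporting this score next to answer correctness makes a "right answer, wrong path" run visible instead of counting it as a pass.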
RAGAS for Agentic RAG Pipelines
RAGAS (Retrieval-Augmented Generation Assessment) was originally designed for RAG system evaluation, but its metrics apply well to any agentic pipeline that retrieves and synthesizes information.
pip install ragas langchain-openai
Core RAGAS Metrics for Agent Evaluation
Faithfulness measures whether every claim in the agent's answer can be inferred from the retrieved context. A faithfulness score of 0.5 means half of the agent's claims are not supported by that context and should be treated as likely hallucinations.
Answer Relevancy measures whether the answer is on-topic for the question. An agent that retrieves irrelevant documents and summarizes them will score low here.
Context Precision measures whether the retrieved documents are relevant to the question. Low context precision means your retriever is pulling noise.
Context Recall (requires ground truth) measures whether all the information needed to answer the question was actually retrieved.
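To make the scores concrete, here is a toy calculation of the ratio behind faithfulness and context recall. It is a simplified sketch, not RAGAS's implementation: RAGAS uses an LLM to extract claims and judge whether each one is supported, whereas this example assumes those per-claim verdicts are already available as hand-labeled booleans. The `supported_fraction` helper is a name chosen for illustration.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Fraction of items judged as supported (True) out of all items."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return sum(verdicts) / len(verdicts)


# Faithfulness: one verdict per claim in the answer,
# True if the claim can be inferred from the retrieved context.
answer_claims_supported = [True, False]
faithfulness_score = supported_fraction(answer_claims_supported)

# Context recall: one verdict per statement in the ground truth,
# True if the statement can be found in the retrieved context.
ground_truth_statements_retrieved = [True, True, False, True]
context_recall_score = supported_fraction(ground_truth_statements_retrieved)

print(f"faithfulness={faithfulness_score:.2f}")
print(f"context_recall={context_recall_score:.2f}")
```

With one of two claims unsupported, faithfulness comes out at 0.5, which is the case described above.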
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
# Build evaluation dataset from agent runs
eval_data = {
    "question": [
        "What are the refund policies for enterprise customers?",
        "How does the rate limiting work for the API?",
    ],
    "answer": [
        "Enterprise customers have a 90-day refund window with full credit.",
        "The API uses a sliding window of 1000 requests per minute per organization.",
    ],
    "contexts": [
        [
            "Enterprise plan includes 90-day money-back guarantee with full credit to account.",
            "Refunds are processed within 5 business days.",
        ],
        [
            "Rate limits apply per organization. Default: 1000 RPM.",
            "Sliding window algorithm resets every 60 seconds.",
        ],
    ],
    "ground_truth": [
        "Enterprise customers get a 90-day full refund window.",
        "API rate limits are 1000 requests per minute using a sliding window.",
    ],
}
dataset = Dataset.from_dict(eval_data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.95, 'context_recall': 0.83}
Evaluating Agent Trajectories with RAGAS
For multi-step agents, collect the trajectory and evaluate each step's retrieved context:
from ragas.metrics import faithfulness, answer_relevancy
from ragas import evaluate
from datasets import Dataset
def collect_agent_trajectory(agent, question: str) -> dict:
    """Runs agent and collects all retrieved contexts across steps."""
    all_contexts = []
    steps = []
    for step in agent.stream({"question": question}):
        if "context" in step:
            all_contexts.extend(step["context"])
        steps.append(step)
    # Guard against an agent that yields no steps at all
    final_answer = steps[-1].get("answer", "") if steps else ""
    return {"question": question, "answer": final_answer, "contexts": all_contexts}
# Evaluate across a test set
# (`agent` and `test_questions` are assumed to be defined elsewhere)
trajectories = [collect_agent_trajectory(agent, q) for q in test_questions]
dataset = Dataset.from_list(trajectories)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
Hallucination Detection in Multi-Step Pipelines
Hallucinations in multi-step pipelines compound. If an agent hallucinates a "fact" in step 2, step 3 reasons from that false premise. By step 5, the hallucination has been woven into multiple parts of the answer.
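A rough back-of-the-envelope model shows why compounding matters. The sketch below assumes each step is faithful with the same probability and that steps fail independently. Both are simplifying assumptions, not measured properties of any real agent, and the function name is chosen for illustration. Under those assumptions, the chance that an entire run is free of hallucinations is the per-step rate raised to the number of steps, so a 95% per-step rate leaves roughly a 77% chance that a five-step run is clean.

```python
def chain_clean_probability(step_faithfulness: float, num_steps: int) -> float:
    """Probability that every step is faithful, assuming independent steps."""
    if not 0.0 <= step_faithfulness <= 1.0:
        raise ValueError("step_faithfulness must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_faithfulness ** num_steps


# Illustrative per-step rate; real agents need to be measured.
for n in (1, 3, 5, 10):
    p = chain_clean_probability(0.95, n)
    print(f"{n:>2} steps: {p:.3f}")
```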
Detect this by evaluating faithfulness at each step, not just the final output:
def check_step_faithfulness(step_output: str, step_contexts: list[str]) -> float:
    """Returns faithfulness score for a single agent step."""
    dataset = Dataset.from_dict({
        "question": ["intermediate step"],
        "answer": [step_output],
        "contexts": [step_contexts],
    })
    result = evaluate(dataset, metrics=[faithfulness])
    return result["faithfulness"]
# Flag steps with faithfulness below threshold
# (`agent_trajectory` is assumed to be a list of per-step dicts
# with "output" and "contexts" keys)
FAITHFULNESS_THRESHOLD = 0.7
for i, step in enumerate(agent_trajectory):
    score = check_step_faithfulness(step["output"], step["contexts"])
    if score < FAITHFULNESS_THRESHOLD:
        print(f"WARNING: Step {i} faithfulness={score:.2f} (potential hallucination)")
PromptFoo for Agent Prompt Testing
PromptFoo specializes in testing prompts and LLM behavior systematically. It is particularly useful for regression testing — catching when a prompt change causes an agent to behave differently on previously-working inputs — and for red-teaming.
npm install -g promptfoo
# or
pip install promptfoo
Basic PromptFoo Configuration for Agent Testing
# promptfooconfig.yaml
description: "Research agent prompt regression tests"
providers:
  - id: openai:gpt-4o
    config:
      temperature: 0
      systemPrompt: |
        You are a research assistant. Given a question, decide whether to:
        1. Search for more information (respond with SEARCH: <query>)
        2. Provide a final answer (respond with ANSWER: <text>)
        Always prefer searching before answering if you are not certain.
prompts:
  - "{{question}}"
tests:
  - description: "Routes to search for factual questions"
    vars:
      question: "What is the current market cap of Apple?"
    assert:
      - type: contains
        value: "SEARCH:"
      - type: not-contains
        value: "ANSWER:"
  - description: "Provides direct answer for definitional questions"
    vars:
      question: "What does API stand for?"
    assert:
      - type: contains
        value: "ANSWER:"
      - type: llm-rubric
        value: "The answer correctly defines API as Application Programming Interface"
  - description: "Does not hallucinate specific numbers"
    vars:
      question: "What was Tesla's revenue in Q3 2019?"
    assert:
      - type: contains
        value: "SEARCH:"
      - type: not-contains
        value: "$" # Should not guess numbers without searching
Run it:
promptfoo eval --config promptfooconfig.yaml
promptfoo view # Opens results dashboard
Red-Teaming Agent Prompts with PromptFoo
PromptFoo's red-teaming plugin generates adversarial inputs automatically:
# redteam-config.yaml
description: "Red-team the research agent"
providers:
  - openai:gpt-4o
redteam:
  plugins:
    - prompt-injection # Tests for system prompt leakage
    - jailbreak # Tests for policy bypass
    - hallucination # Tests for confabulation
    - overreliance # Tests if agent over-trusts user input
  numTests: 20
prompts:
  - "{{question}}"
promptfoo redteam run --config redteam-config.yaml
Red-teaming is worth running before every significant prompt change and before shipping a new agent capability. The cost is minimal; the protection against embarrassing failures is high.
CI Integration for PromptFoo
# .github/workflows/agent-eval.yaml
name: Agent Prompt Regression Tests
on:
  pull_request:
    paths:
      - "prompts/**"
      - "agents/**"
jobs:
  promptfoo-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install promptfoo
        run: npm install -g promptfoo
      - name: Run agent prompt tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: promptfoo eval --config promptfooconfig.yaml --ci
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: promptfoo-results
          path: .promptfoo/output.json
The --ci flag makes promptfoo exit with a non-zero status code if any tests fail, blocking the merge.
LangSmith Evaluators
LangSmith provides both online evaluation (evaluating real production traces) and offline evaluation (running evaluators against a curated dataset). Both are valuable; they answer different questions.
Offline Evaluation: Building Ground Truth Datasets
Ground truth datasets for agents are expensive to build and easy to build wrong. Avoid these common mistakes:
- Don't use LLM output as ground truth — if you generate ground truth with GPT-4o and evaluate with GPT-4o, you are measuring self-consistency, not correctness
- Don't use the same model for judge and evaluated — use a different, ideally stronger model as judge
- Do capture the full trajectory — not just input/output pairs
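The checklist above can be partly enforced in code before examples are uploaded. The following is a minimal sketch under stated assumptions: the `source` field marking human validation, the required key names, and both helper functions are hypothetical conventions for this example, not part of LangSmith or any other library.

```python
# Trajectory fields every ground truth example is expected to carry (assumed convention).
REQUIRED_OUTPUT_KEYS = {"answer", "steps_taken", "tools_used"}


def validate_example(example: dict) -> list[str]:
    """Returns a list of problems found in one ground truth example."""
    problems = []
    outputs = example.get("outputs", {})
    missing = REQUIRED_OUTPUT_KEYS - set(outputs)
    if missing:
        problems.append(f"missing trajectory fields: {sorted(missing)}")
    # Hypothetical provenance field: who produced the expected answer.
    if example.get("source") != "human":
        problems.append("ground truth is not marked as human-validated")
    return problems


def check_judge_independence(agent_model: str, judge_model: str) -> None:
    """Raises if the judge is the same model as the one being evaluated."""
    if agent_model == judge_model:
        raise ValueError(f"judge and evaluated model are both {agent_model!r}")


good = {
    "inputs": {"question": "How does HTTPS work?"},
    "outputs": {"answer": "...", "steps_taken": ["search_web"], "tools_used": ["web_search"]},
    "source": "human",
}
bad = {"inputs": {"question": "How does HTTPS work?"}, "outputs": {"answer": "..."}, "source": "llm"}
print(validate_example(good))
print(validate_example(bad))
```

Running a check like this in CI keeps input/output-only examples and LLM-generated references out of the dataset without relying on reviewers to catch them.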
# Building a ground truth dataset from human-validated examples
import langsmith
from langsmith import Client
client = Client()
# Create a dataset
dataset = client.create_dataset(
    dataset_name="research-agent-eval-v1",
    description="Human-validated research agent examples"
)
# Add examples with full expected trajectories
examples = [
    {
        "inputs": {"question": "How does HTTPS work?"},
        "outputs": {
            "answer": "HTTPS uses TLS to encrypt HTTP traffic...",
            "steps_taken": ["search_web", "synthesize"],
            "tools_used": ["web_search"],
            "expected_tool_calls": 1,
        }
    },
    # ... more examples
]
client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id,
)
Offline Evaluation with Custom Evaluators
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langsmith import Client
import openai
client = Client()
def trajectory_evaluator(run, example):
    """Evaluates whether the agent used the right tools in the right order."""
    predicted_tools = run.outputs.get("tools_used", [])
    expected_tools = example.outputs.get("tools_used", [])
    if predicted_tools == expected_tools:
        return {"key": "trajectory_match", "score": 1.0, "comment": "Correct tool sequence"}
    elif set(predicted_tools) == set(expected_tools):
        return {"key": "trajectory_match", "score": 0.5, "comment": "Right tools, wrong order"}
    else:
        missing = set(expected_tools) - set(predicted_tools)
        return {"key": "trajectory_match", "score": 0.0, "comment": f"Missing tools: {missing}"}
def answer_correctness_evaluator(run, example):
    """Uses GPT-4o as judge for semantic answer correctness."""
    # Per the guidance above, pick a judge model different from the one the agent runs on.
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Rate the following answer on a scale of 0-1 for correctness.
Question: {example.inputs['question']}
Expected: {example.outputs['answer']}
Actual: {run.outputs.get('answer', '')}
Respond with only a number between 0 and 1."""
        }]
    )
    raw = response.choices[0].message.content.strip()
    try:
        # Clamp to [0, 1] in case the judge drifts outside the requested range
        score = max(0.0, min(1.0, float(raw)))
    except ValueError:
        # The judge ignored the format instruction; record that instead of crashing the run
        return {"key": "answer_correctness", "score": None, "comment": f"Unparseable judge reply: {raw!r}"}
    return {"key": "answer_correctness", "score": score}
# Run evaluation (`agent` is the agent under test, assumed to be defined elsewhere)
results = evaluate(
    lambda inputs: agent.invoke(inputs),
    data="research-agent-eval-v1",
    evaluators=[trajectory_evaluator, answer_correctness_evaluator],
    experiment_prefix="agent-v2-eval",
    max_concurrency=4,
)
Online Evaluation: Evaluating Production Traces
Online evaluation runs evaluators against real user interactions captured in LangSmith. This catches distribution shift — when the real inputs your users send differ from your curated test set.
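# Set up online evaluator via LangSmith UI or API
# This runs on every production trace matching the filter
from datetime import datetime, timedelta
from langsmith import Client
client = Client()
# Create an automation rule that evaluates traces in real-time
# (typically done via the LangSmith dashboard)
# The evaluator runs on every trace from your production project
# You can also query and batch-evaluate historical traces:
runs = client.list_runs(
    project_name="production-agent",
    start_time=datetime.now() - timedelta(days=7),
    filter='eq(status, "success")',
)
def answer_present_evaluator(run):
    """Hypothetical reference-free check: production traces have no expected answer."""
    answer = str((run.outputs or {}).get("answer", ""))
    return {"key": "answer_present", "score": 1.0 if answer.strip() else 0.0}
for run in runs:
    # answer_correctness_evaluator reads example.inputs and example.outputs, so it
    # cannot be called with example=None. Use a reference-free evaluator here and
    # swap in a reference-free LLM judge where a simple check is not enough.
    score = answer_present_evaluator(run)
    client.create_feedback(
        run_id=run.id,
        key=score["key"],
        score=score["score"],
    )
The Cost of Evaluation at Scale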
Running evaluators is not free. A dataset of 500 examples evaluated with an LLM judge at $0.01 per call costs $5 per evaluation run. At 20 CI runs per day across a team, that is $100/day — $3,000/month for evaluation alone.
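The arithmetic is easy to rerun with your own numbers. This small sketch reproduces the figures above and, for comparison, prices a 50-example subset at the same per-call cost. The function name and the 30-day month are assumptions for illustration.

```python
def eval_cost(num_examples: int, cost_per_judge_call: float,
              runs_per_day: int, days_per_month: int = 30) -> dict:
    """Cost of LLM-judge evaluation per run, per day, and per month (in dollars)."""
    per_run = num_examples * cost_per_judge_call
    per_day = per_run * runs_per_day
    per_month = per_day * days_per_month
    return {"per_run": per_run, "per_day": per_day, "per_month": per_month}


full_set = eval_cost(num_examples=500, cost_per_judge_call=0.01, runs_per_day=20)
small_set = eval_cost(num_examples=50, cost_per_judge_call=0.01, runs_per_day=20)

for name, cost in (("500 examples", full_set), ("50 examples", small_set)):
    print(f"{name}: ${cost['per_run']:.2f}/run, "
          f"${cost['per_day']:.2f}/day, ${cost['per_month']:.2f}/month")
```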
Strategies to manage evaluation cost:
- Tiered evaluation — fast heuristic checks (regex, schema validation, exact match) run on every PR; expensive LLM-judge evaluation runs weekly or pre-release
- Stratified sampling — maintain a small "gold set" of 50 examples for fast CI, a larger set for thorough evaluation
- Cache evaluations — if the input/output hash matches a previous run, reuse the score
- Use smaller judge models — GPT-4o-mini or Claude Haiku are adequate judges for many evaluation tasks at 10x lower cost
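The tiered strategy can be sketched as a wrapper that runs cheap checks first and calls the LLM judge only when those pass and the expensive tier is enabled. This is an illustrative sketch, not a framework feature: the function names are hypothetical, the judge is passed in as a plain callable, and the format check assumes the SEARCH:/ANSWER: output convention used in the PromptFoo example earlier.

```python
import re
from typing import Callable, Optional

# Assumes the agent replies with "SEARCH: <query>" or "ANSWER: <text>".
ROUTING_PATTERN = re.compile(r"^(SEARCH|ANSWER):\s*\S")


def cheap_checks(output: str) -> Optional[str]:
    """Fast tier: returns a failure reason, or None if the output passes."""
    text = output.strip()
    if not text:
        return "empty output"
    if not ROUTING_PATTERN.match(text):
        return "output does not match the SEARCH:/ANSWER: format"
    return None


def tiered_evaluate(output: str, llm_judge: Callable[[str], float],
                    run_llm_judge: bool) -> dict:
    """Runs heuristics on every call; invokes the judge only when enabled."""
    failure = cheap_checks(output)
    if failure is not None:
        # No point paying for a judge call on output that is already broken.
        return {"tier": "heuristic", "score": 0.0, "comment": failure}
    if not run_llm_judge:
        return {"tier": "heuristic", "score": 1.0, "comment": "passed cheap checks"}
    return {"tier": "llm_judge", "score": llm_judge(output), "comment": "judged by LLM"}


# Stand-in judge for demonstration; a real one would call a model.
def demo_judge(output: str) -> float:
    return 0.8


print(tiered_evaluate("", demo_judge, run_llm_judge=True))
print(tiered_evaluate("ANSWER: Application Programming Interface", demo_judge, run_llm_judge=False))
print(tiered_evaluate("SEARCH: apple market cap", demo_judge, run_llm_judge=True))
```

In a PR workflow `run_llm_judge` would be false, while a weekly or pre-release job would set it to true.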
# Caching evaluation results in CI
import hashlib
import json
# In-memory cache; persist it between CI runs if scores should be reused across jobs
EVALUATION_CACHE = {}
def cached_evaluate(run, example, evaluator_fn):
    """Skip evaluation if this exact input/output was evaluated before."""
    cache_key = hashlib.md5(
        json.dumps({"input": run.inputs, "output": run.outputs}, sort_keys=True).encode()
    ).hexdigest()
    if cache_key in EVALUATION_CACHE:
        return EVALUATION_CACHE[cache_key]
    result = evaluator_fn(run, example)
    EVALUATION_CACHE[cache_key] = result
    return result
Full CI Pipeline for Agent Evaluation
# .github/workflows/agent-full-eval.yaml
name: Agent Evaluation Suite
on:
  schedule:
    - cron: "0 2 * * *" # Run nightly
  workflow_dispatch:
jobs:
  ragas-eval:
    name: RAGAS Faithfulness & Relevancy
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install ragas langchain-openai datasets
      - run: python evals/run_ragas.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  promptfoo-regression:
    name: PromptFoo Prompt Regression
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfoo
      - run: promptfoo eval --config evals/promptfooconfig.yaml --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  langsmith-eval:
    name: LangSmith Ground Truth Evaluation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install langsmith langchain-openai
      - run: python evals/run_langsmith_eval.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
          LANGCHAIN_PROJECT: "ci-eval"
End-to-End Monitoring Beyond Evaluation
Evaluation frameworks answer "how good is the model?" Monitoring answers "is the live application working right now?"
These are different questions and they need different tools. A perfectly-evaluated agent can fail in production because of a slow tool API, a UI regression, or a prompt formatting change that only affects certain input patterns.
HelpMeTest is built for the monitoring layer — it runs continuous end-to-end health checks against your live AI application using Robot Framework and Playwright, verifying that your agent-powered features work correctly from the user's perspective on an ongoing basis. For teams that have built out their RAGAS and LangSmith evaluation pipelines, HelpMeTest fills the production monitoring gap that evaluation alone cannot cover. At $100/month, it is the operational complement to your offline evaluation suite.
The evaluation landscape for agentic AI is still maturing, but the combination of RAGAS for retrieval quality, PromptFoo for prompt regression, and LangSmith for trajectory and semantic evaluation covers the three most critical dimensions. Start with RAGAS faithfulness — it catches the most common and most harmful failure mode (hallucination) and is the fastest to set up. Add PromptFoo regression tests when your prompts stabilize. Add LangSmith evaluators when you need to track quality over time against a curated ground truth. Run all of it in CI. Monitor production with an independent health check layer.