W&B Weave for LLM Evaluation: Track, Debug, and Improve AI Apps

HelpMeTest

23 May 2026

Weights & Biases built its reputation on ML experiment tracking — recording every hyperparameter, metric, and artifact from model training runs. Weave extends that discipline to LLM applications, where the "experiment" isn't a training run but a prompt, a retrieval config, or a pipeline change. If you've already used W&B for model training, Weave applies the same rigorous tracking to your LLM evaluation workflow.

This guide covers how to use Weave to trace LLM calls, build evaluation datasets, run structured experiments, and gate deployments on quality metrics.

Why Weave Instead of Just Logging

The naive approach to LLM debugging is to print prompt inputs and model outputs. This breaks down quickly:

You can't compare prompt v1 vs v2 systematically across 200 test cases
You can't reproduce a specific failure state hours later
You can't track whether a fix to one failure introduced a regression elsewhere
You can't quantify quality — "the responses seem better" isn't a metric

Weave gives you a database of every call, a structured way to compare versions, and evaluators that turn "seems better" into a number you can threshold in CI.
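
As a minimal sketch of what "a number you can threshold" means, independent of Weave: per-example pass/fail judgments reduce to a pass rate, which can then be compared against a cutoff. The name `pass_rate`, the sample verdicts, and the 0.85 cutoff are illustrative assumptions, not part of any library.

```python
def pass_rate(judgments: list[bool]) -> float:
    """Fraction of examples judged acceptable (0.0 for an empty list)."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


# Hypothetical per-example verdicts from a reviewer or an automated judge
judgments = [True, True, False, True]
rate = pass_rate(judgments)
meets_threshold = rate >= 0.85  # illustrative cutoff
```

Once quality is a single float, "better" and "worse" become comparisons a script can make without a human in the loop.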

Installation

pip install weave openai  # or anthropic, langchain, etc.

Basic Instrumentation

import weave
import openai

# Initialize Weave with your W&B project
weave.init("my-llm-project")

# Decorate functions you want to trace
@weave.op()
def generate_answer(question: str, context: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the provided context only.",
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content

# Call it normally — Weave captures everything automatically
answer = generate_answer(
    question="What is the return policy?",
    context="Returns accepted within 30 days with receipt.",
)

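Every call to generate_answer is now logged to W&B with inputs, outputs, and latency; for supported clients such as OpenAI, Weave's integration also records token usage. Beyond the weave.init() call, the @weave.op() decorator is the only code change required.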
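
To make the decorator's role concrete, here is a toy sketch of the idea in plain Python. It is not Weave's implementation: `traced`, `CALL_LOG`, and `shout` are hypothetical names, and the log lives in memory rather than in W&B.

```python
import functools
import time

CALL_LOG: list[dict] = []  # in-memory stand-in for a tracing backend


def traced(fn):
    """Record inputs, output, and latency for each call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        CALL_LOG.append({
            "op": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper


@traced
def shout(text: str) -> str:
    return text.upper()


shout("hello")
```

The wrapped function behaves exactly as before from the caller's point of view, which is why adding tracing this way requires no changes at call sites.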

Tracing Multi-Step Pipelines

For RAG pipelines, decorate each step:

@weave.op()
def retrieve_context(query: str, top_k: int = 5) -> list[str]:
    """Retrieve relevant documents from vector store."""
    # vector_store is assumed to be an already-initialized, LangChain-style store
    results = vector_store.similarity_search(query, k=top_k)
    return [doc.page_content for doc in results]

@weave.op()
def rerank_context(query: str, candidates: list[str]) -> list[str]:
    """Rerank candidates by relevance."""
    # relevance_score is a placeholder for your own scoring function (not defined here)
    return sorted(candidates, key=lambda x: relevance_score(query, x), reverse=True)

@weave.op()
def rag_pipeline(question: str) -> str:
    """Full RAG pipeline — traced as a tree of spans."""
    contexts = retrieve_context(question)
    reranked = rerank_context(question, contexts)
    answer = generate_answer(question, "\n".join(reranked[:3]))
    return answer

In the W&B UI, you'll see rag_pipeline as the root span with retrieve_context, rerank_context, and generate_answer as children. When a response is wrong, you can immediately see whether retrieval returned bad chunks or generation hallucinated from good chunks.

Building Evaluation Datasets

Weave datasets are versioned collections of examples. Create them from your best manually verified examples:

import weave

dataset = weave.Dataset(
    name="customer-support-qa",
    rows=[
        {
            "question": "How do I reset my password?",
            "context": "Navigate to Settings > Security > Reset Password and follow the prompts.",
            "expected": "Go to Settings, then Security, then Reset Password.",
        },
        {
            "question": "What payment methods do you accept?",
            "context": "We accept Visa, Mastercard, PayPal, and bank transfers.",
            "expected": "Visa, Mastercard, PayPal, and bank transfers.",
        },
        {
            "question": "Can I get a refund after 60 days?",
            "context": "Returns accepted within 30 days with receipt.",
            "expected": "No, refunds are only available within 30 days.",
        },
    ],
)

# Requires an earlier weave.init("my-llm-project") call in the same process
weave.publish(dataset)

Datasets are versioned automatically — if you add examples or correct labels, the previous version is preserved.
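
Because every published version is preserved, it can be worth checking rows before they go in. The helper below is a hedged sketch in plain Python: `validate_rows` and `REQUIRED_KEYS` are hypothetical names, and the required keys simply mirror the columns used in the example dataset above.

```python
REQUIRED_KEYS = {"question", "context", "expected"}


def validate_rows(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the rows look usable."""
    problems = []
    seen_questions = set()
    for i, row in enumerate(rows):
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
        # Normalize case and whitespace so near-identical questions count as duplicates
        question = row.get("question", "").strip().lower()
        if question and question in seen_questions:
            problems.append(f"row {i}: duplicate question")
        seen_questions.add(question)
    return problems


rows = [
    {
        "question": "How do I reset my password?",
        "context": "Navigate to Settings > Security > Reset Password.",
        "expected": "Go to Settings, then Security, then Reset Password.",
    },
    {
        "question": "how do I reset my password? ",
        "context": "Navigate to Settings > Security > Reset Password.",
    },
]
problems = validate_rows(rows)
```

Running a check like this before publishing keeps malformed rows from becoming part of a version that later evaluations depend on.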

Writing Evaluators

Evaluators score your model's output for each example. Weave supports both LLM-as-judge and deterministic evaluators:

import json

import openai
import weave
from weave import Evaluation

class AnswerCorrectness(weave.Scorer):
    """LLM-as-judge: is the answer factually correct given the context?"""
    
    @weave.op()
    def score(self, output: str, expected: str) -> dict:
        client = openai.OpenAI()
        
        judge_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a judge evaluating answer quality. "
                        "Score the answer as 1 (correct) or 0 (incorrect) based on "
                        "whether it matches the expected answer semantically. "
                        "Respond with JSON: {\"correct\": 0 or 1, \"reason\": \"...\"}"
                    ),
                },
                {
                    "role": "user",
                    "content": f"Expected: {expected}\nActual: {output}",
                },
            ],
            response_format={"type": "json_object"},
        )
        
        result = json.loads(judge_response.choices[0].message.content)
        return {"correct": result["correct"], "reason": result["reason"]}


class AnswerLength(weave.Scorer):
    """Deterministic: is the answer concise?"""
    max_words: int = 50
    
    @weave.op()
    def score(self, output: str) -> dict:
        word_count = len(output.split())
        return {
            "concise": word_count <= self.max_words,
            "word_count": word_count,
        }
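
An LLM judge can return malformed or unexpected JSON, and an unhandled parse error would abort the scoring of that example. A defensive parser is one way to handle this; the sketch below is an assumption about how you might do it, with `parse_judge_verdict` as a hypothetical helper. It counts anything unparseable as incorrect, which is a deliberately conservative choice.

```python
import json


def parse_judge_verdict(raw: str) -> dict:
    """Parse a judge reply, treating anything malformed as a failed example."""
    try:
        data = json.loads(raw)
        correct = int(data["correct"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"correct": 0, "reason": "unparseable judge response"}
    if correct not in (0, 1):
        return {"correct": 0, "reason": f"unexpected score: {correct}"}
    return {"correct": correct, "reason": str(data.get("reason", ""))}


good = parse_judge_verdict('{"correct": 1, "reason": "matches"}')
bad = parse_judge_verdict("Sure! The answer is correct.")
```

If you would rather exclude judge failures from the average than count them as wrong, return a distinct marker instead and filter it out before aggregating.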

Running Evaluations

evaluation = Evaluation(
    dataset=dataset,
    scorers=[AnswerCorrectness(), AnswerLength(max_words=50)],
)

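# Run the evaluation against your pipeline.
# Top-level await works in notebooks; in a script, use asyncio.run(evaluation.evaluate(rag_pipeline)).
results = await evaluation.evaluate(rag_pipeline)

# Numeric scores are summarized under "mean"; boolean scores under "true_fraction"
print(f"Correctness: {results['AnswerCorrectness']['correct']['mean']:.2%}")
print(f"Conciseness: {results['AnswerLength']['concise']['true_fraction']:.2%}")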

Results are logged to W&B with full traceability — you can click into any example and see the exact prompt, retrieved context, model output, and judge reasoning.

Comparing Prompt Versions

The killer feature for iterative development: compare prompt A vs prompt B across the same dataset.

# Version 1: Simple system prompt
@weave.op()
def pipeline_v1(question: str) -> str:
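    client = openai.OpenAI()
    contexts = retrieve_context(question)
    # Join outside the f-string: backslashes in f-string expressions are a SyntaxError before Python 3.12
    context_text = "\n".join(contexts)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on context only."},
            {"role": "user", "content": f"Context: {context_text}\n\n{question}"},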
        ],
    )
    return response.choices[0].message.content

# Version 2: More structured prompt
@weave.op()
def pipeline_v2(question: str) -> str:
    client = openai.OpenAI()
    contexts = retrieve_context(question)
    # Join outside the f-string: backslashes in f-string expressions are a SyntaxError before Python 3.12
    context_text = "\n".join(contexts)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a precise support agent. "
                    "Answer using ONLY the provided context. "
                    "If the context doesn't contain the answer, say 'I don't have that information.' "
                    "Keep answers under 50 words."
                ),
            },
            {"role": "user", "content": f"Context: {context_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Evaluate both
results_v1 = await evaluation.evaluate(pipeline_v1)
results_v2 = await evaluation.evaluate(pipeline_v2)

print(f"V1 correctness: {results_v1['AnswerCorrectness']['correct']['mean']:.2%}")
print(f"V2 correctness: {results_v2['AnswerCorrectness']['correct']['mean']:.2%}")

W&B shows both experiments side-by-side. You can see which examples improved and which regressed.
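
If you also want that improved/regressed breakdown in a script, the comparison is a per-example diff of scores. The sketch below assumes you have already collected per-example scores keyed by some example ID; `diff_scores`, the IDs, and the score values are hypothetical, and how you export per-example scores from your runs is left open.

```python
def diff_scores(
    baseline: dict[str, int], candidate: dict[str, int]
) -> dict[str, list[str]]:
    """Bucket example IDs by whether the candidate scored higher, lower, or the same."""
    buckets: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    # Only compare examples present in both runs
    for example_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[example_id], candidate[example_id]
        if after > before:
            buckets["improved"].append(example_id)
        elif after < before:
            buckets["regressed"].append(example_id)
        else:
            buckets["unchanged"].append(example_id)
    return buckets


# Hypothetical per-example correctness scores (1 = correct, 0 = incorrect)
v1 = {"reset-password": 1, "payment-methods": 0, "late-refund": 1}
v2 = {"reset-password": 1, "payment-methods": 1, "late-refund": 0}
report = diff_scores(v1, v2)
```

Here both versions have the same mean, yet one example regressed. This is the case an aggregate score hides and a per-example view exposes.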

CI Integration

Block deployments when eval scores drop:

# scripts/run_llm_evals.py
import asyncio
import sys
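import weave
from weave import Evaluation

# Hypothetical module path: import the scorers and pipeline from wherever your project defines them
# from my_app.evals import AnswerCorrectness, AnswerLength, rag_pipeline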

async def main():
    weave.init("my-llm-project")
    
    dataset = weave.ref("customer-support-qa:latest").get()
    
    evaluation = Evaluation(
        dataset=dataset,
        scorers=[AnswerCorrectness(), AnswerLength()],
    )
    
    results = await evaluation.evaluate(rag_pipeline)
    
    correctness = results["AnswerCorrectness"]["correct"]["mean"]
    print(f"Correctness: {correctness:.2%}")
    
    if correctness < 0.85:
        print(f"FAIL: correctness {correctness:.2%} below 85% threshold")
        sys.exit(1)
    
    print("PASS")

asyncio.run(main())

# .github/workflows/llm-eval.yml
name: LLM Quality Gate

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install weave openai
      - name: Run LLM evaluations
        env:
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/run_llm_evals.py
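
The script above gates on a single metric. If you want to gate on several at once, one possible shape is a small helper that collects every failure before exiting, so a single CI run reports all of them. This is a sketch under assumptions: `gate`, the metric names, and the thresholds are illustrative, and the metric values would come from your evaluation results.

```python
def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one failure message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.2%} < {minimum:.2%}")
    return failures


# Hypothetical metric values and minimums
failures = gate(
    metrics={"correctness": 0.82, "conciseness": 0.95},
    thresholds={"correctness": 0.85, "conciseness": 0.90},
)
exit_code = 1 if failures else 0  # pass this to sys.exit() in a real script
```

Treating a missing metric as a failure guards against a renamed scorer silently disabling the gate.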

What Weave Doesn't Cover

Weave is excellent for tracking LLM call quality, but it operates at the model output level. It can tell you that your pipeline's answers scored 92% on correctness. It can't tell you whether:

The login flow still works after you changed the authentication middleware
The UI renders the AI response correctly on mobile
The streaming response doesn't break when the model outputs a code block

End-to-end application testing with HelpMeTest covers the user-facing layer. Run Weave evaluations to gate on model quality, and end-to-end tests to gate on application behavior — both are required for confident deployments.

Summary

W&B Weave brings experiment-tracking discipline to LLM development. Instrument with @weave.op(), build versioned datasets from your best examples, write evaluators that turn quality judgments into numbers, and block deployments when scores drop below a threshold. The combination of tracing, versioned datasets, and structured comparisons makes it a natural fit for teams already using W&B for ML training.