LLM-as-Judge: Evaluation Patterns for AI Application Testing

Testing LLM applications is hard because the outputs are open-ended text — there's no single correct answer. LLM-as-judge sidesteps this by using a capable model to score another model's outputs according to defined rubrics. This guide covers the key evaluation patterns: G-Eval for rubric-based scoring, pairwise comparison for model selection, reference-free evaluation for open-ended tasks, and how to build reliable LLM judge pipelines in CI.

Key Takeaways

G-Eval produces calibrated scores, not binary pass/fail. It gives a probability-weighted score across 1-5 using chain-of-thought reasoning, which correlates better with human judgment than simple binary checks.

Use pairwise comparison for model selection decisions. Ask the judge "which response is better, A or B?" — models are much more consistent at relative ranking than absolute scoring.

Faithfulness and relevance require different judge prompts. A judge prompt for RAG faithfulness (does the answer stay grounded in context?) differs fundamentally from one for relevance (does it answer the question?). Write separate prompts.

Always evaluate your judge before trusting it. Run the judge on a small golden dataset with known human labels. If judge agreement with humans is below 80%, your judge prompt needs work.

LLM judge variance is real — run multiple samples. Temperature-based randomness means a single judge call can disagree with itself. Run 3-5 samples and take the majority vote for critical evaluations.
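
The multi-sample advice in the last takeaway can be sketched as a small aggregation helper. This is a minimal illustration, assuming each judge call has already been reduced to an integer score; `aggregate_votes` is a hypothetical name, and falling back to the lower median when no score wins a strict majority is one reasonable choice among several.

```python
from collections import Counter
from statistics import median_low

def aggregate_votes(scores: list[int]) -> int:
    """Return the majority score, or the lower median when no strict majority exists."""
    if not scores:
        raise ValueError("need at least one judge sample")
    value, count = Counter(scores).most_common(1)[0]
    if count > len(scores) / 2:
        return value
    # No strict majority: median_low always returns one of the observed scores
    return median_low(scores)
```

With three samples `[4, 4, 5]` the helper returns the majority score 4; with five scattered samples it falls back to the middle value instead of an arbitrary mode.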

Why LLM-as-Judge Exists

Traditional software testing compares outputs to expected values. LLM outputs don't work that way. Ask a model "explain photosynthesis" and there are thousands of correct answers at varying levels of detail, tone, and emphasis.

Human evaluation solves this — but doesn't scale. Evaluating 10,000 model responses per CI run is not feasible with human annotators. LLM-as-judge fills the gap: use a capable model (typically GPT-4, Claude, or Gemini) as an automated evaluator that produces scores correlating with human judgment.

The approach works because frontier LLMs are genuinely good at following evaluation rubrics, understanding nuance, and producing consistent scores — when prompted carefully.

Core Evaluation Dimensions

Before writing judge prompts, decide what you're measuring:

Coherence — Is the response internally consistent? Does each sentence follow logically from the previous?

Relevance — Does the response address the actual question asked?

Faithfulness — For RAG systems, does the response stay grounded in the retrieved context? Does it avoid hallucinating facts not present in the sources?

Completeness — Does the response cover all required aspects of the question?

Fluency — Is the language natural and grammatically correct?

Toxicity / Safety — Does the response contain harmful, biased, or inappropriate content?

Different use cases weight these differently. A customer support bot prioritizes relevance and faithfulness. A code generation model prioritizes correctness and completeness.
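
One way to make that weighting explicit is a weighted average over per-dimension judge scores. The sketch below is illustrative only: `weighted_score` and the weight values are assumptions, not part of any library, and it assumes every dimension is scored on the same scale.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (same scale) into one weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

# Hypothetical weights for a customer support bot: relevance and faithfulness dominate
SUPPORT_BOT_WEIGHTS = {
    "relevance": 0.4,
    "faithfulness": 0.4,
    "coherence": 0.1,
    "fluency": 0.1,
}

support_example = weighted_score(
    {"relevance": 5, "faithfulness": 4, "coherence": 3, "fluency": 5},
    SUPPORT_BOT_WEIGHTS,
)
```

Keeping the weights in one named dictionary per use case makes it easy to review and version them alongside the judge prompts.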

G-Eval: Rubric-Based Scoring

G-Eval is a framework from Liu et al. (2023) that produces calibrated numerical scores using chain-of-thought reasoning.

The G-Eval Pattern

from openai import OpenAI

client = OpenAI()

COHERENCE_PROMPT = """
You will be given a question and a response. Your task is to evaluate the coherence of the response.

Evaluation steps:
1. Read the question and response carefully.
2. Identify whether each sentence in the response follows logically from the previous.
3. Check if the response stays on topic throughout.
4. Check for contradictions within the response.
5. Assign a coherence score from 1-5:
   - 1: Very incoherent, contradictory, or unfocused
   - 2: Mostly incoherent with some logical sequences
   - 3: Moderately coherent with some gaps
   - 4: Mostly coherent with minor issues
   - 5: Perfectly coherent and well-structured

Question: {question}
Response: {response}

Evaluation (think step by step, then provide score):
"""

def evaluate_coherence(question: str, response: str, n_samples: int = 5) -> float:
    """Average the coherence score over several independently sampled judge calls."""
    scores = []

    for _ in range(n_samples):
        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": COHERENCE_PROMPT.format(
                    question=question,
                    response=response
                )
            }],
            # Sample with temperature > 0: at temperature 0 the calls return
            # near-identical outputs, so extra samples add cost but no signal.
            temperature=1.0
        )

        # Extract score from response text
        text = result.choices[0].message.content
        score = extract_score(text)
        scores.append(score)

    # Return the mean score across samples. G-Eval proper weights each candidate
    # score by its token probability, which requires logprobs=True on the request.
    return sum(scores) / len(scores)

def extract_score(text: str) -> float:
    """Extract numerical score from G-Eval output."""
    import re
    # Look for pattern like "Score: 4" or just a digit at end
    matches = re.findall(r'\b([1-5])\b(?:\s*(?:out of 5|/5))?\.?\s*$', text.strip())
    if matches:
        return float(matches[-1])
    # Fallback: find any digit 1-5 near end of text
    matches = re.findall(r'\b([1-5])\b', text[-200:])
    # NOTE: defaulting to 3.0 hides parse failures; consider logging or raising instead
    return float(matches[-1]) if matches else 3.0
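
The probability-weighted score mentioned above can be sketched as an expectation over the candidate score tokens: each score is weighted by the probability the judge assigns to it. This is a minimal sketch under stated assumptions: `probability_weighted_score` is a hypothetical helper, its input is a token-to-logprob mapping for the position where the judge emits the score (for example, built from the `top_logprobs` entries of a response), and locating that position in the API response is not shown.

```python
import math

def probability_weighted_score(
    top_logprobs: dict[str, float],
    valid_scores: tuple[int, ...] = (1, 2, 3, 4, 5),
) -> float:
    """Expected score over candidate score tokens, renormalized over valid scores only."""
    labels = {str(s): s for s in valid_scores}
    probs: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        score = labels.get(token.strip())
        if score is not None:
            # Tokens such as "4" and " 4" map to the same score, so add their mass
            probs[score] = probs.get(score, 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("no valid score token among the candidates")
    return sum(score * p for score, p in probs.items()) / total
```

Because non-score tokens are dropped and the remaining mass is renormalized, the result always lies between the lowest and highest valid score. This variant works best when the judge is asked to emit the score as a single token, so the relevant position is easy to find.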

Using DeepEval for G-Eval

The deepeval library provides production-ready G-Eval implementations:

from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# my_support_bot and rag_system below stand in for your own application objects

def test_customer_support_coherence():
    coherence_metric = GEval(
        name="Coherence",
        criteria="Determine whether the response is coherent, well-structured, "
                 "and each sentence is logically connected to the previous one.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,
        model="gpt-4o"
    )
    
    test_case = LLMTestCase(
        input="How do I reset my password?",
        actual_output=my_support_bot.respond("How do I reset my password?")
    )
    
    assert_test(test_case, [coherence_metric])

def test_rag_faithfulness():
    faithfulness_metric = GEval(
        name="Faithfulness",
        criteria="Determine whether the actual output is faithful to the retrieved context. "
                 "The output should not contain any information that cannot be inferred from the context.",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT
        ],
        threshold=0.8,
        model="gpt-4o"
    )
    
    test_case = LLMTestCase(
        input="What are our refund policies?",
        actual_output=rag_system.query("What are our refund policies?"),
        retrieval_context=rag_system.get_context("What are our refund policies?")
    )
    
    assert_test(test_case, [faithfulness_metric])

Pairwise Comparison for Model Selection

When choosing between two models or prompt variants, pairwise comparison is more reliable than absolute scoring:

PAIRWISE_PROMPT = """
You are comparing two AI responses to the same question.

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Compare these responses on:
1. Accuracy and factual correctness
2. Completeness
3. Clarity and readability
4. Helpfulness to the user

Which response is better overall? Reply with exactly one of:
- "A" if Response A is clearly better
- "B" if Response B is clearly better  
- "tie" if they are roughly equivalent

Your verdict:"""

def compare_responses(
    question: str,
    response_a: str,
    response_b: str,
    n_trials: int = 10
) -> dict:
    """
    Run pairwise comparison with position bias mitigation.
    Alternates A/B order so each response appears first in half the trials.
    """
    votes = {"a": 0, "b": 0, "tie": 0}

    for i in range(n_trials):
        # Alternate order to reduce position bias
        if i % 2 == 0:
            prompt = PAIRWISE_PROMPT.format(
                question=question,
                response_a=response_a,
                response_b=response_b
            )
            mapping = {"a": "a", "b": "b", "tie": "tie"}
        else:
            prompt = PAIRWISE_PROMPT.format(
                question=question,
                response_a=response_b,  # swapped
                response_b=response_a   # swapped
            )
            mapping = {"a": "b", "b": "a", "tie": "tie"}  # inverted mapping

        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            # Sample with temperature > 0; at 0 the repeated trials would largely
            # replay the same two verdicts (one per ordering).
            temperature=0.7
        )

        # Normalize case, quotes, and trailing punctuation before mapping
        verdict = result.choices[0].message.content.strip().strip('"\'.').lower()
        normalized = mapping.get(verdict, "tie")  # unparseable verdicts count as ties
        votes[normalized] += 1

    winner = max(votes, key=votes.get)
    if votes["a"] == votes["b"] and winner != "tie":
        winner = "tie"  # equal votes: do not let dict order pick a winner

    return {
        "winner": winner,
        "votes": votes,
        "win_rate_a": votes["a"] / n_trials,
        "win_rate_b": votes["b"] / n_trials
    }

# Use in A/B testing prompt variants
result = compare_responses(
    question="How do I cancel my subscription?",
    response_a=model_v1.generate("How do I cancel my subscription?"),
    response_b=model_v2.generate("How do I cancel my subscription?")
)
print(f"Winner: {result['winner']} (A wins {result['win_rate_a']:.0%} of trials)")
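
A stricter way to handle position bias is to judge each pair in both orders and count a win only when the same response wins both times. The sketch below is a hypothetical helper (`consistent_verdict` is not part of any library) that assumes the two verdicts have already been parsed to "A", "B", or "tie".

```python
def consistent_verdict(original: str, swapped: str) -> str:
    """
    Combine two verdicts for the same pair of responses.

    original: verdict with (response_a, response_b) shown as (A, B)
    swapped:  verdict with (response_b, response_a) shown as (A, B)
    Returns "a", "b", or "tie" in terms of the original responses.
    """
    keep = {"A": "a", "B": "b", "tie": "tie"}
    unswap = {"A": "b", "B": "a", "tie": "tie"}
    first, second = keep[original], unswap[swapped]
    # Count a win only if the same response wins regardless of position
    return first if first == second else "tie"
```

If the judge answers "A" in both orderings, it is following position and not content, and the helper reports a tie. Tracking how often that happens gives a rough estimate of how position-sensitive the judge is on your data.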

Reference-Free Evaluation

Many real applications have no ground truth to compare against. Reference-free judges evaluate quality from the question alone:

RELEVANCE_PROMPT = """
Given a question and a response, evaluate whether the response is relevant to the question.

A relevant response:
- Directly addresses what was asked
- Does not include excessive irrelevant information
- Is appropriately scoped (not too broad, not too narrow)

Question: {question}
Response: {response}

Score from 1 (completely irrelevant) to 5 (perfectly relevant).
Briefly explain your reasoning, then state your score.

Reasoning:
Score:"""

import asyncio
from openai import AsyncOpenAI

async def _evaluate_relevance_async(
    qa_pairs: list[dict],
    batch_size: int
) -> list[float]:
    aclient = AsyncOpenAI()  # one client shared by all requests

    async def evaluate_one(qa: dict) -> float:
        result = await aclient.chat.completions.create(
            model="gpt-4o-mini",  # Cheaper for bulk evaluation
            messages=[{
                "role": "user",
                "content": RELEVANCE_PROMPT.format(
                    question=qa["question"],
                    response=qa["response"]
                )
            }],
            temperature=0.0
        )
        return extract_score(result.choices[0].message.content)

    scores = []
    for i in range(0, len(qa_pairs), batch_size):
        batch = qa_pairs[i:i + batch_size]
        # Run one batch concurrently; gather preserves input order
        scores.extend(await asyncio.gather(*(evaluate_one(qa) for qa in batch)))
    return scores

def batch_evaluate_relevance(
    qa_pairs: list[dict],
    batch_size: int = 20
) -> list[float]:
    """Evaluate relevance for a batch of question-answer pairs."""
    # asyncio.run() needs a coroutine, so the gather calls live in an async helper
    return asyncio.run(_evaluate_relevance_async(qa_pairs, batch_size))
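
Fixed-size batches wait for the slowest call in each batch before starting the next one. An alternative is to cap the number of in-flight requests with a semaphore. The sketch below shows that pattern with a stand-in judge so it runs offline; `gather_bounded` and `fake_judge` are hypothetical names, and a real judge coroutine would call the API instead.

```python
import asyncio
from collections.abc import Awaitable, Callable

async def gather_bounded(
    items: list[dict],
    judge: Callable[[dict], Awaitable[float]],
    max_concurrency: int = 20,
) -> list[float]:
    """Score every item with at most max_concurrency judge calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: dict) -> float:
        async with semaphore:
            return await judge(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(item) for item in items))

async def fake_judge(item: dict) -> float:
    """Stand-in for a real API call so the sketch runs without network access."""
    await asyncio.sleep(0)
    return float(len(item["response"]) % 5 + 1)
```

Call it with `asyncio.run(gather_bounded(qa_pairs, fake_judge))`. A sensible `max_concurrency` depends on your provider's rate limits, which this sketch does not handle.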

Validating Your Judge

Before deploying a judge in CI, measure its agreement with human labels:

from sklearn.metrics import cohen_kappa_score

def validate_judge(
    golden_dataset: list[dict],  # {question, response, human_score}
    judge_fn,
    threshold: float = 0.7
) -> dict:
    """
    Measure judge reliability against human labels.
    Returns agreement metrics.
    """
    judge_scores = []
    human_scores = []
    
    for item in golden_dataset:
        judge_score = judge_fn(item["question"], item["response"])
        judge_scores.append(round(judge_score))  # Round to integer for kappa
        human_scores.append(item["human_score"])
    
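    # Unweighted kappa treats a 4-vs-5 miss the same as a 1-vs-5 miss; for ordinal
    # 1-5 scores, weights="quadratic" gives partial credit for near misses.
    kappa = cohen_kappa_score(human_scores, judge_scores)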
    exact_agreement = sum(
        1 for h, j in zip(human_scores, judge_scores) if h == j
    ) / len(golden_dataset)
    adjacent_agreement = sum(
        1 for h, j in zip(human_scores, judge_scores) if abs(h - j) <= 1
    ) / len(golden_dataset)
    
    results = {
        "cohen_kappa": kappa,
        "exact_agreement": exact_agreement,
        "adjacent_agreement": adjacent_agreement,
        "n_samples": len(golden_dataset)
    }
    
    if kappa < threshold:
        raise ValueError(
            f"Judge reliability too low: kappa={kappa:.2f} < {threshold}. "
            "Refine your judge prompt before using in CI."
        )
    
    return results

# Validate before trusting the judge
results = validate_judge(
    golden_dataset=load_human_labels("data/golden_eval.jsonl"),
    judge_fn=evaluate_coherence
)
print(f"Judge kappa: {results['cohen_kappa']:.2f}")
print(f"Adjacent agreement: {results['adjacent_agreement']:.0%}")
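
Raw agreement can look healthy even when much of it is due to chance, which is why the validation above gates on Cohen's kappa. To make the statistic concrete, here is a dependency-free sketch of the unweighted formula, kappa = (p_o - p_e) / (1 - p_e). The function name `cohen_kappa` is an assumption for illustration; in practice the scikit-learn function used above is the better choice.

```python
from collections import Counter

def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: observed fraction of exact agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if both raters labeled independently
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout (a convention chosen here)
    return (p_o - p_e) / (1 - p_e)
```

For human labels `[1, 2, 3, 4, 5, 5]` and judge labels `[1, 2, 3, 4, 5, 4]`, exact agreement is 5/6 (about 83%), but kappa works out to 23/29 (about 0.79) once chance agreement is subtracted.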

CI Integration

Add LLM-as-judge to your CI pipeline with pytest:

# tests/eval/test_model_quality.py
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset

# Load golden test cases from disk
# (keyword names follow the deepeval docs; check them against your installed version)
DATASET = EvaluationDataset()
DATASET.add_test_cases_from_json_file(
    file_path="data/eval_cases.json",
    input_key_name="question",
    # Placeholder only: the test overwrites actual_output with fresh model output
    actual_output_key_name="expected_response"
)

@pytest.mark.parametrize("test_case", DATASET.test_cases)
def test_response_quality(test_case: LLMTestCase):
    # Generate actual output from current model
    test_case.actual_output = production_model.generate(test_case.input)
    
    relevance = GEval(
        name="Relevance",
        criteria="The response directly and completely answers the question asked.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.75
    )
    
    coherence = GEval(
        name="Coherence", 
        criteria="The response is well-structured and internally consistent.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.70
    )
    
    assert_test(test_case, [relevance, coherence])

# .github/workflows/eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/model/**'
  schedule:
    - cron: '0 2 * * *'  # Nightly regression check

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      
      - name: Install dependencies
        run: pip install deepeval pytest
      
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: pytest tests/eval/ -v --tb=short
      
      - name: Upload evaluation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: .deepeval/
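
Per-case assertions fail the build whenever a single sample scores low, which can be noisy with a stochastic judge. A complementary gate compares the mean score of the whole run against a stored baseline. The sketch below is one possible shape for that check; `check_regression`, the baseline value, and the `max_drop` tolerance are all assumptions to adapt to your own pipeline.

```python
from statistics import mean

def check_regression(
    current_scores: list[float],
    baseline_mean: float,
    max_drop: float = 0.05,
) -> tuple[bool, float]:
    """Return (passed, current_mean); passed is False if the mean fell more than max_drop."""
    if not current_scores:
        raise ValueError("no scores to evaluate")
    current_mean = mean(current_scores)
    passed = current_mean >= baseline_mean - max_drop
    return passed, current_mean
```

In a CI script, exit with a non-zero status when `passed` is false, and update the stored baseline only through a reviewed change so that gradual drift stays visible.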

Common Pitfalls

Position bias: LLM judges often favor a response because of where it appears in the prompt (commonly the first slot), not because of its quality. Mitigate by evaluating both orderings and averaging the results.

Verbosity bias: Models tend to rate longer responses higher, even when a concise answer is more appropriate. Explicitly address this in your judge prompt: "Do not reward responses that are unnecessarily long."

Self-evaluation bias: A model evaluating its own outputs tends to score them higher. Use a different model family as judge than the model being evaluated.

Prompt sensitivity: Small wording changes in judge prompts can shift scores by 0.5-1 point. Freeze your judge prompts with version control and re-validate when changing them.

Missing abstain option: Forcing a 1-5 score when the judge is uncertain produces noise. Add an "insufficient information" option for ambiguous cases.
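
The abstain option from the last pitfall needs matching support in the score parser, so that an abstention is not silently turned into a number. The sketch below assumes the judge prompt asks for a final line of the form "Score: N" or "Score: insufficient information"; both that wording and the name `parse_score_or_abstain` are hypothetical.

```python
import re
from typing import Optional

# Matches the verdict line; the abstain wording must mirror your judge prompt
_VERDICT = re.compile(r"score\s*:\s*(insufficient information|[1-5]\b)", re.IGNORECASE)

def parse_score_or_abstain(text: str) -> Optional[int]:
    """Return the 1-5 score from the last 'Score:' line, or None if the judge abstained."""
    verdicts = _VERDICT.findall(text)
    if not verdicts:
        raise ValueError("no score found in judge output")
    last = verdicts[-1]
    # Anything that is not a digit here is the abstain phrase
    return int(last) if last.isdigit() else None
```

Returning `None` for abstentions and raising on unparseable output keeps the two cases separate, so abstention rates can be tracked without polluting the score averages.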

Continuous Monitoring with HelpMeTest

LLM quality isn't static — model providers update their APIs, prompts drift, and data distributions shift. HelpMeTest can run your LLM judge evaluations on a schedule, alerting you when quality metrics drop below threshold:

*** Settings ***
Library    HelpMeTest

*** Test Cases ***
Nightly LLM Quality Check
    [Documentation]    Runs G-Eval coherence check on 100 production samples
    ${samples}=    Get Recent Production Samples    count=100
    FOR    ${sample}    IN    @{samples}
        ${score}=    Evaluate Coherence    ${sample.question}    ${sample.response}
        Should Be True    ${score} >= 3.5    msg=Quality degradation detected
    END

Set this to run nightly and get alerted before users notice quality regressions. The Pro plan ($100/month) covers unlimited test runs with parallel execution — enough to evaluate hundreds of model outputs per night without managing infrastructure.

Summary

LLM-as-judge makes LLM application testing scalable:

  • G-Eval for rubric-based numerical scoring with chain-of-thought reasoning
  • Pairwise comparison for model selection and A/B testing, with order randomization to control position bias
  • Reference-free evaluation for open-ended tasks with no ground truth
  • Judge validation against human labels before deploying in CI
  • CI integration via pytest + deepeval to catch quality regressions on every PR

The key insight: your judge is only as good as your judge prompt. Invest time writing precise evaluation criteria, validate against human labels, and version control your prompts.
