LLM Hallucination Testing: Strategies to Detect Fabricated AI Outputs

HelpMeTest

24 May 2026 — 6 min read

LLMs hallucinate. They generate plausible-sounding text that is factually wrong, cite papers that don't exist, quote statistics that were never measured, and confidently describe events that never happened. If your product uses an LLM to generate content that users trust — customer support answers, medical information, legal summaries, financial data — hallucination testing is not optional.

This guide covers practical techniques for detecting hallucinations automatically, building test suites that catch them, and establishing quality thresholds that protect your users.

What Is a Hallucination?

In the context of LLMs, a hallucination is output that is stated confidently but is factually incorrect or not grounded in the provided sources. There are several categories:

Factual hallucination: Stating false facts ("The Eiffel Tower is 450 meters tall" — it's 330m)
Source hallucination: Citing a paper, study, or URL that doesn't exist
Context hallucination: Claiming the provided context says something it doesn't
Entity hallucination: Inventing people, companies, or events
Numerical hallucination: Fabricating statistics, dates, or measurements

Each category requires a different detection approach.
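
One way to make that mapping explicit is a small lookup table that a test suite can consult when deciding which checks to run. The sketch below is an assumption for illustration, not a fixed rule: the `HallucinationType` enum, the `DETECTION_APPROACHES` table, and the pairings themselves are hypothetical and should be adjusted to your product.

```python
from enum import Enum


class HallucinationType(Enum):
    FACTUAL = "factual"
    SOURCE = "source"
    CONTEXT = "context"
    ENTITY = "entity"
    NUMERICAL = "numerical"


# Hypothetical pairing of each category with the approaches in this guide
# that seem most likely to catch it. These pairings are an assumption.
DETECTION_APPROACHES = {
    HallucinationType.FACTUAL: ["fact verification", "LLM-as-judge"],
    HallucinationType.SOURCE: ["citation verification"],
    HallucinationType.CONTEXT: ["NLI faithfulness", "LLM-as-judge"],
    HallucinationType.ENTITY: ["fact verification", "consistency testing"],
    HallucinationType.NUMERICAL: ["fact verification", "consistency testing"],
}


def approaches_for(kind: HallucinationType) -> list[str]:
    """Return the names of the approaches suggested for one category."""
    return DETECTION_APPROACHES[kind]
```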

Approach 1: Fact Verification Against a Source of Truth

For domains where you have a ground-truth database (product info, prices, policies), compare LLM outputs against it:

import math
import re

def extract_claimed_prices(text: str) -> list[float]:
    """Extract dollar amounts claimed in the LLM output as floats."""
    pattern = r'\$(\d+(?:\.\d{2})?)\s*(?:per\s+month|/month|monthly)?'
    return [float(m) for m in re.findall(pattern, text)]

def test_price_claims_are_accurate(llm_client, product_db):
    # llm_client and product_db are assumed to be pytest fixtures
    query = "What does the Pro plan cost?"
    response = llm_client.complete(query)

    claimed_prices = extract_claimed_prices(response)
    actual_price = float(product_db.get_price("pro_plan"))

    assert len(claimed_prices) > 0, "Response should mention a price"
    # Compare with a tolerance rather than exact float equality
    assert any(math.isclose(p, actual_price, abs_tol=0.005) for p in claimed_prices), (
        f"LLM claimed {claimed_prices}, actual price is {actual_price}"
    )

def test_feature_claims_are_accurate(llm_client):
    query = "Does the free plan support SSO?"
    response = llm_client.complete(query)
    response_lower = response.lower()
    
    # We know SSO is NOT in the free plan
    hallucination_indicators = ["yes, the free plan", "free plan includes sso", "free plan supports sso"]
    for indicator in hallucination_indicators:
        assert indicator not in response_lower, (
            f"LLM incorrectly claimed free plan has SSO: {response}"
        )
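
The pattern in `extract_claimed_prices` stops at a thousands separator, so "$1,299" would be read as 1. The following is a minimal sketch of a more tolerant extractor, assuming prices are written in US dollar notation; `extract_prices` and `mentions_price` are hypothetical helper names, not part of any library.

```python
import math
import re

# Matches "$100", "$29.99", and "$1,299.50". Assumes US-style notation.
PRICE_PATTERN = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def extract_prices(text: str) -> list[float]:
    """Return every dollar amount found in the text as a float."""
    prices = []
    for whole, cents in PRICE_PATTERN.findall(text):
        prices.append(float(whole.replace(",", "") + cents))
    return prices


def mentions_price(text: str, expected: float, tolerance: float = 0.005) -> bool:
    """True if any extracted price is within `tolerance` of the expected price."""
    return any(
        math.isclose(price, expected, abs_tol=tolerance)
        for price in extract_prices(text)
    )
```

In a test, `assert mentions_price(response, actual_price)` could stand in for the exact-match check, at the cost of accepting any matching amount anywhere in the response.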

Approach 2: NLI-Based Faithfulness Checking

Natural Language Inference (NLI) models determine whether a hypothesis (here, a sentence from the LLM answer) is entailed by a premise (the source context). A small NLI model is typically faster and cheaper to run than an LLM judge:

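from transformers import pipeline

nli_pipeline = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-base",
)

def check_entailment(premise: str, hypothesis: str) -> dict:
    """
    Classify a premise/hypothesis pair and return {'label': ..., 'score': ...}.
    The label is upper-cased to one of ENTAILMENT, NEUTRAL, CONTRADICTION;
    the score is the model's confidence in that label.
    """
    # Pass the pair as text/text_pair so the tokenizer inserts the model's own
    # separator; a literal "[SEP]" typed into the string is not reliable.
    # Long contexts are truncated to the model's maximum input length.
    result = nli_pipeline({"text": premise, "text_pair": hypothesis}, truncation=True)
    # Depending on the transformers version, this is a dict or a one-item list
    if isinstance(result, list):
        result = result[0]
    return {"label": result["label"].upper(), "score": result["score"]}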

def test_answer_entailed_by_context(rag_pipeline):
    query = "What is the maximum file upload size?"
    context = rag_pipeline.retrieve_context(query)
    answer = rag_pipeline.generate_answer(query, context)
    
    # Check each sentence of the answer against the context
    sentences = answer.split('.')
    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence) < 10:
            continue
        
        result = check_entailment(premise=context, hypothesis=sentence)
        assert result['label'] != 'CONTRADICTION', (
            f"Answer sentence contradicts context:\n"
            f"Sentence: {sentence}\n"
            f"Context: {context}"
        )
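
Two details of the test above are worth tightening. `answer.split('.')` also splits decimals such as "25.5 MB", and checking only for CONTRADICTION lets NEUTRAL sentences through, even though those are the ones the context does not support. A minimal sketch of a stricter variant follows; `split_sentences` and `unsupported_sentences` are hypothetical helpers, and `classify` is assumed to be any callable that returns an NLI label for a premise/hypothesis pair.

```python
import re
from typing import Callable

# Split after ., ! or ? only when whitespace follows, so "25.5 MB" stays in
# one sentence. Abbreviations such as "e.g. " will still cause a split.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str, min_length: int = 10) -> list[str]:
    """Split text into sentences, dropping fragments shorter than min_length."""
    parts = (part.strip() for part in SENTENCE_BOUNDARY.split(text))
    return [part for part in parts if len(part) >= min_length]


def unsupported_sentences(
    context: str,
    answer: str,
    classify: Callable[[str, str], str],
) -> list[tuple[str, str]]:
    """Return (sentence, label) for each answer sentence not entailed by context."""
    flagged = []
    for sentence in split_sentences(answer):
        label = classify(context, sentence).upper()
        if label != "ENTAILMENT":
            flagged.append((sentence, label))
    return flagged
```

With the NLI helper from this section, `classify` could be `lambda p, h: check_entailment(p, h)["label"]`. Requiring entailment is stricter and will flag more sentences, so expect to review the NEUTRAL cases before turning them into hard failures.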

Approach 3: LLM-as-Judge

Use a separate, capable LLM to evaluate the output of your primary LLM. This is more expensive but catches nuanced hallucinations:

import openai
import json

def judge_hallucinations(
    question: str,
    context: str,
    answer: str,
    judge_model: str = "gpt-4o"
) -> dict:
    prompt = f"""You are a fact-checking judge. Evaluate whether the Answer contains any hallucinations — claims that are not supported by the Context or that contradict known facts.

Question: {question}
Context: {context}
Answer: {answer}

Respond with JSON only:
{{
  "has_hallucination": boolean,
  "hallucinations": [
    {{
      "claim": "the specific claim",
      "issue": "why it's a hallucination",
      "severity": "low|medium|high"
    }}
  ],
  "faithfulness_score": 0.0-1.0
}}"""
    
    response = openai.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def test_customer_support_bot_faithfulness(support_bot, knowledge_base):
    test_cases = [
        "How do I cancel my subscription?",
        "Can I get a refund after 60 days?",
        "What happens to my data if I downgrade?",
    ]
    
    for question in test_cases:
        context = knowledge_base.search(question)
        answer = support_bot.respond(question)
        
        judgment = judge_hallucinations(question, context, answer)
        
        high_severity = [
            h for h in judgment.get("hallucinations", [])
            if h["severity"] == "high"
        ]
        
        assert not high_severity, (
            f"High-severity hallucinations detected for '{question}':\n"
            + "\n".join(f"- {h['claim']}: {h['issue']}" for h in high_severity)
        )
        
        assert judgment["faithfulness_score"] >= 0.8, (
            f"Faithfulness score {judgment['faithfulness_score']:.2f} below threshold for '{question}'"
        )
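
A judge model can itself return malformed or out-of-range output, so it may help to validate the reply before asserting on it. The sketch below assumes the JSON shape requested in the prompt above; `parse_judgment` is a hypothetical helper, not part of the OpenAI SDK.

```python
import json

VALID_SEVERITIES = {"low", "medium", "high"}


def parse_judgment(raw: str) -> dict:
    """Parse and validate the judge's JSON reply; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Judge reply must be a JSON object")

    score = data.get("faithfulness_score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0.0 <= score <= 1.0:
        raise ValueError(f"faithfulness_score missing or out of range: {score!r}")

    hallucinations = data.get("hallucinations") or []
    if not isinstance(hallucinations, list):
        raise ValueError("hallucinations must be a list")
    for item in hallucinations:
        severity = item.get("severity") if isinstance(item, dict) else None
        if not isinstance(severity, str) or severity not in VALID_SEVERITIES:
            raise ValueError(f"Invalid hallucination entry: {item!r}")

    return {
        # Fall back to "any hallucinations listed" if the flag is absent
        "has_hallucination": bool(data.get("has_hallucination", bool(hallucinations))),
        "hallucinations": hallucinations,
        "faithfulness_score": float(score),
    }
```

Treating an invalid reply as a test error rather than a pass keeps a broken judge from silently hiding hallucinations.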

Approach 4: Citation Verification

When your LLM cites sources, verify those sources exist and actually say what's claimed:

import re
import warnings

import requests

def verify_url_exists(url: str) -> bool:
    """Return True if the URL responds with a non-error status."""
    try:
        response = requests.head(url, timeout=5, allow_redirects=True)
        if response.status_code in (403, 405):
            # Some servers reject HEAD requests; retry with GET
            response = requests.get(url, timeout=5, allow_redirects=True, stream=True)
            response.close()
        return response.status_code < 400
    except requests.RequestException:
        return False

def extract_urls(text: str) -> list[str]:
    url_pattern = r'https?://[^\s\)\"\'<>]+'
    # Strip trailing sentence punctuation that the pattern would otherwise capture
    return [url.rstrip('.,;:!?') for url in re.findall(url_pattern, text)]

def test_no_hallucinated_urls(llm_client):
    query = "What are some resources for learning about container security?"
    response = llm_client.complete(query)

    urls = extract_urls(response)

    broken_urls = [url for url in urls if not verify_url_exists(url)]

    assert not broken_urls, (
        f"LLM cited {len(broken_urls)} non-existent URLs:\n"
        + "\n".join(broken_urls)
    )

def test_academic_citations_format(llm_client):
    """Flag citation patterns that are often fabricated. Warns instead of failing."""
    query = "What studies show about AI reliability?"
    response = llm_client.complete(query)

    # Check for common hallucinated citation patterns
    suspicious_patterns = [
        r'Smith et al\. \(\d{4}\)',  # Generic academic citation
        r'Journal of AI Research, \d+',  # Real journal but may have fake volume
    ]

    for pattern in suspicious_patterns:
        matches = re.findall(pattern, response)
        for match in matches:
            # Flag for manual review or automated DOI check;
            # pytest lists warnings in its summary, unlike captured print output
            warnings.warn(f"Potential fabricated citation: {match}")
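
The automated DOI check mentioned in the comment could look roughly like this. It is a sketch under assumptions: `extract_dois` and `unresolved_dois` are hypothetical helpers, the pattern follows the commonly used shape for modern DOIs and will miss unusual ones, and the `resolves` callable is injected so the logic can be tested without network access.

```python
import re
from typing import Callable

# Common shape of a modern DOI: "10.", a 4-9 digit registrant code, "/", suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text: str) -> list[str]:
    """Return DOIs found in the text, with trailing punctuation removed."""
    # Note: stripping ")" would damage the rare DOI that really ends with one.
    return [match.rstrip(".,;:)") for match in DOI_PATTERN.findall(text)]


def unresolved_dois(text: str, resolves: Callable[[str], bool]) -> list[str]:
    """Return the DOIs in the text for which `resolves` returns False."""
    return [doi for doi in extract_dois(text) if not resolves(doi)]
```

In a real test, `resolves` might be `lambda doi: verify_url_exists(f"https://doi.org/{doi}")`, assuming the resolver returns an error status for unknown DOIs. A DOI that resolves only shows that the paper exists, not that it says what the LLM claims.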

Approach 5: Consistency Testing

Run the same query multiple times and check for contradictions. With sampling enabled, fabricated details tend to vary from run to run, whereas a well-behaved LLM should give consistent answers about facts:

def test_answer_consistency(llm_client):
    query = "What is the company's data retention policy?"

    responses = [llm_client.complete(query) for _ in range(5)]

    # Extract specific claims (e.g., retention period), keeping the unit so that
    # "30 days" and "30 months" count as different claims
    retention_pattern = r'(\d+)\s*(day|month|year)s?\b'

    claimed_periods = set()
    for response in responses:
        matches = re.findall(retention_pattern, response.lower())
        claimed_periods.update(f"{number} {unit}" for number, unit in matches)

    # Allow at most two distinct periods (a policy may state, for example, one
    # period for logs and another for backups); tighten to 1 for single-value facts
    assert len(claimed_periods) <= 2, (
        f"LLM gave inconsistent retention periods across runs: {claimed_periods}\n"
        f"Responses:\n" + "\n---\n".join(responses)
    )
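
Comparing raw strings treats "90 days" and "3 months" as different answers. A minimal sketch of normalizing durations before comparing them is shown below; `durations_in_days` and `distinct_claims` are hypothetical helpers, and the conversion factors are deliberate approximations (a month is taken as 30 days, so "12 months" and "1 year" still differ).

```python
import re

# Approximate conversion factors; an assumption for comparison purposes only.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}
DURATION_PATTERN = re.compile(r"(\d+)[\s-]*(day|week|month|year)s?\b", re.IGNORECASE)


def durations_in_days(text: str) -> set[int]:
    """Return every duration mentioned in the text, converted to days."""
    return {
        int(number) * DAYS_PER_UNIT[unit.lower()]
        for number, unit in DURATION_PATTERN.findall(text)
    }


def distinct_claims(responses: list[str]) -> set[int]:
    """Return the set of distinct durations (in days) claimed across responses."""
    claims: set[int] = set()
    for response in responses:
        claims |= durations_in_days(response)
    return claims
```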

Building a Hallucination Test Suite

Structure your hallucination tests by risk level:

# tests/hallucination/critical.py: runs on every deployment
import pytest

CRITICAL_FACTS = [
    # (query, fact_that_must_appear, fact_that_must_not_appear)
    (
        "What is the price of the Pro plan?",
        "$100",
        None,
    ),
    (
        "Is there a free trial?",
        None,
        "unlimited",  # We don't offer unlimited free trial
    ),
    (
        "What's your uptime SLA?",
        "99.9%",
        None,
    ),
]

@pytest.mark.parametrize("query, must_include, must_exclude", CRITICAL_FACTS)
def test_critical_facts(llm_client, query, must_include, must_exclude):
    response = llm_client.complete(query)
    
    if must_include:
        assert must_include in response, (
            f"Response missing critical fact '{must_include}' for query: {query}\n"
            f"Response: {response}"
        )
    
    if must_exclude:
        assert must_exclude not in response.lower(), (
            f"Response contains prohibited claim '{must_exclude}' for query: {query}\n"
            f"Response: {response}"
        )
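
A plain substring check such as `must_include in response` accepts "$1000" when the expected fact is "$100". Below is a minimal sketch of a boundary-aware check, assuming facts are short literal strings; `contains_fact` is a hypothetical helper that could replace the `in` tests above.

```python
import re


def contains_fact(response: str, fact: str) -> bool:
    """True if `fact` appears in `response` as a standalone value (case-insensitive)."""
    if not fact:
        raise ValueError("fact must be a non-empty string")
    # Guard the left edge only if the fact starts with a letter or digit:
    # reject matches preceded by a word character or by a digit plus . or ,
    prefix = r"(?<!\w)(?<!\d[.,])" if fact[0].isalnum() else ""
    # Guard the right edge only if the fact ends with a letter or digit:
    # reject matches followed by a word character or by . or , plus a digit
    suffix = r"(?!\w|[.,]\d)" if fact[-1].isalnum() else ""
    pattern = prefix + re.escape(fact) + suffix
    return re.search(pattern, response, re.IGNORECASE) is not None
```

The guards still allow sentence punctuation, so "It costs $100." matches while "$100.50" does not.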

Measuring Hallucination Rate Over Time

Track hallucination rates as a quality metric, not just a pass/fail test:

import json
from datetime import datetime, timezone
from pathlib import Path

def run_hallucination_benchmark(llm_client, test_suite: list, judge_model: str) -> dict:
    if not test_suite:
        raise ValueError("test_suite must contain at least one case")

    results = []

    for case in test_suite:
        answer = llm_client.complete(case["query"])
        judgment = judge_hallucinations(
            question=case["query"],
            context=case.get("context", ""),
            answer=answer,
            judge_model=judge_model,
        )
        results.append({
            "query": case["query"],
            "has_hallucination": judgment["has_hallucination"],
            "faithfulness_score": judgment["faithfulness_score"],
            "hallucinations": judgment.get("hallucinations", []),
        })

    hallucinated_cases = sum(1 for r in results if r["has_hallucination"])
    hallucination_rate = hallucinated_cases / len(results)
    avg_faithfulness = sum(r["faithfulness_score"] for r in results) / len(results)

    # Use one timezone-aware UTC timestamp for both the record and the filename
    now = datetime.now(timezone.utc)
    benchmark = {
        "timestamp": now.isoformat(),
        "hallucination_rate": hallucination_rate,
        "avg_faithfulness": avg_faithfulness,
        "total_cases": len(results),
        "hallucinated_cases": hallucinated_cases,
        "results": results,
    }

    # Save benchmark for trend tracking, creating the directory if needed
    output_dir = Path("benchmarks")
    output_dir.mkdir(exist_ok=True)
    output_path = output_dir / f"hallucination_{now.strftime('%Y%m%d_%H%M%S')}.json"
    with open(output_path, "w") as f:
        json.dump(benchmark, f, indent=2)

    return benchmark

def test_hallucination_rate_acceptable(llm_client, production_test_suite):
    benchmark = run_hallucination_benchmark(llm_client, production_test_suite, "gpt-4o-mini")
    
    assert benchmark["hallucination_rate"] <= 0.05, (
        f"Hallucination rate {benchmark['hallucination_rate']:.1%} exceeds 5% threshold"
    )
    assert benchmark["avg_faithfulness"] >= 0.90, (
        f"Average faithfulness {benchmark['avg_faithfulness']:.2f} below 0.90 threshold"
    )
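
The saved benchmark files can also be compared with each other, so that a gradual rise is caught before it crosses the absolute threshold. The sketch below assumes the file naming and the `timestamp` and `hallucination_rate` keys written by `run_hallucination_benchmark`; `load_rates`, `regressed`, and the default window and margin are hypothetical choices.

```python
import json
from pathlib import Path


def load_rates(benchmark_dir: str) -> list[tuple[str, float]]:
    """Return (timestamp, hallucination_rate) pairs, oldest first."""
    rows = []
    for path in Path(benchmark_dir).glob("hallucination_*.json"):
        data = json.loads(path.read_text())
        rows.append((data["timestamp"], data["hallucination_rate"]))
    # ISO 8601 timestamps in the same timezone sort correctly as strings
    return sorted(rows)


def regressed(
    rates: list[tuple[str, float]],
    window: int = 5,
    max_increase: float = 0.02,
) -> bool:
    """True if the latest rate exceeds the mean of the previous runs by max_increase."""
    if len(rates) < 2:
        return False  # Nothing to compare against yet
    latest = rates[-1][1]
    previous = [rate for _, rate in rates[:-1][-window:]]
    baseline = sum(previous) / len(previous)
    return latest - baseline > max_increase
```

A CI job could call `regressed(load_rates("benchmarks"))` after each benchmark run and fail or alert when it returns True.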

Prompting to Reduce Hallucinations

Testing catches hallucinations; good prompting reduces them. Include these patterns in your system prompts and test that the model follows them:

ANTI_HALLUCINATION_SYSTEM_PROMPT = """
You are a helpful assistant. Follow these rules:
1. Only state facts that are in the provided context.
2. If you don't know something, say "I don't have information about that."
3. Never cite sources, URLs, or statistics that you aren't certain exist.
4. Do not estimate or guess numerical values — only state what's in the context.
"""

def test_model_declines_unknown_questions(llm_client):
    """LLM should decline to answer questions outside its knowledge."""
    out_of_scope_queries = [
        "What is our competitor's pricing?",
        "How many employees does our company have?",
        "What was our Q3 revenue?",
    ]
    
    for query in out_of_scope_queries:
        response = llm_client.complete(
            query,
            system_prompt=ANTI_HALLUCINATION_SYSTEM_PROMPT,
            context="[Company product documentation — no employee or financial data]"
        )
        
        decline_indicators = [
            "don't have information",
            "not in my context",
            "i cannot",
            "i don't know",
            "no information available",
        ]
        
        declined = any(indicator in response.lower() for indicator in decline_indicators)
        assert declined, (
            f"LLM should have declined to answer '{query}' but responded:\n{response}"
        )
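
Substring matching on decline phrases is brittle: models often emit typographic apostrophes ("don’t") or slightly different wording. Here is a minimal sketch of a more forgiving check, assuming English responses; `is_decline` and the pattern list are hypothetical and will need tuning against real outputs.

```python
import re

# Hypothetical decline phrasings, written against normalized lowercase text.
DECLINE_PATTERNS = [
    r"\bdon't have (?:any |that )?information\b",
    r"\bnot in (?:my|the) context\b",
    r"\bi (?:cannot|can't)\b",
    r"\bi don't know\b",
    r"\bno information available\b",
]


def normalize(text: str) -> str:
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    text = text.replace("\u2019", "'").replace("\u2018", "'").lower()
    return " ".join(text.split())


def is_decline(response: str) -> bool:
    """True if the response matches any known decline phrasing."""
    text = normalize(response)
    return any(re.search(pattern, text) for pattern in DECLINE_PATTERNS)
```

Phrase matching can still misfire in both directions, so an LLM judge may be a better fit when declines are phrased very freely.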

Hallucination testing is an ongoing process, not a one-time audit. LLM behavior drifts as models are updated, prompts change, and new use cases emerge. Run your hallucination test suite in CI, track the rate over time, and set alert thresholds so you catch regressions before your users find them.
