Testing LLMs for Hallucinations: Methods and Tools

LLM hallucinations — confidently stated falsehoods — are one of the hardest failure modes to catch because they look identical to correct outputs. This guide covers the practical techniques developers use to detect hallucinations in production systems and reduce their frequency.

Key Takeaways

Hallucinations are silent failures. A crashed request is easy to detect. A hallucinated fact looks like a successful response and gets shipped to users.

Ground your outputs whenever possible. RAG (Retrieval-Augmented Generation) dramatically reduces factual hallucinations by giving the model source material to cite. Groundedness checks verify the output actually used that material.

Consistency testing catches unstable claims. Run the same factual prompt multiple times. If the model gives different answers, at least one is wrong, so flag the whole set of answers for review.

LLM-as-judge for factuality requires a reference. Asking one LLM to evaluate another's factual accuracy only works if the judge has access to a ground-truth source. Without it, judges can confidently validate hallucinations.

Log and monitor. You cannot catch all hallucinations in CI. Sample production outputs continuously and alert on drops in factual consistency scores.

What Hallucinations Actually Are

In the context of LLMs, "hallucination" covers a few distinct failure modes that are often conflated:

  • Factual hallucination: The model states something false with confidence — a wrong date, a nonexistent company, a made-up statistic.
  • Intrinsic hallucination: The output contradicts the source material provided (e.g., a document the model was asked to summarize).
  • Extrinsic hallucination: The output contains claims that can't be verified from any provided source — not necessarily wrong, but not grounded.
  • Self-contradiction: The model says different things about the same fact within one response or across repeated queries.

Testing approaches differ depending on which type you're targeting. Most production AI applications need to address all four.

Why Standard Testing Misses Hallucinations

The problem is that hallucinated outputs are syntactically correct. They parse cleanly, pass length checks, match format requirements, and often have high semantic similarity to correct answers. A response saying "The company was founded in 1987" is structurally identical to one saying "The company was founded in 1997" — but one may be fabricated.

This means you need tests that go beyond structure and format into factual verification.

Method 1: Groundedness Checks

When your system uses RAG or otherwise provides source documents to the model, groundedness checks verify that claims in the output can be traced back to the source.

The approach: for each claim in the output, check whether the source documents contain supporting text. This can be done with:

  • Extractive matching: Use string search or fuzzy matching to find overlapping phrases between output and source.
  • NLI (Natural Language Inference): A classifier that determines whether the source text entails the output claim.
  • LLM-as-judge with grounding: Provide both source and output to a judge LLM and ask "Does the output contain any claims not supported by the source?"

Open-source NLI models like cross-encoder/nli-deberta-v3-base work well for sentence-level groundedness checks without API costs.

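from transformers import pipeline

# Cross-encoder NLI model; its labels are contradiction / entailment / neutral.
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def check_groundedness(source: str, claim: str, threshold: float = 0.8) -> bool:
    # Pass premise and hypothesis as a text pair so the tokenizer inserts the
    # separator itself, rather than pasting a literal "[SEP]" into one string.
    result = nli({"text": source, "text_pair": claim})
    if isinstance(result, list):  # return shape depends on input type and version
        result = result[0]
    return result["label"].lower() == "entailment" and result["score"] > threshold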

Groundedness checking doesn't prove factual accuracy — if your source document is wrong, grounded claims are still wrong. But it catches the most common hallucination pattern: claims invented by the model that have no basis in the provided context.
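
Extractive matching, the cheapest option in the list above, can be sketched with the standard library alone. The following is a minimal illustration, not a production checker: the helper names, the naive sentence splitter, and the 0.6 similarity cutoff are all assumptions chosen for the example, and paraphrased claims will score low even when they are grounded.

```python
import re
from difflib import SequenceMatcher

def split_sentences(text: str) -> list[str]:
    # Naive splitter on ., !, ? followed by whitespace (assumption: swap in a
    # real sentence tokenizer for production text).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def best_overlap(claim: str, source: str) -> float:
    # Highest character-level similarity between the claim and any source sentence.
    claim_norm = claim.lower()
    return max(
        (SequenceMatcher(None, claim_norm, s.lower()).ratio()
         for s in split_sentences(source)),
        default=0.0,
    )

def ungrounded_sentences(output: str, source: str, min_ratio: float = 0.6) -> list[str]:
    # Output sentences with no sufficiently similar sentence in the source.
    return [s for s in split_sentences(output) if best_overlap(s, source) < min_ratio]
```

A filter like this is best treated as a first pass: sentences it flags can then go to the slower NLI or judge-based checks.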

Method 2: Consistency Testing

If a factual claim is true, the model should assert it consistently across multiple independent queries. If the model gives different answers to the same factual question, at least one answer is wrong.

Self-consistency probing:

  1. Ask the same factual question 5–10 times with temperature > 0.
  2. Extract the specific claim from each response.
  3. Check if the claim is consistent across runs.
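  4. If variance is high, flag for human review.

import collections

# call_llm(prompt) -> str and extract_claim(text) -> str are placeholders for
# your own model client and claim extractor; they are not defined here.
def consistency_probe(prompt: str, runs: int = 7) -> dict:
    # Sample the model repeatedly and reduce each response to a comparable claim.
    answers = [extract_claim(call_llm(prompt)) for _ in range(runs)]
    freq = collections.Counter(answers)
    majority, count = freq.most_common(1)[0]
    return {
        "majority_answer": majority,
        "consistency_rate": count / runs,  # share of runs agreeing with the majority
        "all_answers": answers,
    }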

A consistency rate below 80% on factual questions is a red flag. It means either the model is genuinely uncertain, or the prompt is ambiguous enough to elicit different framings of the fact.

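This technique works well for numerical facts, dates, named entities, and binary yes/no questions. It's less effective for open-ended questions where variability is expected. It also cannot catch an error the model repeats on every run: high consistency shows the answer is stable, not that it is correct.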
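
Counting answers only works if equivalent answers compare equal, so extracted claims usually need normalizing before they are tallied. The sketch below is one possible approach under simple assumptions (lowercasing, removing thousands separators, trimming edge punctuation); normalize_claim and consistency_rate are illustrative names, not part of any library.

```python
import re
from collections import Counter

def normalize_claim(raw: str) -> str:
    # Lowercase, drop thousands separators inside numbers, collapse whitespace,
    # and strip punctuation at the edges.
    text = raw.strip().lower()
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip(" .,!?;:\"'")

def consistency_rate(raw_answers: list[str]) -> tuple[str, float]:
    # Majority answer and the share of runs that agree with it.
    # Assumes raw_answers is non-empty.
    counts = Counter(normalize_claim(a) for a in raw_answers)
    majority, count = counts.most_common(1)[0]
    return majority, count / len(raw_answers)
```

For example, consistency_rate(["42,000", "42000", " 42000.", "41,000"]) treats the first three answers as the same claim and reports a rate of 0.75, which falls under the 80% red-flag line mentioned above.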

Method 3: Factual Verification Against a Knowledge Base

For applications where factual accuracy is critical — medical, legal, financial — you need to verify outputs against an authoritative knowledge base.

This means:

  1. Extract specific claims from the output (entity extraction + relation extraction).
  2. Query your knowledge base for those claims.
  3. Flag claims not found in the knowledge base or contradicting it.

This is the most robust approach but also the most expensive to build. You need a structured KB with good coverage of your domain, and extraction pipelines that can parse model outputs into verifiable claims.

Research metrics like FActScore (Min et al., 2023) formalize this: they decompose an output into atomic facts and report the share of those facts supported by a knowledge source, which in the original work is Wikipedia. For custom domains, you'd replace Wikipedia with your own data source.
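
Steps 2 and 3 above can be sketched against a toy knowledge base. This assumes claims have already been extracted as (subject, relation, object) triples, which is the hard part and is not shown; the KB contents, the Verdict names, and the exact-match comparison are all hypothetical simplifications.

```python
from enum import Enum

class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    NOT_FOUND = "not_found"

# Hypothetical toy KB: (subject, relation) -> expected object value.
KB = {
    ("acme corp", "founded_year"): "1987",
    ("acme corp", "headquarters"): "springfield",
}

def verify_claim(subject: str, relation: str, obj: str, kb: dict = KB) -> Verdict:
    # Look up the (subject, relation) pair and compare the claimed object.
    key = (subject.strip().lower(), relation.strip().lower())
    if key not in kb:
        return Verdict.NOT_FOUND
    if kb[key] == obj.strip().lower():
        return Verdict.SUPPORTED
    return Verdict.CONTRADICTED

def flag_claims(claims: list[tuple[str, str, str]], kb: dict = KB) -> list[tuple[tuple, Verdict]]:
    # Keep every claim that is missing from the KB or contradicts it.
    flagged = []
    for claim in claims:
        verdict = verify_claim(*claim, kb=kb)
        if verdict is not Verdict.SUPPORTED:
            flagged.append((claim, verdict))
    return flagged
```

Keeping NOT_FOUND separate from CONTRADICTED matters in practice: the first usually points to a coverage gap in the KB, the second to a likely hallucination.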

Method 4: Citation Verification

A lightweight form of factual checking: require the model to cite sources for claims, then verify those sources exist and say what the model claims.

Prompt the model to always output citations in a verifiable format:

Always end factual claims with a citation in the format [Source: <document_title>, <section>].

Then in your test harness:

  1. Parse citations from the output.
  2. Retrieve the cited document and section.
  3. Check that the section actually supports the claim.

This works especially well in enterprise RAG applications where all source documents are known and retrievable.
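
The first two harness steps can be sketched with a regular expression over the citation format shown above. This is a rough illustration: it assumes document titles contain no commas, and the corpus layout (title mapped to a dict of section text) is a hypothetical stand-in for your document store. Step 3, checking that the section supports the claim, could reuse a groundedness check such as check_groundedness from Method 1.

```python
import re

# Matches [Source: <document_title>, <section>]; assumes no comma in the title.
CITATION_RE = re.compile(r"\[Source:\s*(?P<title>[^,\]]+),\s*(?P<section>[^\]]+)\]")

def parse_citations(output: str) -> list[tuple[str, str]]:
    # Return (title, section) pairs in the order they appear.
    return [(m["title"].strip(), m["section"].strip())
            for m in CITATION_RE.finditer(output)]

def missing_citations(output: str, corpus: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    # Citations whose document or section does not exist in the corpus.
    return [(title, section)
            for title, section in parse_citations(output)
            if section not in corpus.get(title, {})]
```

A citation that points at a nonexistent document or section is a strong hallucination signal on its own, before any semantic check runs.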

Method 5: Adversarial Prompting

Actively try to elicit hallucinations in your test suite. If you can make the model hallucinate under controlled conditions, you understand its failure modes.

Common adversarial patterns:

  • Ask about obscure facts: "What was the revenue of Acme Corp in Q3 2019?" The model has no reliable data and may invent a number.
  • Contradict the context: Provide a document stating fact X, then ask "Is it true that [not-X]?" Hallucination-prone models will sometimes agree with the question.
  • Request specific details: "List the 7 specific features of Product X" — models often invent extra details when a specific count is demanded.
  • Use leading questions: "The study published in Nature in 2022 showed Y was caused by Z, right?" Models often confirm without verification.

Building an adversarial test set helps you understand which question types and domains produce the most hallucinations in your specific model + prompt combination.
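
One way to organize such a test set is as a list of cases tagged by pattern, each with its own pass/fail check, so failure rates can be compared across categories. The sketch below assumes the model is exposed as a plain callable from prompt to response; AdversarialCase, abstains, run_suite, and the marker phrases are illustrative choices, and substring matching is a crude proxy for detecting abstention.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical phrases treated as the model declining to answer.
ABSTENTION_MARKERS = ("i don't know", "i'm not certain", "cannot verify")

@dataclass
class AdversarialCase:
    category: str                  # e.g. "obscure_fact", "leading_question"
    prompt: str
    passes: Callable[[str], bool]  # returns True when the response is acceptable

def abstains(response: str) -> bool:
    # Crude check: does the response contain any abstention phrase?
    text = response.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)

def run_suite(cases: list[AdversarialCase], model: Callable[[str], str]) -> dict[str, float]:
    # Failure rate per category.
    totals: dict[str, int] = {}
    failures: dict[str, int] = {}
    for case in cases:
        totals[case.category] = totals.get(case.category, 0) + 1
        if not case.passes(model(case.prompt)):
            failures[case.category] = failures.get(case.category, 0) + 1
    return {cat: failures.get(cat, 0) / n for cat, n in totals.items()}
```

Tracking the per-category rate over time shows whether a prompt or model change moved the failure rate for a specific pattern rather than only the overall average.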

Monitoring in Production

Offline testing catches known failure patterns. Production monitoring catches unknown ones.

What to log for every LLM response:

  • The input prompt and retrieved context (if RAG)
  • The full model output
  • Any structured data extracted from the output
  • Timestamps and model version
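
A log record covering the fields above might look like the following. The class and field names are assumptions for illustration; adapt them to whatever logging schema you already use.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class LLMLogRecord:
    prompt: str
    output: str
    model_version: str
    # Retrieved passages, empty when RAG is not in use.
    retrieved_context: list[str] = field(default_factory=list)
    # Structured data extracted from the output, if any.
    extracted: dict = field(default_factory=dict)
    # UTC timestamp in ISO 8601 format, set when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # One JSON object per response, suitable for line-delimited logs.
        return json.dumps(asdict(self), ensure_ascii=False)
```

Storing the retrieved context alongside the output is what makes later groundedness scoring possible without re-running retrieval.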

What to run asynchronously on a sample:

  • Groundedness scores (if RAG is in use)
  • Consistency checks on high-stakes factual claims
  • LLM-as-judge scoring with a factuality rubric
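
Choosing which responses land in the sample can be done deterministically by hashing a request identifier, so the same request is always in or out regardless of which worker sees it. This is a sketch under the assumption that each response has a stable request_id; the function name and the 5% default rate are illustrative.

```python
import hashlib

def in_sample(request_id: str, rate: float = 0.05) -> bool:
    # Map the request ID to a 64-bit integer and keep it if it falls in the
    # lowest `rate` fraction of the range.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```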

Alert thresholds:

  • Mean groundedness score drops below 0.75 over a rolling 100-sample window
  • Consistency rate falls below 70% on any factual probe cluster
  • Negative user feedback (thumbs down, corrections, abandonment after reading) rises above its usual baseline
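
The rolling-window threshold can be sketched with a fixed-size deque. The class name and defaults below are assumptions that mirror the first threshold above; the alert only fires once the window is full, to avoid noisy alerts on the first few samples.

```python
from collections import deque

class RollingAlert:
    def __init__(self, window: int = 100, threshold: float = 0.75):
        self.scores = deque(maxlen=window)  # oldest score is dropped automatically
        self.threshold = threshold

    def add(self, score: float) -> bool:
        # Record a score; return True when the window is full and its mean
        # is below the threshold.
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold
```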

Reducing Hallucinations: What Actually Works

Testing finds hallucinations; these techniques reduce them:

  • Use RAG over parametric memory: Ground every factual claim in retrieved documents.
  • Lower temperature for factual tasks: Temperature 0 is more consistent, though not hallucination-free.
  • Ask for uncertainty: Prompt the model to say "I don't know" or "I'm not certain" rather than confabulate. Many models will comply if explicitly instructed.
  • Verify before synthesize: Break complex factual tasks into retrieval + verification + synthesis steps.
  • Prefer smaller, fine-tuned models over general ones for narrow domains: a model fine-tuned on your domain often hallucinates less about that domain, though this is worth confirming with your own test set.
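
The grounding and "ask for uncertainty" points can be combined in a single prompt template. The wording and function name below are illustrative assumptions, not a tested prompt; whether a given model follows the instruction should be checked with the methods described earlier.

```python
# Hypothetical instruction text; tune the wording for your model.
ABSTAIN_INSTRUCTION = (
    "Answer using only the numbered passages below. "
    "If they do not contain the answer, reply exactly: I don't know."
)

def build_grounded_prompt(question: str, passages: list[str]) -> str:
    # Number the passages so the model (and your tests) can refer to them.
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"{ABSTAIN_INSTRUCTION}\n\nPassages:\n{numbered}\n\nQuestion: {question}"
```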

No technique eliminates hallucinations entirely. Testing and monitoring are the only way to know how often they're occurring in your system.
