Testing LlamaIndex RAG Pipelines: Retrieval Accuracy, Context Quality, Hallucination Detection

HelpMeTest

17 May 2026

LlamaIndex is a popular framework for building RAG (Retrieval-Augmented Generation) pipelines. Testing these pipelines requires verifying three distinct components: the retriever (is it finding relevant documents?), the context (is the retrieved content accurate?), and the generator (is the answer faithful to the context and free of hallucination?). This guide covers all three.

Key Takeaways

Test the retriever and generator separately. A hallucinating answer might be the generator's fault or the retriever's fault — a poorly retrieved context forces the model to invent. Isolate the layers.

Retrieval accuracy is measurable with precision and recall. Build a labeled dataset of questions + relevant documents and evaluate retriever performance against it.

Use LlamaIndex's built-in evaluators. FaithfulnessEvaluator, RelevancyEvaluator, and CorrectnessEvaluator give you structured scores without building eval infrastructure from scratch.

Mock the LLM in unit tests, use real LLMs for eval. Unit tests verify pipeline mechanics. Eval tests measure output quality — they need a real (or realistic) LLM.

Test indexing too. The quality of your vector index affects everything downstream. Verify documents are indexed correctly before testing retrieval.

LlamaIndex RAG Pipeline Architecture

A LlamaIndex RAG pipeline has these components:

Documents
  ↓ (ingestion)
Document Transformations (chunking, metadata extraction)
  ↓
Embedding Model → Vector Store (index)
  ↓ (query time)
Query → Embedding → Vector Search → Retrieved Nodes
  ↓
Response Synthesis (LLM generates answer from nodes)
  ↓
Final Response

Each component has testable properties:

Ingestion: Are documents chunked correctly? Is metadata preserved?
Retrieval: Are relevant documents returned for the query?
Context: Is retrieved content sufficient and accurate?
Generation: Is the answer faithful to the context?
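
One way to act on this breakdown is to record a per-layer verdict for each failing query and attribute the failure to the earliest layer that broke. The sketch below is illustrative only: `LayerVerdicts` and `attribute_failure` are hypothetical names, not part of LlamaIndex, and the boolean flags are assumed to come from checks like the ones later in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerVerdicts:
    """Hypothetical per-query results, one flag per pipeline layer."""
    ingested: bool   # the source document produced at least one node
    retrieved: bool  # at least one relevant node was returned
    faithful: bool   # the answer is supported by the retrieved context


def attribute_failure(verdicts: LayerVerdicts) -> Optional[str]:
    """Return the earliest failing layer, or None if every check passed."""
    if not verdicts.ingested:
        return "ingestion"
    if not verdicts.retrieved:
        return "retrieval"
    if not verdicts.faithful:
        return "generation"
    return None


# A bad answer built on bad context is blamed on retrieval, not generation.
print(attribute_failure(LayerVerdicts(ingested=True, retrieved=False, faithful=False)))
```

Checking layers in pipeline order avoids blaming the generator for an answer it could not have got right with the context it was given.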

Testing Indexing and Ingestion

Document Chunking

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
import pytest

def test_document_chunking_preserves_content():
    """Verify chunking doesn't lose document content."""
    document = Document(
        text="LlamaIndex is a framework for LLM applications. " * 100,  # Long document
        metadata={"source": "docs/intro.md"}
    )

    splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)
    nodes = splitter.get_nodes_from_documents([document])

    # Join the chunks back together (overlapping text appears more than once)
    all_text = " ".join(node.text for node in nodes)
    assert "LlamaIndex is a framework" in all_text
    assert len(nodes) > 1, "Long document should produce multiple chunks"

def test_chunk_size_respected():
    """Chunks should stay close to the configured token limit."""
    # Set chunk_overlap explicitly: the splitter's default overlap (200 tokens)
    # is larger than chunk_size=128 and is rejected at construction time.
    splitter = SentenceSplitter(chunk_size=128, chunk_overlap=20)
    document = Document(text="This is a test sentence. " * 100)
    nodes = splitter.get_nodes_from_documents([document])

    for node in nodes:
        # Approximate token count (rough: 1 token ≈ 4 chars), so leave headroom above 128
        approx_tokens = len(node.text) / 4
        assert approx_tokens <= 200, f"Chunk too large: ~{approx_tokens:.0f} tokens"

def test_metadata_preserved_through_chunking():
    """Document metadata should be present on all derived nodes."""
    document = Document(
        text="Important document content. " * 50,
        metadata={"source": "report.pdf", "author": "Alice", "page": 1}
    )

    splitter = SentenceSplitter(chunk_size=128, chunk_overlap=20)  # default overlap (200) exceeds chunk_size
    nodes = splitter.get_nodes_from_documents([document])

    for node in nodes:
        assert "source" in node.metadata, "Source metadata missing from node"
        assert node.metadata["author"] == "Alice"
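
The tests above configure a chunk overlap but never check that consecutive chunks actually share text. A minimal sketch of such a check follows, using plain strings in place of node text. `overlap_length` is a hypothetical helper, and real splitters may normalize whitespace at chunk boundaries, so treat an exact-match comparison as a starting point rather than a strict rule.

```python
def overlap_length(prev_chunk: str, next_chunk: str) -> int:
    """Length of the longest suffix of prev_chunk that is also a prefix of next_chunk."""
    max_len = min(len(prev_chunk), len(next_chunk))
    for size in range(max_len, 0, -1):
        if prev_chunk[-size:] == next_chunk[:size]:
            return size
    return 0


# Illustrative chunks: the first pair shares "gamma", the second pair shares nothing.
chunks = ["alpha beta gamma", "gamma delta epsilon", "zeta eta"]
overlaps = [overlap_length(a, b) for a, b in zip(chunks, chunks[1:])]
print(overlaps)  # [5, 0]
```

In a real test, `chunks` would be `[node.text for node in nodes]`, and the assertion would require a non-zero overlap wherever two neighbouring chunks come from the same document.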

Index Construction

from llama_index.core import VectorStoreIndex, Document
from llama_index.core.embeddings import MockEmbedding

def test_index_built_from_documents():
    """Verify the index holds at least one node per source document."""
    docs = [
        Document(text="HelpMeTest is a testing platform.", metadata={"id": "1"}),
        Document(text="It supports Robot Framework and Playwright.", metadata={"id": "2"}),
        Document(text="Pricing starts at $100/month for Pro.", metadata={"id": "3"}),
    ]

    # MockEmbedding returns fixed vectors, so no embedding API call or model download is needed
    embed_model = MockEmbedding(embed_dim=8)
    index = VectorStoreIndex.from_documents(docs, embed_model=embed_model)

    assert len(index.docstore.docs) >= 3  # At least one node per document
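
The index test stubs out the embedding model, and the same idea applies to the LLM when you only want to verify pipeline mechanics. The sketch below uses `unittest.mock` against a hypothetical `answer_question` function that stands in for your own glue code. Its `retrieve` and `complete` method names are assumptions chosen for illustration. LlamaIndex also provides a `MockLLM` class in `llama_index.core.llms` that can play a similar role inside a real query engine.

```python
from unittest.mock import MagicMock


def answer_question(retriever, llm, query: str) -> str:
    """Hypothetical glue code: retrieve context, then ask the LLM to answer from it."""
    chunks = retriever.retrieve(query)
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.complete(prompt)


def test_pipeline_passes_retrieved_context_to_llm():
    retriever = MagicMock()
    retriever.retrieve.return_value = ["Pro costs $100/month."]
    llm = MagicMock()
    llm.complete.return_value = "The Pro plan is $100 per month."

    answer = answer_question(retriever, llm, "How much is Pro?")

    # Mechanics only: the right calls happened with the right inputs.
    retriever.retrieve.assert_called_once_with("How much is Pro?")
    prompt = llm.complete.call_args.args[0]
    assert "Pro costs $100/month." in prompt
    assert answer == "The Pro plan is $100 per month."
```

A test like this says nothing about answer quality. It only confirms that retrieved text reaches the prompt, which is the kind of regression a unit test can catch cheaply.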

Testing the Retriever

Precision and Recall Measurement

Build a labeled dataset where you know which documents are relevant to each query:

import pytest
from llama_index.core import VectorStoreIndex, Document
from dataclasses import dataclass

@dataclass
class RetrievalTestCase:
    query: str
    relevant_doc_ids: set[str]

RETRIEVAL_TEST_CASES = [
    RetrievalTestCase(
        query="What is the return policy?",
        relevant_doc_ids={"doc-returns-policy", "doc-customer-service"}
    ),
    RetrievalTestCase(
        query="How do I integrate with CI/CD?",
        relevant_doc_ids={"doc-ci-integration", "doc-github-actions"}
    ),
    RetrievalTestCase(
        query="What programming languages are supported?",
        relevant_doc_ids={"doc-sdk-overview", "doc-python-sdk", "doc-js-sdk"}
    ),
]

def compute_precision_recall(retrieved_ids: set, relevant_ids: set) -> tuple[float, float]:
    if not retrieved_ids:
        return 0.0, 0.0
    true_positives = len(retrieved_ids & relevant_ids)
    precision = true_positives / len(retrieved_ids)
    recall = true_positives / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def test_retriever_precision_recall():
    """Retriever achieves minimum precision and recall on labeled test cases."""
    index = load_production_index()  # Placeholder: your own helper that loads the built index
    retriever = index.as_retriever(similarity_top_k=5)

    precisions, recalls = [], []
    for test_case in RETRIEVAL_TEST_CASES:
        nodes = retriever.retrieve(test_case.query)
        # Several chunks can share a doc_id; drop nodes that have no doc_id at all
        retrieved_ids = {node.node.metadata.get("doc_id") for node in nodes} - {None}

        p, r = compute_precision_recall(retrieved_ids, test_case.relevant_doc_ids)
        precisions.append(p)
        recalls.append(r)

    avg_precision = sum(precisions) / len(precisions)
    avg_recall = sum(recalls) / len(recalls)

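    # With similarity_top_k=5 and only 2-3 relevant documents per query, precision tops out
    # near 0.4-0.6 unless several retrieved chunks share a doc_id; tune k or this threshold.
    assert avg_precision >= 0.7, f"Avg precision {avg_precision:.2f} below 0.70"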
    assert avg_recall >= 0.6, f"Avg recall {avg_recall:.2f} below 0.60"
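
Set-based precision and recall ignore where in the ranking a relevant document appears. Two common rank-aware complements are hit rate (did any relevant document make the top k?) and reciprocal rank (how high was the first relevant one?). The sketch below computes both in plain Python. The function names and the `doc-faq` and `doc-pricing` IDs are made up for illustration.

```python
from typing import Sequence, Set


def hit_rate(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1.0 if any retrieved document is relevant, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# The first relevant document is ranked second, so the reciprocal rank is 1/2.
ranked = ["doc-faq", "doc-returns-policy", "doc-pricing"]
relevant = {"doc-returns-policy", "doc-customer-service"}
print(hit_rate(ranked, relevant), reciprocal_rank(ranked, relevant))  # 1.0 0.5
```

Averaging `reciprocal_rank` over all labeled queries gives the mean reciprocal rank (MRR). Note that `ranked_ids` must keep the retriever's order, so build it as a list rather than a set.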

Testing Top-K Relevance

def test_top_result_is_most_relevant():
    """The highest-ranked retrieved document should be the most relevant."""
    index = load_production_index()
    retriever = index.as_retriever(similarity_top_k=3)

    nodes = retriever.retrieve("What is the return policy?")

    assert len(nodes) > 0, "No documents retrieved"
    top_doc = nodes[0].node.text.lower()
    assert "return" in top_doc or "refund" in top_doc, \
        f"Top result not about returns: {top_doc[:200]}"

def test_retriever_filters_by_metadata():
    """Metadata filters correctly narrow the retrieval scope."""
    from llama_index.core.vector_stores.types import MetadataFilters, ExactMatchFilter

    index = load_production_index()
    retriever = index.as_retriever(
        similarity_top_k=5,
        filters=MetadataFilters(
            filters=[ExactMatchFilter(key="category", value="pricing")]
        )
    )

    nodes = retriever.retrieve("How much does it cost?")
    for node in nodes:
        assert node.node.metadata.get("category") == "pricing", \
            f"Retrieved node not in pricing category: {node.node.metadata}"

Testing Response Quality with LlamaIndex Evaluators

LlamaIndex ships with built-in evaluators that use an LLM-as-judge pattern:

from llama_index.core import VectorStoreIndex
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    RetrieverEvaluator,
)
from llama_index.llms.openai import OpenAI
import pytest

@pytest.fixture
def eval_llm():
    return OpenAI(model="gpt-4o-mini", temperature=0)

def test_response_faithfulness(eval_llm):
    """LLM response should not hallucinate facts not in the retrieved context."""
    evaluator = FaithfulnessEvaluator(llm=eval_llm)

    index = load_production_index()
    query_engine = index.as_query_engine()

    response = query_engine.query("What is the Pro plan price?")

    eval_result = evaluator.evaluate_response(
        query="What is the Pro plan price?",
        response=response
    )

    assert eval_result.passing, \
        f"Faithfulness check failed: {eval_result.feedback}"

def test_response_relevancy(eval_llm):
    """Response should be relevant to the question asked."""
    evaluator = RelevancyEvaluator(llm=eval_llm)

    index = load_production_index()
    query_engine = index.as_query_engine()

    response = query_engine.query("Does HelpMeTest support Playwright?")

    eval_result = evaluator.evaluate_response(
        query="Does HelpMeTest support Playwright?",
        response=response
    )

    assert eval_result.passing, \
        f"Relevancy check failed: {eval_result.feedback}"

def test_response_correctness_with_ground_truth(eval_llm):
    """Response correctness against known ground truth."""
    evaluator = CorrectnessEvaluator(llm=eval_llm)

    index = load_production_index()
    query_engine = index.as_query_engine()

    response = query_engine.query("How much does HelpMeTest Pro cost per month?")

    eval_result = evaluator.evaluate_response(
        query="How much does HelpMeTest Pro cost per month?",
        response=response,
        reference="HelpMeTest Pro costs $100 per month."
    )

    assert eval_result.score >= 3.0, \
        f"Correctness score {eval_result.score}/5.0 below threshold. Feedback: {eval_result.feedback}"

Batch Evaluation with Datasets

For systematic evaluation across many examples:

from llama_index.core.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms.openai import OpenAI
import asyncio

async def run_batch_evaluation():
    """Evaluate the RAG pipeline across a full test dataset."""
    index = load_production_index()
    query_engine = index.as_query_engine()
    llm = OpenAI(model="gpt-4o-mini", temperature=0)

    questions = [
        "What is the return policy?",
        "How do I integrate with GitHub Actions?",
        "What is the free plan limit?",
        "Does HelpMeTest support parallel test execution?",
        "How do I set up health monitoring?",
    ]

    evaluators = {
        "faithfulness": FaithfulnessEvaluator(llm=llm),
        "relevancy": RelevancyEvaluator(llm=llm),
    }

    runner = BatchEvalRunner(evaluators, workers=4)
    eval_results = await runner.aevaluate_queries(
        query_engine,
        queries=questions
    )

    # Summarize results
    for metric_name, results in eval_results.items():
        pass_rate = sum(1 for r in results if r.passing) / len(results)
        print(f"{metric_name}: {pass_rate:.0%} pass rate")
        assert pass_rate >= 0.8, f"{metric_name} pass rate {pass_rate:.0%} below 80%"

def test_batch_evaluation():
    asyncio.run(run_batch_evaluation())
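
A pass rate tells you that something failed but not which queries to look at. The sketch below groups failing queries by metric. `StubResult` is a hypothetical stand-in that carries only the `query` and `passing` fields, on the assumption that the real evaluation results expose fields with those names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StubResult:
    """Stand-in for an evaluation result; only the fields used below."""
    query: str
    passing: Optional[bool]


def failing_queries(eval_results: Dict[str, List[StubResult]]) -> Dict[str, List[str]]:
    """Map each metric name to the queries that did not pass it."""
    return {
        metric: [r.query for r in results if not r.passing]
        for metric, results in eval_results.items()
    }


results = {
    "faithfulness": [StubResult("What is the return policy?", True),
                     StubResult("What is the free plan limit?", False)],
    "relevancy": [StubResult("What is the return policy?", True),
                  StubResult("What is the free plan limit?", True)],
}
print(failing_queries(results))
```

Including this mapping in the assertion message makes a failed CI run point straight at the queries that need attention.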

Testing Hallucination

Hallucination testing verifies that the model doesn't add facts not present in the retrieved context:

from llama_index.core import VectorStoreIndex, Document
from llama_index.core.evaluation import FaithfulnessEvaluator

def test_no_hallucination_on_unknown_topic():
    """When documents don't contain relevant info, the response should admit uncertainty."""
    # Index only contains pricing information
    docs = [
        Document(text="HelpMeTest Pro costs $100/month with unlimited tests."),
    ]
    index = VectorStoreIndex.from_documents(docs)
    query_engine = index.as_query_engine()

    # Ask about something not in the index
    response = query_engine.query("What is the company's founding date?")

    # The response should NOT invent a founding date
    response_text = str(response).lower()
    assert "founded" not in response_text or "don't have" in response_text or "not available" in response_text, \
        f"Possible hallucination detected: {response_text}"

def test_faithfulness_score_across_dataset(eval_llm):
    """Average faithfulness score across dataset should exceed threshold."""
    evaluator = FaithfulnessEvaluator(llm=eval_llm)
    index = load_production_index()
    query_engine = index.as_query_engine()

    test_queries = load_eval_dataset()  # list of {"query": str} dicts
    scores = []

    for item in test_queries:
        response = query_engine.query(item["query"])
        result = evaluator.evaluate_response(
            query=item["query"],
            response=response
        )
        scores.append(1.0 if result.passing else 0.0)

    avg_faithfulness = sum(scores) / len(scores)
    assert avg_faithfulness >= 0.85, \
        f"Average faithfulness {avg_faithfulness:.2f} below 0.85 threshold"
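
Keyword checks and LLM judges can be complemented by a cheap deterministic heuristic: every number in the answer should also appear somewhere in the retrieved context. The sketch below is one such check. `ungrounded_numbers` is a hypothetical helper, the sample answer with a launch year is invented for the demo, and the heuristic will miss paraphrased figures (for example "one hundred"), so it is a tripwire rather than a replacement for the faithfulness evaluator.

```python
import re
from typing import List, Set

# Matches integers and simple decimals or thousands groups such as 100, 9.99, 1,000
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(answer: str, contexts: List[str]) -> Set[str]:
    """Return numbers that appear in the answer but in none of the context strings."""
    context_numbers: Set[str] = set()
    for ctx in contexts:
        context_numbers.update(_NUMBER.findall(ctx))
    return set(_NUMBER.findall(answer)) - context_numbers


contexts = ["HelpMeTest Pro costs $100/month with unlimited tests."]
print(ungrounded_numbers("Pro costs $100 per month.", contexts))                     # set()
print(ungrounded_numbers("Pro costs $100 per month and launched in 2019.", contexts))  # {'2019'}
```

In a LlamaIndex test, the context strings would typically come from the response's source nodes, and the assertion would require the returned set to be empty.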

Performance Testing

RAG queries chain an embedding call, a vector search, and at least one LLM call, so latency adds up quickly. Test that retrieval and generation stay within acceptable latency bounds:

import time

def test_query_latency():
    """Each end-to-end query should finish within 10 seconds, with P95 under 8 seconds."""
    index = load_production_index()
    query_engine = index.as_query_engine()

    queries = [
        "What is the return policy?",
        "How do I set up CI integration?",
        "What browsers are supported?",
    ]

    latencies = []
    for query in queries:
        start = time.perf_counter()
        response = query_engine.query(query)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        assert elapsed < 10.0, f"Query '{query}' took {elapsed:.1f}s (>10s limit)"

    p95 = sorted(latencies)[int(len(latencies) * 0.95)]
    print(f"P95 latency: {p95:.2f}s")
    assert p95 < 8.0, f"P95 latency {p95:.2f}s exceeds 8s"

def test_retrieval_latency():
    """Retrieval (without generation) should be fast."""
    index = load_production_index()
    retriever = index.as_retriever(similarity_top_k=5)

    start = time.perf_counter()
    nodes = retriever.retrieve("What is the pricing?")
    elapsed = time.perf_counter() - start

    assert elapsed < 1.0, f"Retrieval took {elapsed:.2f}s (>1s limit)"
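
With only a handful of samples, a percentile is easy to get subtly wrong. The sketch below shows the nearest-rank method in plain Python. `percentile_nearest_rank` is a hypothetical helper name and the latency values are made up.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies = [0.8, 1.2, 0.9, 4.0, 1.1]  # seconds, illustrative values
print(percentile_nearest_rank(latencies, 50))  # 1.1
print(percentile_nearest_rank(latencies, 95))  # 4.0
```

With fewer than 20 samples the 95th percentile is simply the slowest query, so a P95 assertion only adds information beyond the per-query limit once the query set is reasonably large.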

CI Integration

# .github/workflows/llamaindex-eval.yml
name: LlamaIndex RAG Evaluation
on:
  push:
    paths:
      - 'src/rag/**'
      - 'data/documents/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly evaluation

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install llama-index pytest
      - run: pytest tests/rag/test_chunking.py tests/rag/test_retriever.py -v

  eval-tests:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - run: pip install llama-index pytest pytest-timeout  # pytest-timeout provides --timeout
      - run: pytest tests/rag/test_evaluation.py -v --timeout=300
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

HelpMeTest Integration

HelpMeTest adds end-to-end testing to your LlamaIndex application, verifying the complete user experience:

*** Test Cases ***
RAG Chat Returns Relevant Answer
    Go To  https://app.example.com/ask
    Input Text  id=question  What is the return policy?
    Click Button  id=ask-btn
    Wait Until Page Contains Element  css:.answer-text  timeout=15s
    ${answer}=  Get Text  css:.answer-text
    Should Not Be Empty  ${answer}
    Should Contain  ${answer}  30 days

RAG Chat Shows Source Documents
    Go To  https://app.example.com/ask
    Input Text  id=question  How do I enable CI integration?
    Click Button  id=ask-btn
    Wait Until Page Contains  Sources:  timeout=15s
    Page Should Contain Element  css:.source-document

No Answer For Out-Of-Scope Questions
    Go To  https://app.example.com/ask
    Input Text  id=question  What is the population of France?
    Click Button  id=ask-btn
    Wait Until Page Contains Element  css:.answer-text  timeout=15s
    ${answer}=  Get Text  css:.answer-text
    Should Contain Any  ${answer}  don't have  not available  outside my knowledge

Conclusion

Testing LlamaIndex RAG pipelines well requires a three-layer strategy: unit tests for pipeline mechanics (chunking, metadata, retriever logic), offline eval tests with LlamaIndex's built-in evaluators for quality measurement, and end-to-end tests for the complete user experience.

The built-in evaluators (FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator) provide the most value for minimal setup, so start there. Build your labeled retrieval dataset next, since retrieval quality is often the largest single determinant of answer quality: a generator cannot be faithful to context it never received. Add latency tests when performance becomes a concern.

Run unit tests on every push and evaluation tests weekly or before major releases. This gives you fast feedback on mechanical regressions while catching quality regressions before they reach production users.