Arize Phoenix: LLM Observability and Testing Guide (2026)

HelpMeTest

23 May 2026

LLM applications fail in ways that traditional monitoring misses entirely. A REST API either returns 200 or it doesn't. An LLM can return 200 and still produce hallucinated facts, broken reasoning chains, or responses that drift from your prompt instructions over time. Arize Phoenix exists to make that invisible failure visible.

This guide covers how to instrument your LLM application with Phoenix, what to look for in the traces, and how to integrate evaluation into your CI pipeline so regressions don't reach production.

What Arize Phoenix Does

Phoenix is an open-source observability platform built specifically for LLM applications. It provides:

Tracing — captures every LLM call, retrieval step, and tool invocation as a structured span tree
Evaluation — runs LLM-as-judge and deterministic evals against your traces
Dataset management — stores curated examples for regression testing
Experiment tracking — compares prompt versions, model versions, and retrieval configs

Phoenix runs locally (no cloud account needed) or connects to the hosted Arize platform. For most teams, the local version is sufficient for development and CI.

Installation and Setup

pip install arize-phoenix openai openinference-instrumentation-openai openinference-instrumentation-langchain opentelemetry-sdk opentelemetry-exporter-otlp

Start the Phoenix server:

python -m phoenix.server.main serve

This launches the UI at http://localhost:6006. Now instrument your application:

import phoenix as px
from phoenix.otel import register
from opentelemetry import trace

# Register Phoenix as the OTLP trace collector
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

tracer = trace.get_tracer(__name__)
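In CI or staging the collector rarely lives on localhost, so the endpoint is usually better read from configuration than hard-coded. The following is a minimal sketch, assuming you want to derive the traces URL from the `PHOENIX_COLLECTOR_ENDPOINT` environment variable. The helper `resolve_trace_endpoint` and its fallback behavior are hypothetical, not part of Phoenix.

```python
import os

# Default used when no environment override is present (matches the local setup above).
DEFAULT_BASE_URL = "http://localhost:6006"


def resolve_trace_endpoint(env=None):
    """Return the OTLP/HTTP traces URL for the Phoenix collector.

    `env` is any mapping of environment variables; it defaults to os.environ
    and is injectable so the function can be tested without touching globals.
    """
    env = os.environ if env is None else env
    # Treat an unset or empty variable the same way: fall back to localhost.
    base = (env.get("PHOENIX_COLLECTOR_ENDPOINT") or DEFAULT_BASE_URL).rstrip("/")
    # Accept either a bare base URL or one that already ends in /v1/traces.
    if base.endswith("/v1/traces"):
        return base
    return f"{base}/v1/traces"
```

The result can be passed as the `endpoint` argument to `register(...)`, which keeps local runs and CI runs on the same code path.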

Tracing LLM Calls

Phoenix uses OpenTelemetry spans. If you're using OpenAI, the OpenInference auto-instrumentation records each call for you:

from openinference.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Now every OpenAI call is automatically traced
import openai
client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document: ..."}]
)

The trace in Phoenix will show token counts, latency, model parameters, and the full prompt/response pair. For RAG pipelines built with LangChain, instrument the framework as well so retrieval steps appear in the same trace:

from openinference.instrumentation.langchain import LangChainInstrumentor

LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

Phoenix supports LangChain, LlamaIndex, DSPy, and direct OpenAI/Anthropic calls through OpenInference instrumentation packages, each installed separately.

Manual Spans for Custom Logic

When your pipeline has custom steps — reranking, business logic, preprocessing — add manual spans:

with tracer.start_as_current_span("document-reranker") as span:
    span.set_attribute("reranker.model", "cohere-rerank-v3")
    span.set_attribute("reranker.input_count", len(candidates))

    reranked = rerank_documents(query, candidates)

    span.set_attribute("reranker.output_count", len(reranked))
    # Guard against an empty result before reading the top score
    if reranked:
        span.set_attribute("reranker.top_score", reranked[0].score)

This gives you end-to-end visibility: you can see how retrieval quality affects generation quality within the same trace.
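One way to keep span bookkeeping testable is to compute the attributes in a pure function and attach them in a single call. This is a sketch under assumptions: `ScoredDoc` and `reranker_span_attributes` are hypothetical names, and the attribute keys simply mirror the ones used in the manual span example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Union

AttrValue = Union[str, int, float]


@dataclass
class ScoredDoc:
    """Hypothetical stand-in for whatever your reranker returns."""
    text: str
    score: float


def reranker_span_attributes(
    candidates: Sequence[str],
    reranked: Sequence[ScoredDoc],
    model: str,
) -> Dict[str, AttrValue]:
    """Build the attribute dict for a reranker span without touching a tracer."""
    attrs: Dict[str, AttrValue] = {
        "reranker.model": model,
        "reranker.input_count": len(candidates),
        "reranker.output_count": len(reranked),
    }
    # Only record a top score when the reranker returned something.
    if reranked:
        attrs["reranker.top_score"] = float(reranked[0].score)
    return attrs


attrs = reranker_span_attributes(
    candidates=["doc a", "doc b", "doc c"],
    reranked=[ScoredDoc("doc b", 0.91), ScoredDoc("doc a", 0.40)],
    model="cohere-rerank-v3",
)
```

Inside the `with` block, the dict can be attached with OpenTelemetry's `span.set_attributes(attrs)`. Keeping the logic out of the span context means the empty-result case can be unit tested without a running collector.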

Running Evaluations

Traces alone tell you what happened. Evaluations tell you whether it was good. Phoenix ships with built-in evaluators:

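import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)
from phoenix.session.evaluation import get_qa_with_reference

# Pull question/answer pairs and their retrieved context from Phoenix.
# run_evals expects "input", "output", and "reference" columns, which
# get_qa_with_reference builds from traces that include retriever spans.
client = px.Client()
qa_df = get_qa_with_reference(client, project_name="my-llm-app")

# Configure the judge model
eval_model = OpenAIModel(model="gpt-4o")

# Build the evaluators
hallucination_eval = HallucinationEvaluator(eval_model)
qa_eval = QAEvaluator(eval_model)

# run_evals returns one DataFrame per evaluator, in the order given
hallucination_results, qa_results = run_evals(
    dataframe=qa_df,
    evaluators=[hallucination_eval, qa_eval],
    provide_explanation=True,
)

print(hallucination_results.head())
print(qa_results.head())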

The evaluators use an LLM as a judge to classify each response. The hallucination eval checks whether the answer is supported by the retrieved context. The QA eval checks whether the answer correctly addresses the question.
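To make the LLM-as-judge idea concrete, here is a minimal sketch of the pattern: build a prompt from the question, context, and answer, ask a judge model for a one-word verdict, and snap the reply onto a fixed set of labels. This is not Phoenix's prompt or implementation. The template, the `judge` callable, and the `"unparseable"` fallback are all assumptions made for illustration.

```python
from typing import Callable

# The only labels the judge is allowed to produce.
RAILS = ("factual", "hallucinated")

# Hypothetical prompt; real evaluators use more carefully tuned templates.
PROMPT_TEMPLATE = (
    "You are checking an answer against reference text.\n"
    "Question: {question}\n"
    "Reference: {context}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: factual or hallucinated."
)


def snap_to_rail(raw: str) -> str:
    """Normalize the judge's reply and map it onto a known label."""
    cleaned = raw.strip().lower().strip(".\"'")
    return cleaned if cleaned in RAILS else "unparseable"


def judge_hallucination(
    question: str,
    context: str,
    answer: str,
    judge: Callable[[str], str],
) -> str:
    """Classify one answer. `judge` is any function that sends a prompt to a model."""
    prompt = PROMPT_TEMPLATE.format(question=question, context=context, answer=answer)
    return snap_to_rail(judge(prompt))
```

Passing the judge as a callable lets you swap in a stub for unit tests and a real model client elsewhere. Treating replies outside the rails as unparseable, rather than guessing, keeps malformed judge output from silently counting as a pass.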

Building a Test Dataset

One-time evaluation isn't enough. You need a curated dataset of inputs with known-good outputs that you can run on every deployment.

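import pandas as pd
import phoenix as px

px_client = px.Client()

# Create a dataset in Phoenix. input_keys and output_keys tell Phoenix
# which columns are inputs and which hold the expected output.
dataset = px_client.upload_dataset(
    dataset_name="customer-support-regression",
    dataframe=pd.DataFrame([
        {
            "input": "How do I reset my password?",
            "expected_output": "Go to Settings > Security > Reset Password",
            "context": "Password reset documentation...",
        },
        {
            "input": "What are your pricing plans?",
            "expected_output": "We offer Free, Pro ($100/mo), and Enterprise tiers",
            "context": "Pricing page content...",
        },
    ]),
    input_keys=["input", "context"],
    output_keys=["expected_output"],
)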
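A regression dataset is only as useful as its rows are clean. A small validation pass before upload, sketched here with the hypothetical helper `validate_examples`, can catch missing fields and duplicate inputs. The required keys simply mirror the columns used in the dataset above.

```python
from typing import Dict, Iterable, List

# Columns every example is expected to carry (taken from the dataset above).
REQUIRED_KEYS = ("input", "expected_output", "context")


def validate_examples(rows: Iterable[Dict[str, object]]) -> List[str]:
    """Return human-readable problems; an empty list means the rows look usable."""
    problems: List[str] = []
    seen_inputs = set()
    for i, row in enumerate(rows):
        # Every required field must be a non-empty string.
        for key in REQUIRED_KEYS:
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or empty '{key}'")
        # Flag inputs that repeat, ignoring case and surrounding whitespace.
        raw_input = row.get("input")
        normalized = raw_input.strip().lower() if isinstance(raw_input, str) else ""
        if normalized:
            if normalized in seen_inputs:
                problems.append(f"row {i}: duplicate input")
            seen_inputs.add(normalized)
    return problems
```

Running this on the list of dicts before building the DataFrame gives a quick failure message instead of a confusing eval result later.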

Then run your pipeline against this dataset:

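import openai
from phoenix.experiments import run_experiment

# Separate name so the OpenAI client is not confused with the Phoenix client
openai_client = openai.OpenAI()

def my_pipeline(example):
    # example.input holds the dataset's input columns for this row
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful support agent."},
            {"role": "user", "content": example.input["input"]},
        ]
    )
    return response.choices[0].message.content

# hallucination_eval and qa_eval come from the evaluation step above. This
# assumes run_experiment accepts them as-is; if it does not, wrap each one
# in a plain function evaluator.
experiment = run_experiment(
    dataset=dataset,
    task=my_pipeline,
    evaluators=[hallucination_eval, qa_eval],
    experiment_name="v2-prompt-test",
)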

Phoenix stores each experiment result, so you can compare prompt v1 vs v2 side by side.
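Outside the UI, the same comparison can be scripted. The sketch below assumes you have already exported per-example scores for two experiments into plain dicts keyed by example id (how you export them is left open), and `find_regressions` is a hypothetical helper rather than a Phoenix function.

```python
from typing import Dict, List, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[Tuple[str, float, float]]:
    """Return (example_id, old_score, new_score) for examples that got worse.

    Examples missing from the candidate run are skipped rather than counted
    as regressions; the largest drops are listed first.
    """
    regressions = []
    for example_id, old_score in baseline.items():
        new_score = candidate.get(example_id)
        if new_score is None:
            continue
        if old_score - new_score > tolerance:
            regressions.append((example_id, old_score, new_score))
    # Sort by score change, most negative (biggest drop) first.
    return sorted(regressions, key=lambda r: r[2] - r[1])


v1_scores = {"reset-password": 1.0, "pricing": 1.0, "refund": 0.0}
v2_scores = {"reset-password": 1.0, "pricing": 0.0, "refund": 1.0}
regressed = find_regressions(v1_scores, v2_scores)
```

An aggregate pass rate can hide this kind of trade: here v2 fixes one example and breaks another, so the mean is unchanged while `regressed` still names the example that got worse.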

CI Integration

Add this to your CI pipeline to catch regressions before deployment:

# .github/workflows/llm-eval.yml
name: LLM Regression Tests

on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Start Phoenix
        run: |
          pip install arize-phoenix openai
          python -m phoenix.server.main serve &
          sleep 5
      
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/run_evals.py --fail-below 0.85
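A fixed `sleep 5` can be too short on a slow runner and wasted time on a fast one. One alternative is to poll the server until it answers. The helper below is a sketch using only the standard library; `wait_for_http` is a hypothetical name, and treating any non-5xx response as ready is an assumption you may want to tighten.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout_s=60.0, interval_s=1.0, request_timeout_s=2.0):
    """Poll `url` until it responds without a server error, or give up.

    Returns True once the server answers with a status below 500,
    False if `timeout_s` elapses first.
    """
    # Bypass any proxy settings: the target is a local service.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener.open(url, timeout=request_timeout_s) as resp:
                if resp.status < 500:
                    return True
        except urllib.error.HTTPError as exc:
            # 4xx still proves the server is up and routing requests.
            if exc.code < 500:
                return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Connection refused, reset, or timed out: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In the workflow, a small script could call `wait_for_http("http://localhost:6006")` after starting Phoenix in the background and exit non-zero if it returns False, so a server that never comes up fails the job immediately instead of surfacing later as a confusing eval error.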

# scripts/run_evals.py
import argparse
import sys
import phoenix as px
from phoenix.evals import HallucinationEvaluator, run_evals, OpenAIModel

parser = argparse.ArgumentParser()
parser.add_argument("--fail-below", type=float, default=0.85)
args = parser.parse_args()

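# Run your pipeline against the test dataset
# ... (same as above)
# `results` must be the DataFrame for a single evaluator with a numeric
# "score" column. run_evals returns a list with one DataFrame per evaluator,
# so pick the one you want to gate on.

# Check pass rate
pass_rate = results["score"].mean()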
print(f"Eval pass rate: {pass_rate:.2%}")

if pass_rate < args.fail_below:
    print(f"FAIL: pass rate {pass_rate:.2%} below threshold {args.fail_below:.2%}")
    sys.exit(1)
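One caveat with gating on a bare mean: if no rows were scored, the mean of an empty column is NaN, and `NaN < threshold` is False, so the check passes silently. A slightly more defensive gate, sketched here with the hypothetical helper `gate`, requires a minimum number of scored examples before it will pass.

```python
from typing import Iterable, Optional, Tuple


def gate(
    scores: Iterable[Optional[float]],
    threshold: float,
    min_samples: int = 1,
) -> Tuple[bool, float, str]:
    """Decide whether an eval run passes.

    Returns (passed, pass_rate, message). None scores (for example,
    unparseable judge output) are dropped before counting.
    """
    usable = [float(s) for s in scores if s is not None]
    if len(usable) < min_samples:
        return (
            False,
            0.0,
            f"FAIL: only {len(usable)} scored examples, need at least {min_samples}",
        )
    pass_rate = sum(usable) / len(usable)
    if pass_rate < threshold:
        return (
            False,
            pass_rate,
            f"FAIL: pass rate {pass_rate:.2%} below threshold {threshold:.2%}",
        )
    return True, pass_rate, f"PASS: pass rate {pass_rate:.2%}"
```

In the script, the score column can be passed in as a list, and the process can exit with status 1 whenever the first element of the result is False.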

What to Monitor in Production

Once deployed, track these signals from your Phoenix traces:

Latency p99 — LLM calls that take more than 5 seconds hurt UX
Token usage — sudden spikes indicate prompt injection or runaway loops
Hallucination rate — run evals on a sample of production traces daily
Retrieval precision — if your RAG pipeline returns irrelevant chunks, generation quality drops
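The first two signals are easy to compute from exported span data. The following is a sketch, assuming you already have latencies in seconds and per-request token counts as plain lists. The nearest-rank percentile method and the 3x-median spike rule are illustrative choices, not Phoenix defaults.

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def token_spike(history: Sequence[float], current: float, factor: float = 3.0) -> bool:
    """True when `current` exceeds `factor` times the median of recent usage."""
    if not history:
        return False
    return current > factor * statistics.median(history)


latencies_s = [0.8, 1.2, 0.9, 6.5]
p99 = percentile(latencies_s, 99)
too_slow = p99 > 5.0  # 5-second budget from the list above
```

Using the median rather than the mean as the baseline keeps one earlier outlier from raising the bar for what counts as a spike.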

Phoenix exports metrics in Prometheus format, so you can feed them into your existing Grafana dashboards.

When Phoenix Fits (and When It Doesn't)

Phoenix is excellent for:

RAG pipelines where you need to trace retrieval + generation together
Teams that want local-first, open-source tooling
Comparing prompt or model versions systematically

It's less suited for:

Real-time production alerting (Arize's hosted platform is better for that)
Non-LLM ML models (use MLflow or Weights & Biases instead)
Teams that need SOC 2 compliance without hosting themselves

Connecting to End-to-End Testing

LLM observability catches model-level regressions, but your users interact with a full application: authentication, UI, API layers, and the LLM backend all working together. Tools like HelpMeTest handle the end-to-end layer — testing that the user-facing flow still works when you swap models or update prompts — while Phoenix handles the LLM internals.

Using both gives you full coverage: Phoenix tells you why a response was wrong; end-to-end tests tell you whether the feature still works from the user's perspective.

Summary

Arize Phoenix gives LLM engineers the visibility that traditional monitoring can't provide. Instrument with one line using OpenAI or LangChain auto-instrumentation, build a regression dataset from your best examples, and gate deployments on eval pass rate in CI. Combine it with end-to-end testing to cover both model behavior and user-facing correctness.