
LangSmith for LLM Tracing and Evaluation

HelpMeTest

16 May 2026 — 7 min read

LangSmith gives you production observability for LLM applications — full request traces, cost tracking, latency breakdowns, and human annotation queues. Combined with its evaluation layer, you can compare prompt versions, run automated evaluators, and catch regressions before they reach users.

Why Tracing Matters for LLM Applications

An LLM application isn't a single function call. It's a pipeline: retrieve context, format prompt, call model, parse output, maybe call tools, retry on failure. When something goes wrong, you need to know exactly where in that pipeline it broke.

Without tracing, debugging looks like this:

User says "the answer was wrong"
You check the final output
You don't know if retrieval failed, the prompt was malformed, or the model hallucinated

With LangSmith tracing, you see every step — inputs, outputs, latency, token usage, and cost — for every request, in a searchable UI.

Setup

pip install langsmith langchain langchain-openai

Set environment variables:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__...
export LANGCHAIN_PROJECT=my-app  # optional, organizes traces

That's it. If you're using LangChain, tracing is automatic from this point forward.
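With tracing switched on purely through the environment, a missing variable tends to show up only as traces that never arrive. A minimal sketch of a startup check follows; the helper name `missing_tracing_vars` is an assumption made for illustration and is not part of LangSmith.

```python
import os

# Variable names come from the setup step above; LANGCHAIN_PROJECT is optional.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_tracing_vars(env=None):
    """Return the required tracing variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_tracing_vars()
if missing:
    print("Tracing is not fully configured; missing:", ", ".join(missing))
else:
    print("Tracing variables are set.")
```

Calling this once at application startup, or as the first step of a CI job, makes a misconfigured environment visible before you go looking for traces.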

Automatic Tracing with LangChain

Once env vars are set, every LangChain call is traced:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a support agent for HelpMeTest. Answer concisely."),
    ("human", "{question}")
])

chain = prompt | llm

# This run is automatically traced in LangSmith
response = chain.invoke({"question": "What does HelpMeTest Pro cost?"})
print(response.content)

Open your LangSmith project — you'll see the trace with the full prompt, model response, latency, and token count.

Manual Tracing for Non-LangChain Code

Not using LangChain? Trace any Python code with the traceable decorator:

from langsmith import traceable
import openai

client = openai.OpenAI()

@traceable(name="retrieve-context")
def retrieve_context(query: str) -> list[str]:
    # Your retrieval logic (vector_db is a placeholder for your own vector store client)
    results = vector_db.search(query, top_k=5)
    return [r.text for r in results]

@traceable(name="generate-answer")
def generate_answer(question: str, context: list[str]) -> str:
    context_str = "\n".join(context)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context_str}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

@traceable(name="rag-pipeline")
def answer_question(question: str) -> str:
    context = retrieve_context(question)
    return generate_answer(question, context)

# Full pipeline trace — each nested @traceable creates a span
result = answer_question("What is HelpMeTest's free plan limit?")

In LangSmith, you'll see:

rag-pipeline (423ms)
├── retrieve-context (156ms)
└── generate-answer (267ms)
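To make the nesting less magical, here is a toy sketch of how a decorator can build that kind of tree by tracking the active span in a context variable. It is an illustration of the general technique only, not LangSmith's implementation; `toy_traceable` and the three stub functions are invented for the example.

```python
import contextvars
import functools
import time

# Toy illustration only: this is NOT how LangSmith is implemented.
_current_span = contextvars.ContextVar("current_span", default=None)


def toy_traceable(name):
    """Record each call as a span nested under whichever span is active."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {"name": name, "children": [], "ms": None}
            if parent is not None:
                parent["children"].append(span)
            token = _current_span.set(span)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                span["ms"] = (time.perf_counter() - start) * 1000
                _current_span.reset(token)
                if parent is None:
                    wrapper.last_trace = span  # the root span holds the whole tree
        wrapper.last_trace = None
        return wrapper
    return decorator


@toy_traceable("retrieve-context")
def retrieve(query):
    return ["doc about " + query]


@toy_traceable("generate-answer")
def generate(question, context):
    return f"Answer to {question!r} using {len(context)} document(s)"


@toy_traceable("rag-pipeline")
def pipeline(question):
    return generate(question, retrieve(question))


pipeline("pricing")
trace = pipeline.last_trace
print(trace["name"], [child["name"] for child in trace["children"]])
```

The outer call becomes the root span and each decorated function it calls attaches itself as a child, which is why decorating every pipeline step gives you the per-step breakdown.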

Tracing with Context Metadata

Add metadata to filter and search traces:

from langsmith import traceable

@traceable(
    name="support-chatbot",
    tags=["production", "support"],
    metadata={"version": "2.1", "user_tier": "pro"}
)
def handle_support_request(user_id: str, question: str) -> str:
    # ...
    pass

Filter by tag or metadata in the LangSmith UI to compare production vs. staging, or Pro vs. Free users.

Creating Datasets

LangSmith's evaluation layer starts with datasets — collections of inputs and expected outputs.

From Scratch

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    dataset_name="HelpMeTest Support QA",
    description="Golden QA pairs for support chatbot evaluation"
)

# Add examples
examples = [
    {
        "inputs": {"question": "What does HelpMeTest Pro cost?"},
        "outputs": {"answer": "HelpMeTest Pro costs $100/month with unlimited tests and parallel execution."}
    },
    {
        "inputs": {"question": "Does HelpMeTest support self-hosting?"},
        "outputs": {"answer": "No, HelpMeTest is a cloud-hosted SaaS. Self-hosting is not available."}
    },
    {
        "inputs": {"question": "What testing frameworks does HelpMeTest use?"},
        "outputs": {"answer": "HelpMeTest uses Robot Framework with Playwright for browser automation."}
    },
    {
        "inputs": {"question": "How does health monitoring work?"},
        "outputs": {"answer": "Use the helpmetest CLI: helpmetest health <name> <grace_period>. Grace periods include 30s, 5m, 2h, 1d."}
    },
]

client.create_examples(
    inputs=[e["inputs"] for e in examples],
    outputs=[e["outputs"] for e in examples],
    dataset_id=dataset.id
)

From Production Traces

Capture real user interactions from production and add them to a dataset:

# Reuses `client` and `dataset` from the previous snippet.
# Get recent production runs. There is no status filter here, because
# filtering on success would exclude the failed runs checked for below.
runs = client.list_runs(
    project_name="production",
    run_type="chain",
    filter='gt(total_tokens, 100)',
    limit=100
)

# Add failures and "I don't know" answers to the dataset
for run in runs:
    if run.error or "I don't know" in str(run.outputs):
        client.create_examples(
            inputs=[run.inputs],
            outputs=[run.outputs],
            dataset_id=dataset.id
        )

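This lets production failures grow your dataset with little manual effort. Keep in mind that the outputs captured this way are the answers that failed, so correct them before treating them as reference outputs.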
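Production traffic repeats itself, so a loop like the one above can add the same question many times. One way to handle that, sketched here under the assumption that every example carries a `question` input, is to filter candidates against the questions already in the dataset before uploading. The helper names and the sample question about exporting results are hypothetical.

```python
def normalize_question(text):
    """Lowercase and collapse whitespace so near-identical questions match."""
    return " ".join(text.lower().split())


def select_new_examples(candidates, existing_questions):
    """Keep candidates whose question is not already in the dataset.

    `candidates` is a list of {"inputs": {...}, "outputs": {...}} dicts,
    the same shape used when building the dataset from scratch.
    """
    seen = {normalize_question(q) for q in existing_questions}
    selected = []
    for candidate in candidates:
        key = normalize_question(candidate["inputs"]["question"])
        if key not in seen:
            seen.add(key)
            selected.append(candidate)
    return selected


# Hypothetical data: one near-duplicate of an existing question, one repeated new question.
existing = ["What does HelpMeTest Pro cost?"]
candidates = [
    {"inputs": {"question": "what does  HelpMeTest pro cost?"}, "outputs": {"answer": "I don't know"}},
    {"inputs": {"question": "Can I export test results?"}, "outputs": {"answer": "I don't know"}},
    {"inputs": {"question": "Can I export test results?"}, "outputs": {"answer": "I don't know"}},
]
new_examples = select_new_examples(candidates, existing)
print(len(new_examples))
```

Only the first occurrence of the genuinely new question survives; the result can then be passed to `create_examples` in the same way as the hand-written examples.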

Running Evaluations

Once you have a dataset, evaluate your application against it:

from langchain_openai import ChatOpenAI
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

# The function to evaluate — takes dataset inputs, returns outputs
def predict(inputs: dict) -> dict:
    answer = answer_question(inputs["question"])
    return {"answer": answer}

# Define evaluators
evaluators = [
    # LLM-judged correctness
    LangChainStringEvaluator(
        "qa",
        config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)}
    ),
    # Exact string match
    LangChainStringEvaluator("exact_match"),
]

# Run evaluation
results = evaluate(
    predict,
    data="HelpMeTest Support QA",
    evaluators=evaluators,
    experiment_prefix="gpt4o-v2",
    metadata={"model": "gpt-4o", "prompt_version": "2.1"}
)

Results appear in LangSmith under "Experiments" — with per-example scores and aggregate metrics.

Custom Evaluators

Write domain-specific evaluators in Python:

from langsmith.schemas import Run, Example
from langsmith.evaluation import evaluate

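def check_no_self_hosting_claims(run: Run, example: Example) -> dict:
    """Ensure the model never claims HelpMeTest supports self-hosting.

    This is a plain substring check, so it also flags correct denials such as
    "Self-hosting is not available". Tighten the phrases before relying on it.
    """
    output = str((run.outputs or {}).get("answer", ""))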
    
    forbidden_phrases = [
        "self-host",
        "on-premise",
        "your own infrastructure",
        "deploy yourself"
    ]
    
    for phrase in forbidden_phrases:
        if phrase.lower() in output.lower():
            return {
                "key": "no_self_hosting_claim",
                "score": 0,
                "reason": f"Output contains forbidden phrase: '{phrase}'"
            }
    
    return {
        "key": "no_self_hosting_claim",
        "score": 1,
        "reason": "Output does not claim self-hosting support"
    }


def check_pricing_accuracy(run: Run, example: Example) -> dict:
    """Verify pricing claims match known values."""
    output = str(run.outputs.get("answer", ""))
    
    # If the question is about pricing, check for correct values
    question = str(run.inputs.get("question", ""))
    if "cost" in question.lower() or "price" in question.lower() or "pricing" in question.lower():
        if "$100" not in output and "100/month" not in output.lower():
            return {
                "key": "pricing_accuracy",
                "score": 0,
                "reason": "Pricing question answered without mentioning $100/month"
            }
    
    return {"key": "pricing_accuracy", "score": 1}


results = evaluate(
    predict,
    data="HelpMeTest Support QA",
    evaluators=[check_no_self_hosting_claims, check_pricing_accuracy],
    experiment_prefix="custom-eval"
)
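Custom evaluators are ordinary functions, so they can be checked offline before any experiment runs. The sketch below repeats a condensed copy of the self-hosting check so it runs on its own, and feeds it stand-in objects that expose only the attributes the evaluator reads. The `fake_run` helper is an assumption for testing; in a real evaluation the function receives `langsmith.schemas.Run` objects.

```python
from types import SimpleNamespace

FORBIDDEN_PHRASES = ("self-host", "on-premise", "your own infrastructure", "deploy yourself")


def check_no_self_hosting_claims(run, example):
    """Condensed copy of the evaluator above, repeated so this snippet runs alone."""
    output = str((run.outputs or {}).get("answer", "")).lower()
    hit = next((p for p in FORBIDDEN_PHRASES if p in output), None)
    return {"key": "no_self_hosting_claim", "score": 0 if hit else 1}


def fake_run(answer):
    """Stand-in with just the fields the evaluator uses (testing assumption)."""
    return SimpleNamespace(inputs={"question": "irrelevant"}, outputs={"answer": answer})


bad = check_no_self_hosting_claims(fake_run("Yes, you can self-host it."), example=None)
good = check_no_self_hosting_claims(fake_run("No, it is cloud-only."), example=None)
denial = check_no_self_hosting_claims(fake_run("Self-hosting is not available."), example=None)
print(bad["score"], good["score"], denial["score"])
```

The third case is the interesting one: a correct denial scores 0 because it contains the phrase "self-host". A quick offline test like this surfaces that limitation of substring matching before it skews an experiment.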

Annotation Queues for Human Review

Not everything can be evaluated automatically. LangSmith's annotation queues let you route specific traces to human reviewers.

Setting Up a Queue

from langsmith import Client

client = Client()

# Create an annotation queue for low-confidence outputs
queue = client.create_annotation_queue(
    name="Low Confidence Review",
    description="Outputs where the model expressed uncertainty or gave short answers"
)

Routing Traces to the Queue

@traceable(name="support-chatbot")
def handle_support_request(question: str) -> str:
    answer = answer_question(question)  # RAG pipeline from the manual tracing example
    
    # Route to human review if output seems uncertain
    if any(phrase in answer.lower() for phrase in ["i'm not sure", "i don't know", "unclear"]):
        # Add current run to annotation queue
        # (Use run_id from the current trace context)
        pass
    
    return answer

In the LangSmith UI, reviewers see the question, answer, and can mark it correct/incorrect and leave feedback. This builds your golden dataset over time.
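The routing condition is easier to test and tune when it lives in its own function. A minimal sketch follows, using the uncertainty phrases from the snippet above plus a length check to match the queue's "short answers" description. The function name and the five-word cutoff are assumptions to adjust for your application.

```python
UNCERTAIN_PHRASES = ("i'm not sure", "i don't know", "unclear")
MIN_ANSWER_WORDS = 5  # hypothetical cutoff for "short answers"; tune for your app


def needs_human_review(answer):
    """Return the reason an answer should be reviewed, or None if it looks fine."""
    lowered = answer.lower()
    for phrase in UNCERTAIN_PHRASES:
        if phrase in lowered:
            return f"uncertain phrase: {phrase!r}"
    if len(answer.split()) < MIN_ANSWER_WORDS:
        return "answer is very short"
    return None


for text in (
    "I don't know the answer to that question.",
    "Yes.",
    "HelpMeTest Pro costs $100/month with unlimited tests.",
):
    print(repr(text), "->", needs_human_review(text))
```

Returning a reason rather than a bare boolean gives reviewers some context on why a trace landed in the queue.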

Comparing Experiments

LangSmith's experiment comparison is its most powerful feature for iterative development.

# Experiment 1: Current prompt
results_v1 = evaluate(
    predict_v1,
    data="HelpMeTest Support QA",
    experiment_prefix="system-prompt-v1"
)

# Experiment 2: New prompt
results_v2 = evaluate(
    predict_v2,
    data="HelpMeTest Support QA",
    experiment_prefix="system-prompt-v2"
)

In the LangSmith UI, select both experiments and click "Compare." You get:

Aggregate score comparison
Per-example diff (where v2 improved, where it regressed)
Statistical significance indicators

This is how you make data-driven prompt engineering decisions instead of gut-feel ones.
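The per-example diff is simple enough to reproduce locally if you export scores, which helps when you want the same comparison in a script or report. This is a sketch of the idea only, not LangSmith's comparison logic; the function name and the scores are made up for illustration.

```python
def compare_scores(baseline, candidate, tolerance=0.0):
    """Split example IDs into improved, regressed, and unchanged.

    Both arguments map example ID -> score for the same evaluator key.
    Only examples present in both experiments are compared.
    """
    report = {"improved": [], "regressed": [], "unchanged": []}
    for example_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[example_id] - baseline[example_id]
        if delta > tolerance:
            report["improved"].append(example_id)
        elif delta < -tolerance:
            report["regressed"].append(example_id)
        else:
            report["unchanged"].append(example_id)
    return report


# Made-up scores for illustration, not real experiment output.
v1 = {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 1.0}
v2 = {"ex-1": 1.0, "ex-2": 1.0, "ex-3": 0.0}
report = compare_scores(v1, v2)
print(report)
```

Both versions here have the same average score, yet one example improved and another regressed. That is exactly the case where an aggregate comparison alone would hide a change worth reviewing.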

Tracing Costs and Latency

LangSmith automatically tracks:

Tokens per call — input tokens, output tokens, total
Cost — calculated from provider pricing
Latency — per step and total
Error rates — by run type, project, time window

Query this programmatically:

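from langsmith import Client
from datetime import datetime, timedelta, timezone

client = Client()

# Get root runs from the last 24 hours. Root runs are expected to roll up the
# tokens and cost of their child runs, so summing every run would double count.
runs = list(client.list_runs(
    project_name="production",
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
    is_root=True
))

total_tokens = sum(r.total_tokens or 0 for r in runs)
total_cost = sum(r.total_cost or 0 for r in runs)
# Latency is end time minus start time; skip runs that have not finished
durations = [(r.end_time - r.start_time).total_seconds() for r in runs if r.end_time]
avg_latency = sum(durations) / len(durations) if durations else 0

print("24h stats:")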
print(f"  Runs: {len(runs)}")
print(f"  Total tokens: {total_tokens:,}")
print(f"  Total cost: ${total_cost:.2f}")
print(f"  Avg latency: {avg_latency:.2f}s")
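An average can hide the slow requests that users actually notice, so it is worth reporting percentiles alongside it. The sketch below uses only the standard library and made-up latencies; `latency_summary` is a hypothetical helper, and the input would be the per-run durations collected above.

```python
import statistics


def latency_summary(latencies_s):
    """Return mean, median, and 95th-percentile latency in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_s, n=20, method="inclusive")[-1]
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": statistics.median(latencies_s),
        "p95": p95,
    }


# Made-up latencies: nine fast requests and one slow outlier.
samples = [0.4, 0.5, 0.4, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5, 6.0]
summary = latency_summary(samples)
print(summary)
```

With one outlier in ten requests, the mean lands well above the median, and the 95th percentile shows how bad the tail is.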

CI Integration

Fail your build when evaluation scores drop:

# scripts/langsmith_eval.py
import os
import sys
from langsmith.evaluation import evaluate, LangChainStringEvaluator
from langchain_openai import ChatOpenAI

THRESHOLDS = {
    "correctness": 0.80,
}

# Tag the experiment with the short commit SHA (GitHub Actions sets GITHUB_SHA)
commit = os.environ.get("GITHUB_SHA", "local")[:8]

def predict(inputs: dict) -> dict:
    from myapp.chatbot import answer_question
    return {"answer": answer_question(inputs["question"])}

results = evaluate(
    predict,
    data="HelpMeTest Support QA",
    evaluators=[
        LangChainStringEvaluator(
            "qa",
            config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)}
        )
    ],
    experiment_prefix=f"ci-{commit}"
)

# Check aggregate scores
df = results.to_pandas()
correctness = df["feedback.correctness"].mean()

print(f"Correctness: {correctness:.2f} (threshold: {THRESHOLDS['correctness']})")

if correctness < THRESHOLDS["correctness"]:
    print("FAILED: Below quality threshold")
    sys.exit(1)

print("PASSED")

GitHub Actions:

- name: LangSmith eval
  env:
    LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    LANGCHAIN_TRACING_V2: "true"
    LANGCHAIN_PROJECT: "ci"
  run: python scripts/langsmith_eval.py
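Once more than one metric gates the build, the pass/fail decision is worth pulling into a small function that can be unit tested without calling LangSmith. A minimal sketch follows; the function name and the second metric, "conciseness", are invented for the example.

```python
def failing_metrics(scores, thresholds):
    """Return {metric: (score, threshold)} for every metric below its threshold.

    A metric listed in `thresholds` but absent from `scores` counts as a
    failure, so a renamed or missing evaluator cannot pass silently.
    """
    failures = {}
    for metric, threshold in thresholds.items():
        score = scores.get(metric)
        if score is None or score < threshold:
            failures[metric] = (score, threshold)
    return failures


# Hypothetical aggregate scores; "conciseness" is an invented second metric.
thresholds = {"correctness": 0.80, "conciseness": 0.70}
scores = {"correctness": 0.85}
failures = failing_metrics(scores, thresholds)
print(failures)
```

In the CI script, a non-empty result would be printed and followed by `sys.exit(1)`. Treating a missing score as a failure is a deliberate choice here, since an evaluator that silently stopped reporting should not let the build pass.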

LangSmith Without LangChain

LangSmith works with any LLM framework:

from langsmith import traceable, Client
import anthropic

client_anthropic = anthropic.Anthropic()

@traceable(run_type="llm")
def call_claude(prompt: str, system: str = "") -> str:
    message = client_anthropic.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

@traceable(name="my-pipeline")
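def run_pipeline(question: str) -> str:
    context = retrieve_context(question)  # defined in the manual tracing example
    # Join outside the f-string: a backslash inside an f-string expression
    # is a syntax error before Python 3.12
    context_str = "\n".join(context)
    answer = call_claude(
        prompt=question,
        system=f"Answer using this context:\n{context_str}"
    )
    return answer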

Traces appear in LangSmith with the same detail as LangChain runs.

LangSmith vs Other Observability Tools

Feature	LangSmith	Weights & Biases	Arize	Custom logging
LangChain integration	Native	Plugin	Plugin	Manual
Dataset management	Yes	Yes	Limited	Manual
Evaluation layer	Yes	Yes	Yes	Custom
Annotation queues	Yes	No	Yes	No
Experiment comparison	Yes	Yes	Limited	Manual
Cost tracking	Yes	No	Yes	Manual

LangSmith is the natural choice for LangChain-heavy stacks. For non-LangChain code, evaluate Arize or W&B — though LangSmith's @traceable decorator works well enough.

Next Steps

Enable tracing in staging immediately — get visibility before problems reach production
Build your first dataset from golden examples and production traces
Set up an annotation queue for outputs the model is uncertain about
Run your first experiment comparison before your next prompt change
Explore TruLens for an open-source alternative with similar tracing + eval capabilities

For teams that need scheduled evaluation runs with alerting — running your LangSmith evaluations on a cron schedule and notifying when scores drop — HelpMeTest handles the scheduling and alerting layer.