LangSmith for LLM Tracing and Evaluation
LangSmith gives you production observability for LLM applications — full request traces, cost tracking, latency breakdowns, and human annotation queues. Combined with its evaluation layer, you can compare prompt versions, run automated evaluators, and catch regressions before they reach users.
Why Tracing Matters for LLM Applications
An LLM application isn't a single function call. It's a pipeline: retrieve context, format prompt, call model, parse output, maybe call tools, retry on failure. When something goes wrong, you need to know exactly where in that pipeline it broke.
Without tracing, debugging looks like this:
- User says "the answer was wrong"
- You check the final output
- You don't know if retrieval failed, the prompt was malformed, or the model hallucinated
With LangSmith tracing, you see every step — inputs, outputs, latency, token usage, and cost — for every request, in a searchable UI.
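To make the idea of a trace concrete, here is a minimal standard-library sketch of what a tracer records for each step: a name, the parent step, the inputs, the output, and the elapsed time. It is not how LangSmith is implemented. `span`, `SPANS`, and the stub pipeline functions are hypothetical names used only for illustration.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store; a real tracer ships spans to a backend.
SPANS = []
_stack = []

@contextmanager
def span(name, **inputs):
    """Record one pipeline step with its parent, inputs, and duration."""
    record = {
        "name": name,
        "parent": _stack[-1]["name"] if _stack else None,
        "inputs": inputs,
        "output": None,
        "seconds": None,
    }
    SPANS.append(record)
    _stack.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["seconds"] = time.perf_counter() - start
        _stack.pop()

def retrieve(query):
    with span("retrieve", query=query) as s:
        s["output"] = ["Pro plan: $100/month"]  # stub retrieval result
        return s["output"]

def generate(question, context):
    with span("generate", question=question, context=context) as s:
        s["output"] = f"Based on the docs: {context[0]}"  # stub model call
        return s["output"]

def pipeline(question):
    with span("pipeline", question=question) as s:
        s["output"] = generate(question, retrieve(question))
        return s["output"]

pipeline("What does Pro cost?")
for s in SPANS:
    print(s["name"], "<-", s["parent"], f"{s['seconds'] * 1000:.2f}ms")
```

With a record like this for every step, "the answer was wrong" becomes a question you can answer by reading the retrieval output and the exact prompt that reached the model.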
Setup
    pip install langsmith langchain langchain-openai

Set environment variables:

    export LANGCHAIN_TRACING_V2=true
    export LANGCHAIN_API_KEY=ls__...
    export LANGCHAIN_PROJECT=my-app  # optional, organizes traces

That's it. If you're using LangChain, tracing is automatic from this point forward.
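A missing or mistyped variable tends to fail silently: the application runs but no traces show up. The sketch below is one way to check the configuration at startup. `tracing_config_problems` is a hypothetical helper, not part of the LangSmith SDK, and it only checks the variables named above.

```python
import os

REQUIRED = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")

def tracing_config_problems(env=None):
    """Return a list of problems; an empty list means tracing looks configured."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    # A value other than "true" leaves tracing switched off
    if env.get("LANGCHAIN_TRACING_V2", "").lower() not in ("", "true"):
        problems.append("LANGCHAIN_TRACING_V2 should be 'true'")
    return problems

# Check the real environment at startup
print(tracing_config_problems() or "tracing environment looks configured")
```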
Automatic Tracing with LangChain
Once env vars are set, every LangChain call is traced:
    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate

    llm = ChatOpenAI(model="gpt-4o")

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a support agent for HelpMeTest. Answer concisely."),
        ("human", "{question}")
    ])

    chain = prompt | llm

    # This run is automatically traced in LangSmith
    response = chain.invoke({"question": "What does HelpMeTest Pro cost?"})
    print(response.content)

Open your LangSmith project and you'll see the trace with the full prompt, model response, latency, and token count.
Manual Tracing for Non-LangChain Code
Not using LangChain? Trace any Python code with the traceable decorator:
    from langsmith import traceable
    import openai

    client = openai.OpenAI()

    @traceable(name="retrieve-context")
    def retrieve_context(query: str) -> list[str]:
        # Your retrieval logic; vector_db stands in for your own vector store client
        results = vector_db.search(query, top_k=5)
        return [r.text for r in results]

    @traceable(name="generate-answer")
    def generate_answer(question: str, context: list[str]) -> str:
        context_str = "\n".join(context)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"Answer using this context:\n{context_str}"},
                {"role": "user", "content": question}
            ]
        )
        return response.choices[0].message.content

    @traceable(name="rag-pipeline")
    def answer_question(question: str) -> str:
        context = retrieve_context(question)
        return generate_answer(question, context)

    # Full pipeline trace: each nested @traceable call creates a child span
    result = answer_question("What is HelpMeTest's free plan limit?")

In LangSmith, you'll see:

    rag-pipeline (423ms)
    ├── retrieve-context (156ms)
    └── generate-answer (267ms)

Tracing with Context Metadata
Add metadata to filter and search traces:
    from langsmith import traceable

    @traceable(
        name="support-chatbot",
        tags=["production", "support"],
        metadata={"version": "2.1", "user_tier": "pro"}
    )
    def handle_support_request(user_id: str, question: str) -> str:
        # ...
        pass

Filter by tag or metadata in the LangSmith UI to compare production vs. staging, or Pro vs. Free users.
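Values such as the user tier usually differ per request, so hard-coding them in the decorator only goes so far. The sketch below builds the tags and metadata for one request as a plain dict. `build_trace_extra` and its allowed tiers are hypothetical. The commented-out last line assumes the `langsmith_extra` keyword that `@traceable` functions accept at call time, so check it against your installed SDK version.

```python
def build_trace_extra(environment: str, user_tier: str, app_version: str) -> dict:
    """Build per-request tags and metadata for a traced call."""
    allowed_tiers = {"free", "pro"}  # assumed tiers, adjust to your plans
    if user_tier not in allowed_tiers:
        raise ValueError(f"unknown user_tier: {user_tier!r}")
    return {
        "tags": [environment, "support"],
        "metadata": {"version": app_version, "user_tier": user_tier},
    }

extra = build_trace_extra("production", "pro", "2.1")
print(extra)

# Assumed usage with a @traceable function (requires the langsmith package):
# handle_support_request(user_id, question, langsmith_extra=extra)
```

Validating the tier up front keeps filter values consistent, so "pro" and "Pro" do not end up as two separate groups in the UI.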
Creating Datasets
LangSmith's evaluation layer starts with datasets — collections of inputs and expected outputs.
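Each example pairs an inputs dict with an outputs dict. Before uploading anything, it can help to validate that shape locally. The sketch below uses a hypothetical helper, `split_examples`, which rejects malformed examples and returns the two parallel lists that a bulk upload typically expects.

```python
def split_examples(examples):
    """Validate example dicts and return parallel (inputs, outputs) lists."""
    inputs, outputs = [], []
    for i, example in enumerate(examples):
        for field in ("inputs", "outputs"):
            if not isinstance(example.get(field), dict) or not example[field]:
                raise ValueError(f"example {i}: '{field}' must be a non-empty dict")
        inputs.append(example["inputs"])
        outputs.append(example["outputs"])
    return inputs, outputs

examples = [
    {"inputs": {"question": "Does HelpMeTest support self-hosting?"},
     "outputs": {"answer": "No, HelpMeTest is a cloud-hosted SaaS."}},
]
inputs, outputs = split_examples(examples)
print(len(inputs), "valid example(s)")
```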
From Scratch
    from langsmith import Client

    client = Client()

    # Create a dataset
    dataset = client.create_dataset(
        dataset_name="HelpMeTest Support QA",
        description="Golden QA pairs for support chatbot evaluation"
    )

    # Add examples
    examples = [
        {
            "inputs": {"question": "What does HelpMeTest Pro cost?"},
            "outputs": {"answer": "HelpMeTest Pro costs $100/month with unlimited tests and parallel execution."}
        },
        {
            "inputs": {"question": "Does HelpMeTest support self-hosting?"},
            "outputs": {"answer": "No, HelpMeTest is a cloud-hosted SaaS. Self-hosting is not available."}
        },
        {
            "inputs": {"question": "What testing frameworks does HelpMeTest use?"},
            "outputs": {"answer": "HelpMeTest uses Robot Framework with Playwright for browser automation."}
        },
        {
            "inputs": {"question": "How does health monitoring work?"},
            "outputs": {"answer": "Use the helpmetest CLI: helpmetest health <name> <grace_period>. Grace periods include 30s, 5m, 2h, 1d."}
        },
    ]

    client.create_examples(
        inputs=[e["inputs"] for e in examples],
        outputs=[e["outputs"] for e in examples],
        dataset_id=dataset.id
    )

From Production Traces
Capture real user interactions from production and add them to a dataset:
    # Get traces from production runs, including failed ones
    runs = client.list_runs(
        project_name="production",
        run_type="chain",
        filter='gt(total_tokens, 100)',
        limit=100
    )

    # Add failed or uncertain runs to the dataset as edge cases
    for run in runs:
        if run.error or "I don't know" in str(run.outputs):
            client.create_examples(
                inputs=[run.inputs],
                outputs=[run.outputs],
                dataset_id=dataset.id
            )

This is powerful: use production failures to grow your dataset automatically. Because these runs are selected for their poor answers, review and correct the stored outputs before treating them as reference answers.
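Harvesting from production tends to collect the same question many times over. A minimal sketch of de-duplicating candidates by normalized question before they are added is shown below. `dedupe_by_question` is a hypothetical helper that works on plain dicts shaped like the examples above, not on LangSmith run objects.

```python
def dedupe_by_question(candidates, existing_questions=()):
    """Keep the first candidate per normalized question, skipping known ones."""
    def normalize(text):
        # Case-insensitive, whitespace-insensitive comparison key
        return " ".join(text.lower().split())

    seen = {normalize(q) for q in existing_questions}
    unique = []
    for candidate in candidates:
        key = normalize(candidate["inputs"]["question"])
        if key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique

candidates = [
    {"inputs": {"question": "What does Pro cost?"}, "outputs": {"answer": "I don't know"}},
    {"inputs": {"question": "what does  pro cost?"}, "outputs": {"answer": "Not sure"}},
    {"inputs": {"question": "Is there an API?"}, "outputs": {"answer": "I don't know"}},
]
unique = dedupe_by_question(candidates, existing_questions=["Is there an API?"])
print([c["inputs"]["question"] for c in unique])
```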
Running Evaluations
Once you have a dataset, evaluate your application against it:
    from langchain_openai import ChatOpenAI
    from langsmith import Client
    from langsmith.evaluation import evaluate, LangChainStringEvaluator

    client = Client()

    # The function to evaluate: takes dataset inputs, returns outputs
    def predict(inputs: dict) -> dict:
        answer = answer_question(inputs["question"])
        return {"answer": answer}

    # Define evaluators
    evaluators = [
        # LLM-judged correctness
        LangChainStringEvaluator(
            "qa",
            config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)}
        ),
        # Exact string match
        LangChainStringEvaluator("exact_match"),
    ]

    # Run evaluation
    results = evaluate(
        predict,
        data="HelpMeTest Support QA",
        evaluators=evaluators,
        experiment_prefix="gpt4o-v2",
        metadata={"model": "gpt-4o", "prompt_version": "2.1"}
    )

Results appear in LangSmith under "Experiments", with per-example scores and aggregate metrics.
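Exact string match is strict: a difference in case, punctuation, or spacing counts as a miss, so free-form answers rarely score. The sketch below shows a more forgiving comparison as a plain function. `normalize` and `normalized_match` are hypothetical helpers, not LangSmith or LangChain APIs.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def normalized_match(prediction: str, reference: str) -> int:
    """Return 1 when the two answers are equal after normalization, else 0."""
    return int(normalize(prediction) == normalize(reference))

same = normalized_match("No, self-hosting is NOT available.",
                        "no self-hosting is not available")
different = normalized_match("Yes, it is available.", "No, it is not available.")
print(same, different)
```

Even normalized matching only rewards near-identical wording, which is why the LLM-judged "qa" evaluator carries most of the weight for open-ended answers.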
Custom Evaluators
Write domain-specific evaluators in Python:
    from langsmith.schemas import Run, Example
    from langsmith.evaluation import evaluate

    def check_no_self_hosting_claims(run: Run, example: Example) -> dict:
        """Ensure the model never claims HelpMeTest supports self-hosting."""
        output = str((run.outputs or {}).get("answer", "")).lower()
        # Caveat: a plain substring check also flags correct denials such as
        # "Self-hosting is not available"; refine it before relying on the score
        forbidden_phrases = [
            "self-host",
            "on-premise",
            "your own infrastructure",
            "deploy yourself"
        ]
        for phrase in forbidden_phrases:
            if phrase in output:
                return {
                    "key": "no_self_hosting_claim",
                    "score": 0,
                    "comment": f"Output contains forbidden phrase: '{phrase}'"
                }
        return {
            "key": "no_self_hosting_claim",
            "score": 1,
            "comment": "Output does not claim self-hosting support"
        }

    def check_pricing_accuracy(run: Run, example: Example) -> dict:
        """Verify pricing claims match known values."""
        output = str((run.outputs or {}).get("answer", ""))
        # If the question is about pricing, check for correct values
        question = str((run.inputs or {}).get("question", "")).lower()
        if any(word in question for word in ("cost", "price", "pricing")):
            if "$100" not in output and "100/month" not in output.lower():
                return {
                    "key": "pricing_accuracy",
                    "score": 0,
                    "comment": "Pricing question answered without mentioning $100/month"
                }
        return {"key": "pricing_accuracy", "score": 1}

    results = evaluate(
        predict,
        data="HelpMeTest Support QA",
        evaluators=[check_no_self_hosting_claims, check_pricing_accuracy],
        experiment_prefix="custom-eval"
    )

Annotation Queues for Human Review
Not everything can be evaluated automatically. LangSmith's annotation queues let you route specific traces to human reviewers.
Setting Up a Queue
    from langsmith import Client

    client = Client()

    # Create an annotation queue for low-confidence outputs
    queue = client.create_annotation_queue(
        name="Low Confidence Review",
        description="Outputs where the model expressed uncertainty or gave short answers"
    )

Routing Traces to the Queue
    from langsmith import traceable

    @traceable(name="support-chatbot")
    def handle_support_request(question: str) -> str:
        # Reuse the RAG pipeline defined earlier
        answer = answer_question(question)

        # Route to human review if output seems uncertain
        if any(phrase in answer.lower() for phrase in ["i'm not sure", "i don't know", "unclear"]):
            # Add current run to annotation queue
            # (Use run_id from the current trace context)
            pass

        return answer

In the LangSmith UI, reviewers see the question and answer, can mark it correct or incorrect, and can leave feedback. This builds your golden dataset over time.
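The routing condition is easier to test and tune when it lives in its own function. A minimal sketch follows, covering both criteria from the queue description (expressed uncertainty, short answers). `needs_human_review` and the five-word threshold are assumptions for illustration, not LangSmith features.

```python
UNCERTAIN_PHRASES = ("i'm not sure", "i don't know", "unclear")

def needs_human_review(answer: str, min_words: int = 5) -> bool:
    """Flag answers that hedge or are suspiciously short."""
    # Normalize curly apostrophes so "I don’t know" matches as well
    text = answer.lower().replace("\u2019", "'")
    if any(phrase in text for phrase in UNCERTAIN_PHRASES):
        return True
    return len(text.split()) < min_words

for sample in [
    "I don\u2019t know the exact limit for that plan.",
    "Yes.",
    "HelpMeTest Pro costs $100/month with unlimited tests.",
]:
    print(needs_human_review(sample), "-", sample)
```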
Comparing Experiments
LangSmith's experiment comparison is its most powerful feature for iterative development.
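    # predict_v1 and predict_v2 wrap the current and the new system prompt;
    # evaluators is the list defined under "Running Evaluations"

    # Experiment 1: Current prompt
    results_v1 = evaluate(
        predict_v1,
        data="HelpMeTest Support QA",
        evaluators=evaluators,
        experiment_prefix="system-prompt-v1"
    )

    # Experiment 2: New prompt
    results_v2 = evaluate(
        predict_v2,
        data="HelpMeTest Support QA",
        evaluators=evaluators,
        experiment_prefix="system-prompt-v2"
    )

In the LangSmith UI, select both experiments and click "Compare." You get: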
- Aggregate score comparison
- Per-example diff (where v2 improved, where it regressed)
- Statistical significance indicators
This is how you make data-driven prompt engineering decisions instead of gut-feel ones.
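The comparison view boils down to a per-example diff of scores. The sketch below reproduces that idea on plain dicts keyed by example ID, which can be handy in a script or notebook. `compare_scores` and the sample scores are hypothetical and do not come from a real experiment.

```python
def compare_scores(baseline: dict, candidate: dict) -> dict:
    """Compare per-example scores from two experiments keyed by example ID."""
    shared = sorted(baseline.keys() & candidate.keys())

    def mean(scores):
        return sum(scores) / len(scores) if scores else 0.0

    return {
        "baseline_mean": mean([baseline[k] for k in shared]),
        "candidate_mean": mean([candidate[k] for k in shared]),
        "improved": [k for k in shared if candidate[k] > baseline[k]],
        "regressed": [k for k in shared if candidate[k] < baseline[k]],
    }

# Made-up scores for illustration only
v1 = {"ex1": 1, "ex2": 0, "ex3": 1, "ex4": 0}
v2 = {"ex1": 1, "ex2": 1, "ex3": 0, "ex4": 1}
report = compare_scores(v1, v2)
print(report)
```

A higher mean can hide individual regressions, so it is worth reading the `regressed` list before shipping a prompt change.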
Tracing Costs and Latency
LangSmith automatically tracks:
- Tokens per call — input tokens, output tokens, total
- Cost — calculated from provider pricing
- Latency — per step and total
- Error rates — by run type, project, time window
Query this programmatically:
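    from datetime import datetime, timedelta, timezone
    from langsmith import Client

    client = Client()

    # Get root runs from the last 24 hours. Root runs already aggregate the
    # tokens and cost of their child spans, so nested runs are not double-counted.
    runs = list(client.list_runs(
        project_name="production",
        start_time=datetime.now(timezone.utc) - timedelta(days=1),
        is_root=True
    ))

    def latency_seconds(run) -> float:
        """Wall-clock duration of a run; 0 if it has not finished."""
        if run.start_time and run.end_time:
            return (run.end_time - run.start_time).total_seconds()
        return 0.0

    total_tokens = sum(r.total_tokens or 0 for r in runs)
    total_cost = sum(r.total_cost or 0 for r in runs)
    avg_latency = sum(latency_seconds(r) for r in runs) / len(runs) if runs else 0

    print("24h stats:")
    print(f"  Runs: {len(runs)}")
    print(f"  Total tokens: {total_tokens:,}")
    print(f"  Total cost: ${total_cost:.2f}")
    print(f"  Avg latency: {avg_latency:.2f}s")

CI Integration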
Fail your build when evaluation scores drop:
    # scripts/langsmith_eval.py
    import os
    import sys

    from langchain_openai import ChatOpenAI
    from langsmith.evaluation import evaluate, LangChainStringEvaluator

    THRESHOLDS = {
        "correctness": 0.80,
    }

    def predict(inputs: dict) -> dict:
        from myapp.chatbot import answer_question
        return {"answer": answer_question(inputs["question"])}

    # Tag the experiment with the short commit SHA (or "local" outside CI)
    commit = os.environ.get("GITHUB_SHA", "local")[:8]

    results = evaluate(
        predict,
        data="HelpMeTest Support QA",
        evaluators=[
            LangChainStringEvaluator(
                "qa",
                config={"llm": ChatOpenAI(model="gpt-4o", temperature=0)}
            )
        ],
        experiment_prefix=f"ci-{commit}"
    )

    # Check aggregate scores
    df = results.to_pandas()
    correctness = df["feedback.correctness"].mean()

    print(f"Correctness: {correctness:.2f} (threshold: {THRESHOLDS['correctness']})")

    if correctness < THRESHOLDS["correctness"]:
        print("FAILED: Below quality threshold")
        sys.exit(1)

    print("PASSED")

GitHub Actions:
    - name: LangSmith eval
      env:
        LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        LANGCHAIN_TRACING_V2: "true"
        LANGCHAIN_PROJECT: "ci"
      run: python scripts/langsmith_eval.py

LangSmith Without LangChain
LangSmith works with any LLM framework:
    from langsmith import traceable
    import anthropic

    client_anthropic = anthropic.Anthropic()

    @traceable(run_type="llm")
    def call_claude(prompt: str, system: str = "") -> str:
        message = client_anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    @traceable(name="my-pipeline")
    def run_pipeline(question: str) -> str:
        # retrieve_context is the function from the manual tracing example
        context = retrieve_context(question)
        # Join outside the f-string: backslashes inside f-string expressions
        # are a syntax error before Python 3.12
        context_str = "\n".join(context)
        answer = call_claude(
            prompt=question,
            system=f"Answer using this context:\n{context_str}"
        )
        return answer

Traces appear in LangSmith with the same detail as LangChain runs.
LangSmith vs Other Observability Tools
| | LangSmith | Weights & Biases | Arize | Custom logging |
|---|---|---|---|---|
| LangChain integration | Native | Plugin | Plugin | Manual |
| Dataset management | Yes | Yes | Limited | Manual |
| Evaluation layer | Yes | Yes | Yes | Custom |
| Annotation queues | Yes | No | Yes | No |
| Experiment comparison | Yes | Yes | Limited | Manual |
| Cost tracking | Yes | No | Yes | Manual |
LangSmith is the natural choice for LangChain-heavy stacks. For non-LangChain code, evaluate Arize or W&B — though LangSmith's @traceable decorator works well enough.
Next Steps
- Enable tracing in staging immediately — get visibility before problems reach production
- Build your first dataset from golden examples and production traces
- Set up an annotation queue for outputs the model is uncertain about
- Run your first experiment comparison before your next prompt change
- Explore TruLens for an open-source alternative with similar tracing + eval capabilities
For teams that need scheduled evaluation runs with alerting — running your LangSmith evaluations on a cron schedule and notifying when scores drop — HelpMeTest handles the scheduling and alerting layer.