Literal AI Testing Guide: Thread Tracking, Datasets, and Scoring
Conversational AI applications present a testing challenge that single-turn LLM pipelines don't: the context of the full conversation matters. What a user said three turns ago can determine whether the response at turn seven is right or wrong. Observability tools that log individual LLM calls in isolation miss this. Literal AI was built with conversational AI in mind: its thread model captures multi-turn interactions as coherent units, which makes it well suited to testing chatbots, customer support agents, and any application where conversation history shapes behavior.
This guide walks through Literal AI's core capabilities — thread tracking, dataset creation, and scoring — and shows how to integrate them into a real testing workflow.
The Conversational Testing Problem
Consider a customer support chatbot with this conversation:
- User: "I'd like to cancel my subscription"
- Bot: "I'm sorry to hear that. What's the main reason you're considering cancellation?"
- User: "It's too expensive"
- Bot: "I understand. We do have a more affordable plan — would you like me to check if you qualify for our promotional pricing?"
- User: "Sure"
- Bot: now needs to look up the user's account and eligibility
If you test only the last exchange in isolation — "Sure" → "Here's your promotional offer" — you miss the context. The bot's response to "Sure" is only correct if it remembered the entire conversation that preceded it. Testing it requires the full thread.
Literal AI's thread abstraction solves this directly.
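To make the difference concrete, here is a minimal sketch in plain Python (no SDK calls) that contrasts the payload a model sees when only the last exchange is tested with the payload built from the full thread. The helper `build_messages` and the system prompt are illustrative assumptions, not part of Literal AI.

```python
from typing import Optional

# Illustrative only: build_messages is a hypothetical helper, not a Literal AI API.
SYSTEM_PROMPT = "You are a helpful customer support agent."

thread = [
    {"role": "user", "content": "I'd like to cancel my subscription"},
    {"role": "assistant", "content": "What's the main reason you're considering cancellation?"},
    {"role": "user", "content": "It's too expensive"},
    {"role": "assistant", "content": "Would you like me to check if you qualify for our promotional pricing?"},
    {"role": "user", "content": "Sure"},
]


def build_messages(history: list, last_n: Optional[int] = None) -> list:
    """Return the chat payload, optionally keeping only the last `last_n` messages."""
    kept = history if last_n is None else history[-last_n:]
    return [{"role": "system", "content": SYSTEM_PROMPT}, *kept]


isolated = build_messages(thread, last_n=1)  # the model sees only "Sure"
full = build_messages(thread)                # the model sees the whole conversation

print(len(isolated), len(full))
```

With the isolated payload, nothing tells the model what "Sure" is agreeing to; the promotional-pricing offer only exists in the full payload.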
Setup
Install the SDK:
```bash
pip install literalai
```
Initialize the client:
```python
from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")
```
Thread and Step Tracking
A Thread in Literal AI represents a full conversation. Inside a thread, Steps represent individual turns or pipeline stages. The nesting gives you complete context for every exchange.
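As a mental model for that nesting, the sketch below uses plain dataclasses. These are hypothetical stand-ins written for this guide, not the SDK's actual classes, and the field names are assumptions chosen to mirror the code that follows.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One turn or pipeline stage inside a conversation."""
    name: str
    type: str
    input: dict = field(default_factory=dict)
    output: dict = field(default_factory=dict)


@dataclass
class Thread:
    """A full conversation: an ordered list of steps plus metadata."""
    name: str
    metadata: dict = field(default_factory=dict)
    steps: list = field(default_factory=list)

    def responses(self) -> list:
        """Return each step's response in order, skipping steps without one."""
        return [s.output["response"] for s in self.steps if "response" in s.output]


thread = Thread(name="Support Session - demo", metadata={"channel": "web"})
thread.steps.append(Step("user-message", "user_message", {"message": "I'd like to cancel"}))
thread.steps.append(Step("turn-0", "llm", {"user_message": "I'd like to cancel"},
                         {"response": "What's the main reason?"}))
print(thread.responses())
```

Because every step hangs off its thread, any single response can be read together with everything that preceded it.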
```python
from literalai import LiteralClient
import openai

lai = LiteralClient(api_key="your-api-key")
oai = openai.OpenAI()


def run_support_chat(conversation_history: list[dict], user_message: str) -> str:
    """Run one turn of the support chatbot."""
    # Add the new user message to history
    conversation_history.append({
        "role": "user",
        "content": user_message
    })
    response = oai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            *conversation_history
        ]
    )
    assistant_message = response.choices[0].message.content
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })
    return assistant_message


def run_support_session(user_id: str, initial_message: str):
    """Run a full support session, tracked as a Literal AI thread."""
    with lai.thread(name=f"Support Session - {user_id}",
                    metadata={"user_id": user_id, "channel": "web"}) as thread:
        conversation_history = []
        current_message = initial_message
        # Track the first user message as a step
        with lai.step(name="user-message", type="user_message") as step:
            step.input = {"message": current_message}
        # Run the conversation loop (in a real app, this would be event-driven)
        for turn in range(5):  # Max 5 turns
            with lai.step(name=f"turn-{turn}", type="llm") as step:
                step.input = {
                    # Snapshot the history: run_support_chat mutates the list in place
                    "history": list(conversation_history),
                    "user_message": current_message
                }
                response = run_support_chat(conversation_history, current_message)
                step.output = {"response": response}
                step.metadata = {
                    "turn": turn,
                    "history_length": len(conversation_history)
                }
            print(f"Bot: {response}")
            # Simulate user follow-up (in real app, this is the next event)
            if turn == 0:
                current_message = "It's too expensive"
            elif turn == 1:
                current_message = "Sure, check my eligibility"
            else:
                break
        return thread.id
```
The thread in the Literal AI UI shows the full conversation with all steps, making it easy to see exactly how context accumulated and where the conversation went wrong.
Using the Instrumented OpenAI Client
For automatic step tracking without manual context managers, Literal AI provides an instrumented OpenAI client:
```python
import openai
from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")
# The Python SDK's documented entry point is instrument_openai(), which patches
# the OpenAI SDK in place; check the method name against your installed version
lai.instrument_openai()
oai = openai.OpenAI()

# All calls through oai are automatically tracked as steps
# within whatever thread is currently active
with lai.thread(name="automated-support") as thread:
    response = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello, I need help"}]
    )
```
The instrumented client captures token counts, latency, and cost for each LLM call automatically.
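Conceptually, this kind of instrumentation wraps the call site and records measurements around it. The sketch below illustrates the idea only; it is not how Literal AI implements it. `track_call` and `fake_completion` are hypothetical names, and the response is a plain dict shaped loosely like a chat completion.

```python
import time
from typing import Any, Callable


def track_call(records: list, fn: Callable[..., dict]) -> Callable[..., dict]:
    """Wrap fn so each call appends its latency and token usage to records."""
    def wrapper(*args: Any, **kwargs: Any) -> dict:
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        records.append({
            "latency_s": time.perf_counter() - start,
            "total_tokens": result.get("usage", {}).get("total_tokens"),
        })
        return result
    return wrapper


def fake_completion(messages: list) -> dict:
    """Stand-in for an LLM call so the sketch runs offline."""
    return {"content": "Hello!", "usage": {"total_tokens": 12}}


records: list = []
tracked = track_call(records, fake_completion)
reply = tracked([{"role": "user", "content": "Hello, I need help"}])
print(records)
```

The calling code is unchanged apart from using the wrapped function, which is what makes automatic tracking attractive compared with manual context managers.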
Building Evaluation Datasets from Threads
Once you have production threads in Literal AI, you can select the interesting ones and add them to evaluation datasets. This is the core workflow for building regression suites: encounter a bug in production, add that thread to a dataset, never regress on it again.
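The "add that thread to a dataset" step usually needs a small conversion: the recorded conversation becomes the input, and the bad bot reply is dropped because it is the behavior under test. The following is a sketch under assumptions. `thread_to_dataset_item` is a hypothetical helper, the ticket ID is made up, and the dict shape simply mirrors the dataset items used in this guide.

```python
def thread_to_dataset_item(messages: list, expected_keywords: list, ticket: str) -> dict:
    """Turn a recorded conversation into a dataset item dict.

    The input keeps every message up to and including the last user turn.
    Raises ValueError if the conversation has no user message.
    """
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    return {
        "input": {"conversation": messages[: last_user + 1]},
        "expected_output": {"contains": expected_keywords},
        "metadata": {"source": "production-failure", "ticket": ticket},
    }


recorded = [
    {"role": "user", "content": "I want to cancel"},
    {"role": "assistant", "content": "I'm sorry to hear that. What's the reason?"},
    {"role": "user", "content": "Too expensive"},
    {"role": "assistant", "content": "OK, your subscription is cancelled."},  # the failure
]

item = thread_to_dataset_item(recorded, ["plan", "discount"], ticket="TKT-0000")
print(item["input"]["conversation"][-1])
```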
```python
from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")

# Create a dataset for conversational test cases
dataset = lai.api.create_dataset(
    name="support-chatbot-v1",
    description="Regression suite for the customer support chatbot",
    type="key_value"
)

# Add items to the dataset
items = [
    {
        "input": {
            "conversation": [
                {"role": "user", "content": "I want to cancel"},
                {"role": "assistant", "content": "I'm sorry to hear that. What's the reason?"},
                {"role": "user", "content": "Too expensive"},
                {"role": "assistant", "content": "Would you like to explore our more affordable plan?"},
                {"role": "user", "content": "Yes"}
            ]
        },
        "expected_output": {
            "contains": ["promotional", "plan", "discount"],
            "should_not_contain": ["cancel", "sorry to see you go"],
            "action": "check_promotion_eligibility"
        },
        "metadata": {
            "category": "retention",
            "source": "production-failure",
            "ticket": "TKT-8823"
        }
    },
    {
        "input": {
            "conversation": [
                {"role": "user", "content": "What's your refund policy?"}
            ]
        },
        "expected_output": {
            "contains": ["30 days", "full refund"],
            "action": "provide_policy_info"
        },
        "metadata": {
            "category": "policy",
            "source": "manual",
            "difficulty": "easy"
        }
    }
]

for item in items:
    lai.api.create_dataset_item(
        dataset_id=dataset.id,
        input=item["input"],
        expected_output=item["expected_output"],
        metadata=item["metadata"]
    )

print(f"Dataset created: {dataset.id}")
```
Scoring Functions
Literal AI's scoring system evaluates dataset items and production traces against defined quality criteria.
Manual Scoring (Human Feedback)
```python
def add_human_score(step_id: str, score_value: float, comment: str):
    """Add a human evaluation score to a specific step."""
    lai.api.create_score(
        step_id=step_id,
        name="human-quality",
        type="HUMAN",
        value=score_value,  # 0 to 1
        comment=comment
    )


# After reviewing a conversation thread:
add_human_score(
    step_id="step-abc123",
    score_value=0.9,
    comment="Response was helpful and on-topic, minor phrasing issue"
)
```
Automated Scoring with LLM-as-Judge
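The judge in this section asks the model to return only a decimal number and then converts the reply with `float()`. Models do not always comply (a reply such as "Score: 0.7" would raise `ValueError`), so a defensive parser is a reasonable addition. The sketch below is one way to do it; `parse_judge_score` is a hypothetical helper, and falling back to a default of 0.0 is an assumption you may want to replace with a retry or an explicit error.

```python
import re


def parse_judge_score(raw: str, default: float = 0.0) -> float:
    """Extract the first number from the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"-?(?:\d+\.\d+|\d+|\.\d+)", raw)
    if match is None:
        return default  # no number found at all
    return min(1.0, max(0.0, float(match.group())))


for reply in ["0.7", "Score: 0.85", "n/a", "7", "-0.3"]:
    print(repr(reply), parse_judge_score(reply))
```

Clamping keeps an out-of-range reply from skewing averages, at the cost of hiding the fact that the judge misbehaved; logging the raw reply alongside the score avoids losing that signal.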
```python
import openai
from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")
oai = openai.OpenAI()


def score_retention_handling(
    conversation: list[dict],
    response: str,
    step_id: str
) -> float:
    """Use an LLM judge (gpt-4o here) to score how well the bot handled a retention scenario."""
    conversation_text = "\n".join([
        f"{msg['role'].upper()}: {msg['content']}"
        for msg in conversation
    ])
    judge_prompt = f"""You are evaluating a customer support chatbot's retention handling.
Conversation history:
{conversation_text}
Bot's latest response: {response}
Rate the bot's retention handling on a scale of 0 to 1:
- 1.0: Excellently handled retention; showed empathy, offered alternatives, didn't push cancellation
- 0.7: Good handling; addressed the concern but missed an opportunity
- 0.4: Mediocre; acknowledged but didn't actively try to retain
- 0.1: Poor; facilitated cancellation without attempting retention
Return only a decimal number between 0 and 1."""
    judge_response = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}]
    )
    # float() raises ValueError if the judge returns anything other than a bare number
    score_value = float(judge_response.choices[0].message.content.strip())
    # Record the score in Literal AI
    lai.api.create_score(
        step_id=step_id,
        name="retention-handling",
        type="AI",
        value=score_value,
        comment=f"AI judge scored retention handling at {score_value:.2f}"
    )
    return score_value
```
Running a Dataset Evaluation Experiment
```python
def run_experiment(dataset_id: str, experiment_name: str):
    """Run the chatbot against the dataset and score all outputs."""
    items = lai.api.get_dataset_items(dataset_id=dataset_id)
    scores = []
    for item in items:
        conversation = item.input["conversation"]
        expected = item.expected_output
        # Run the chatbot on the conversation
        with lai.thread(name=f"eval-{experiment_name}") as thread:
            with lai.step(name="eval-turn", type="llm") as step:
                step.input = {"conversation": conversation}
                # The stored conversation ends with the user's latest message.
                # run_support_chat appends that message itself, so pass only the
                # earlier turns as history; the slice is a copy, which also keeps
                # the dataset item unmodified.
                response = run_support_chat(conversation[:-1], conversation[-1]["content"])
                step.output = {"response": response}
            # Score the response
            score = evaluate_response(response, expected, step_id=step.id)
            scores.append(score)
    avg_score = sum(scores) / len(scores) if scores else 0
    print(f"Experiment '{experiment_name}': avg score = {avg_score:.3f}")
    return avg_score


def evaluate_response(response: str, expected: dict, step_id: str) -> float:
    """Multi-factor scoring of a chatbot response."""
    scores = []
    # Check required content
    if "contains" in expected:
        hits = sum(1 for kw in expected["contains"] if kw.lower() in response.lower())
        content_score = hits / len(expected["contains"])
        scores.append(content_score)
        lai.api.create_score(
            step_id=step_id,
            name="required-content",
            type="AI",
            value=content_score,
            comment=f"Found {hits}/{len(expected['contains'])} required keywords"
        )
    # Check prohibited content
    if "should_not_contain" in expected:
        violations = [
            kw for kw in expected["should_not_contain"]
            if kw.lower() in response.lower()
        ]
        safety_score = 1.0 if not violations else 0.0
        scores.append(safety_score)
        lai.api.create_score(
            step_id=step_id,
            name="content-safety",
            type="AI",
            value=safety_score,
            comment=f"Violations: {violations}" if violations else "No violations"
        )
    return sum(scores) / len(scores) if scores else 0.0
```
Testing Multi-Turn Context Retention
A key test for conversational AI: does the bot remember what was said earlier in the thread?
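Because tests of this kind call a live model, it can help to validate the test harness itself against a deterministic stand-in first. The sketch below assumes a chat function with the same `(history, message) -> str` shape as `run_support_chat`; `replay`, `remembering_bot`, and `forgetful_bot` are hypothetical names used only for illustration.

```python
import re
from typing import Callable

ChatFn = Callable[[list, str], str]

SCRIPT = [
    "My account number is 12345",
    "What's the weather like?",
    "What account number did I just tell you?",
]


def replay(chat_fn: ChatFn, user_messages: list) -> list:
    """Feed scripted user messages through chat_fn, sharing one history list."""
    history: list = []
    return [chat_fn(history, message) for message in user_messages]


def remembering_bot(history: list, message: str) -> str:
    """Stub bot: records the turn, then answers recall questions from history."""
    history.append({"role": "user", "content": message})
    reply = "Noted."
    if "what account number" in message.lower():
        # Search earlier user turns for a number
        for past in history[:-1]:
            found = re.search(r"\d+", past["content"])
            if past["role"] == "user" and found:
                reply = f"You told me {found.group()}."
                break
    history.append({"role": "assistant", "content": reply})
    return reply


def forgetful_bot(history: list, message: str) -> str:
    """Stub bot that ignores history entirely."""
    return "Which account do you mean?"


print(replay(remembering_bot, SCRIPT)[-1])
print(replay(forgetful_bot, SCRIPT)[-1])
```

If the recall assertion passes for the remembering stub and fails for the forgetful one, the harness discriminates correctly, and any failure against the real model points at the model or prompt instead of the test code.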
```python
import pytest
from literalai import LiteralClient

# run_support_chat is the function defined earlier in this guide;
# import it from wherever it lives in your project.

lai = LiteralClient(api_key="your-api-key")


def test_context_retention_across_turns():
    """Verify the bot correctly uses context from earlier in the conversation."""
    # Conversation where turn 3 requires knowing what was said in turn 1
    conversation_progression = [
        "My account number is 12345",                # Turn 1: provides account
        "What's the weather like?",                  # Turn 2: off-topic
        "What account number did I just tell you?",  # Turn 3: tests memory
    ]
    conversation_history = []
    responses = []
    with lai.thread(name="context-retention-test") as thread:
        for message in conversation_progression:
            with lai.step(name="turn") as step:
                # Snapshot the history: run_support_chat mutates the list in place
                step.input = {"message": message, "history": list(conversation_history)}
                response = run_support_chat(conversation_history, message)
                step.output = {"response": response}
                responses.append(response)
    # The bot should recall "12345" in response to the third message
    final_response = responses[-1]
    assert "12345" in final_response, (
        f"Bot failed to recall account number from turn 1. "
        f"Got: '{final_response}'. "
        f"Thread: https://cloud.literal.ai/threads/{thread.id}"
    )


def test_intent_continuity():
    """Verify the bot maintains the intent context across topic switches."""
    with lai.thread(name="intent-continuity-test") as thread:
        h1 = []  # Shared history for all three turns
        # Establish a refund request
        with lai.step(name="turn-1") as step:
            r1 = run_support_chat(h1, "I need a refund for my purchase from last week")
            step.output = {"response": r1}
        # Slightly off-topic
        with lai.step(name="turn-2") as step:
            r2 = run_support_chat(h1, "Actually, what IS your refund policy exactly?")
            step.output = {"response": r2}
        # Return to the original request
        with lai.step(name="turn-3") as step:
            r3 = run_support_chat(h1, "OK, so can I get my refund?")
            step.output = {"response": r3}
    # Turn 3 should still be about the refund, not asking for details again
    refund_keywords = ["refund", "process", "initiate", "submit"]
    has_refund_content = any(kw in r3.lower() for kw in refund_keywords)
    assert has_refund_content, (
        f"Bot lost intent context. Turn 3 response: '{r3}'. "
        f"Thread: https://cloud.literal.ai/threads/{thread.id}"
    )
```
CI Integration
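The workflow in this section calls `scripts/run_literal_eval.py` with `--dataset`, `--experiment`, and `--min-score` flags, but the guide does not show that script. The following is a minimal sketch of the gating logic only, under stated assumptions: the evaluation function is injected so the example runs offline, and in a real script it would wrap something like `run_experiment` after resolving the dataset name to an ID.

```python
import argparse
from typing import Callable, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the three flags used by the CI workflow."""
    parser = argparse.ArgumentParser(description="Gate CI on a dataset evaluation score")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--experiment", required=True)
    parser.add_argument("--min-score", type=float, default=0.80)
    return parser.parse_args(argv)


def main(evaluate: Callable[[str, str], float],
         argv: Optional[Sequence[str]] = None) -> int:
    """Return 0 when the average score meets the threshold, 1 otherwise."""
    args = parse_args(argv)
    avg_score = evaluate(args.dataset, args.experiment)
    passed = avg_score >= args.min_score
    print(f"avg={avg_score:.3f} min={args.min_score:.2f} -> {'PASS' if passed else 'FAIL'}")
    return 0 if passed else 1


# Offline demonstration with a stub evaluator instead of a real experiment run
demo_args = ["--dataset", "support-chatbot-v1", "--experiment", "abc123", "--min-score", "0.80"]
exit_code = main(lambda dataset, experiment: 0.91, demo_args)
```

In an actual script, the last line would be something like `raise SystemExit(main(my_evaluate))`, so that a score below the threshold produces a non-zero exit code and fails the CI job.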
```yaml
# .github/workflows/conversational-ai-tests.yml
name: Conversational AI Tests
on:
  pull_request:
    paths:
      - 'src/chatbot/**'
      - 'prompts/**'
jobs:
  dataset-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install literalai openai pytest
      - name: Run conversation tests
        run: pytest tests/conversation/ -v
        env:
          LITERAL_API_KEY: ${{ secrets.LITERAL_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Run dataset evaluation
        run: |
          python scripts/run_literal_eval.py \
            --dataset support-chatbot-v1 \
            --experiment ${{ github.sha }} \
            --min-score 0.80
        env:
          LITERAL_API_KEY: ${{ secrets.LITERAL_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
Integrating with Application-Level Testing
Literal AI covers conversational AI evaluation — the quality of the LLM's responses across multi-turn threads. It doesn't cover how your chat UI behaves, how messages are rendered, or how the application handles network errors or timeouts.
HelpMeTest fills that gap with Robot Framework + Playwright automation. A complete test strategy for a conversational AI product looks like:
- Literal AI dataset eval — does the chatbot give correct responses across regression test cases?
- HelpMeTest E2E tests — does the chat UI load correctly, can users send messages, are responses displayed properly, does the typing indicator work?
- Production thread monitoring — are real user threads getting low scores that indicate model drift?
The Literal AI dashboard gives you the AI quality view. HelpMeTest gives you the user experience view. Both are necessary for shipping confidently.
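For the production monitoring item in the list above, one simple way to turn "low scores that indicate model drift" into a check is a rolling mean compared against a baseline. This is a sketch, not a Literal AI feature: `drifting` is a hypothetical helper, and the window, baseline, and tolerance values are placeholders to tune for your own traffic.

```python
from statistics import mean


def drifting(scores: list, window: int = 20,
             baseline: float = 0.8, tolerance: float = 0.1) -> bool:
    """Flag drift when the mean of the latest `window` scores is more than
    `tolerance` below `baseline`."""
    if len(scores) < window:
        return False  # not enough data to judge
    return mean(scores[-window:]) < baseline - tolerance


healthy = [0.9] * 20
degraded = healthy + [0.5] * 20

print(drifting(healthy), drifting(degraded))
```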
Key Concepts Recap
| Concept | What It Is | When You Use It |
|---|---|---|
| Thread | A complete conversation session | Always — every conversation is a thread |
| Step | A single action within a thread | LLM calls, tool calls, retrieval steps |
| Dataset | Collection of test conversations | Regression testing |
| Score | Quality rating for a step or thread | Automated and human evaluation |
| Experiment | A run of your pipeline against a dataset | Before-after comparison of changes |
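The before-after comparison in the last row can be as simple as a per-item diff between two experiment runs. The sketch below assumes each run has been reduced to a mapping from dataset item ID to score; `find_regressions`, the item IDs, and the 0.05 tolerance are illustrative assumptions.

```python
def find_regressions(before: dict, after: dict, tolerance: float = 0.05) -> dict:
    """Return items whose score dropped by more than `tolerance` between runs."""
    return {
        item_id: (before[item_id], after[item_id])
        for item_id in before.keys() & after.keys()
        if before[item_id] - after[item_id] > tolerance
    }


baseline_run = {"retention-1": 0.9, "policy-1": 1.0}
candidate_run = {"retention-1": 0.5, "policy-1": 1.0}

print(find_regressions(baseline_run, candidate_run))
```

Comparing per item instead of comparing averages keeps a single badly regressed conversation from being masked by improvements elsewhere.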
Summary
Literal AI's thread-centric design makes it the right fit for testing conversational AI. By capturing full conversation context — not just individual LLM calls — it enables you to test whether your chatbot correctly uses context, maintains intent across topic switches, and responds appropriately at every turn based on what came before.
The dataset and scoring system provides the regression safety net: once you identify a failure in production, you add it to the dataset and it becomes a permanent test case. The CI integration ensures that future changes can't break those cases silently. Combined with end-to-end application testing, it's a complete quality foundation for AI-powered conversational products.