Literal AI Testing Guide: Thread Tracking, Datasets, and Scoring

HelpMeTest

18 May 2026 — 8 min read

Conversational AI applications present a testing challenge that single-turn LLM pipelines don't: the context of the full conversation matters. What a user said three turns ago can determine whether the response at turn seven is right or wrong. Observability tools that log individual LLM calls in isolation miss this. Literal AI was built with conversational AI in mind: its thread model captures multi-turn interactions as coherent units, which makes it well suited to testing chatbots, customer support agents, and any application where conversation history shapes behavior.

This guide walks through Literal AI's core capabilities — thread tracking, dataset creation, and scoring — and shows how to integrate them into a real testing workflow.

The Conversational Testing Problem

Consider a customer support chatbot with this conversation:

User: "I'd like to cancel my subscription"
Bot: "I'm sorry to hear that. What's the main reason you're considering cancellation?"
User: "It's too expensive"
Bot: "I understand. We do have a more affordable plan — would you like me to check if you qualify for our promotional pricing?"
User: "Sure"
Bot: (must now look up the user's account and check eligibility)

If you test only the last exchange in isolation — "Sure" → "Here's your promotional offer" — you miss the context. The bot's response to "Sure" is only correct if it remembered the entire conversation that preceded it. Testing it requires the full thread.

Literal AI's thread abstraction solves this directly.

Setup

Install the SDK:

pip install literalai

Initialize the client:

from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")

Thread and Step Tracking

A Thread in Literal AI represents a full conversation. Inside a thread, Steps represent individual turns or pipeline stages. The nesting gives you complete context for every exchange.

from literalai import LiteralClient
import openai

lai = LiteralClient(api_key="your-api-key")
oai = openai.OpenAI()

def run_support_chat(conversation_history: list[dict], user_message: str) -> str:
    """Run one turn of the support chatbot."""
    
    # Add the new user message to history
    conversation_history.append({
        "role": "user",
        "content": user_message
    })
    
    response = oai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            *conversation_history
        ]
    )
    
    assistant_message = response.choices[0].message.content
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })
    
    return assistant_message


def run_support_session(user_id: str, initial_message: str):
    """Run a full support session, tracked as a Literal AI thread."""
    
    with lai.thread(name=f"Support Session - {user_id}", 
                    metadata={"user_id": user_id, "channel": "web"}) as thread:
        
        conversation_history = []
        current_message = initial_message
        
        # Track the first user message as a step
        with lai.step(name="user-message", type="user_message") as step:
            step.input = {"message": current_message}
        
        # Run the conversation loop (in a real app, this would be event-driven)
        for turn in range(5):  # Max 5 turns
            with lai.step(name=f"turn-{turn}", type="llm") as step:
                step.input = {
                    # Copy the list: run_support_chat appends to it in place
                    "history": list(conversation_history),
                    "user_message": current_message
                }
                
                response = run_support_chat(conversation_history, current_message)
                
                step.output = {"response": response}
                step.metadata = {
                    "turn": turn,
                    "history_length": len(conversation_history)
                }
            
            print(f"Bot: {response}")
            
            # Simulate user follow-up (in real app, this is the next event)
            if turn == 0:
                current_message = "It's too expensive"
            elif turn == 1:
                current_message = "Sure, check my eligibility"
            else:
                break
        
        return thread.id

The thread in the Literal AI UI shows the full conversation with all steps, making it easy to see exactly how context accumulated and where the conversation went wrong.

Using the Instrumented OpenAI Client

For automatic step tracking without manual context managers, Literal AI provides an instrumented OpenAI client:

import openai
from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")
oai = lai.instrumented_openai(openai.OpenAI())

# All calls through oai are automatically tracked as steps
# within whatever thread is currently active
with lai.thread(name="automated-support") as thread:
    response = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello, I need help"}]
    )

The instrumented client captures token counts, latency, and cost for each LLM call automatically.

Building Evaluation Datasets from Threads

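Once you have production threads in Literal AI, you can select the interesting ones and add them to evaluation datasets. This is the core workflow for building regression suites: encounter a bug in production, add that thread to a dataset, and never regress on it again. The example below writes the dataset items out by hand, with each item's input holding the conversation up to and including the final user message.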

from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")

# Create a dataset for conversational test cases
dataset = lai.api.create_dataset(
    name="support-chatbot-v1",
    description="Regression suite for the customer support chatbot",
    type="key_value"
)

# Add items to the dataset
items = [
    {
        "input": {
            "conversation": [
                {"role": "user", "content": "I want to cancel"},
                {"role": "assistant", "content": "I'm sorry to hear that. What's the reason?"},
                {"role": "user", "content": "Too expensive"},
                {"role": "assistant", "content": "Would you like to explore our more affordable plan?"},
                {"role": "user", "content": "Yes"}
            ]
        },
        "expected_output": {
            "contains": ["promotional", "plan", "discount"],
            "should_not_contain": ["cancel", "sorry to see you go"],
            "action": "check_promotion_eligibility"
        },
        "metadata": {
            "category": "retention",
            "source": "production-failure",
            "ticket": "TKT-8823"
        }
    },
    {
        "input": {
            "conversation": [
                {"role": "user", "content": "What's your refund policy?"}
            ]
        },
        "expected_output": {
            "contains": ["30 days", "full refund"],
            "action": "provide_policy_info"
        },
        "metadata": {
            "category": "policy",
            "source": "manual",
            "difficulty": "easy"
        }
    }
]

for item in items:
    lai.api.create_dataset_item(
        dataset_id=dataset.id,
        input=item["input"],
        expected_output=item["expected_output"],
        metadata=item["metadata"]
    )

print(f"Dataset created: {dataset.id}")
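Writing items by hand gets tedious once failures start arriving from production. The sketch below shows one way to turn a recorded conversation into an item of the same shape as the entries in `items` above. `conversation_to_item` is a hypothetical helper, not part of the Literal AI SDK, and it assumes messages are dicts with `role` and `content` keys.

```python
def conversation_to_item(messages, expected_output, metadata=None):
    """Build a dataset item dict from a recorded conversation.

    Hypothetical helper: trailing assistant messages are dropped so the
    conversation ends on the user turn whose reply is being tested.
    """
    trimmed = [dict(m) for m in messages]  # copy; leave the recording untouched
    while trimmed and trimmed[-1]["role"] != "user":
        trimmed.pop()
    if not trimmed:
        raise ValueError("conversation contains no user message")
    return {
        "input": {"conversation": trimmed},
        "expected_output": expected_output,
        "metadata": metadata or {},
    }


# A recorded conversation whose last reply was the production failure
recorded = [
    {"role": "user", "content": "I want to cancel"},
    {"role": "assistant", "content": "I'm sorry to hear that. What's the reason?"},
    {"role": "user", "content": "Too expensive"},
    {"role": "assistant", "content": "Okay, I've cancelled your subscription."},
]

item = conversation_to_item(
    recorded,
    expected_output={"contains": ["plan"], "should_not_contain": ["cancelled"]},
    metadata={"category": "retention", "source": "production-failure"},
)
```

The resulting dict can be fed to the same `create_dataset_item` loop shown above. Dropping the bad reply matters: the item should capture the situation the bot was in, not the answer it got wrong.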

Scoring Functions

Literal AI's scoring system evaluates dataset items and production traces against defined quality criteria.

Manual Scoring (Human Feedback)

def add_human_score(step_id: str, score_value: float, comment: str):
    """Add a human evaluation score to a specific step."""
    lai.api.create_score(
        step_id=step_id,
        name="human-quality",
        type="HUMAN",
        value=score_value,  # 0 to 1
        comment=comment
    )

# After reviewing a conversation thread:
add_human_score(
    step_id="step-abc123",
    score_value=0.9,
    comment="Response was helpful and on-topic, minor phrasing issue"
)

Automated Scoring with LLM-as-Judge

import openai
from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")
oai = openai.OpenAI()

def score_retention_handling(
    conversation: list[dict], 
    response: str,
    step_id: str
) -> float:
    """Use GPT-4o as a judge to score how well the bot handled a retention scenario."""
    
    conversation_text = "\n".join([
        f"{msg['role'].upper()}: {msg['content']}" 
        for msg in conversation
    ])
    
    judge_prompt = f"""You are evaluating a customer support chatbot's retention handling.

Conversation history:
{conversation_text}

Bot's latest response: {response}

Rate the bot's retention handling on a scale of 0 to 1:
- 1.0: Excellently handled retention — showed empathy, offered alternatives, didn't push cancellation
- 0.7: Good handling — addressed the concern but missed an opportunity
- 0.4: Mediocre — acknowledged but didn't actively try to retain
- 0.1: Poor — facilitated cancellation without attempting retention

Return only a decimal number between 0 and 1."""
    
    judge_response = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}]
    )
    
    score_value = float(judge_response.choices[0].message.content.strip())
    
    # Record the score in Literal AI
    lai.api.create_score(
        step_id=step_id,
        name="retention-handling",
        type="AI",
        value=score_value,
        comment=f"AI judge scored retention handling at {score_value:.2f}"
    )
    
    return score_value
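One fragile spot in the function above is the `float(...)` call: judge models do not always return a bare number, and a reply like "Score: 0.7" would raise an exception. A minimal sketch of a more forgiving parser follows. `parse_judge_score` is a hypothetical helper, and rejecting out-of-range numbers (rather than clamping them) is a design assumption, on the theory that a "7" probably means the judge used a different scale.

```python
import re
from typing import Optional

# Matches forms like "0.7", "1", "1.", ".5", optionally with a leading minus
_NUMBER = re.compile(r"-?(?:\d+(?:\.\d*)?|\.\d+)")


def parse_judge_score(text: str, default: Optional[float] = None) -> float:
    """Pull the first number out of a judge reply and check it lies in [0, 1].

    If no usable number is found, return `default` when one is given,
    otherwise raise ValueError so the failure is visible.
    """
    match = _NUMBER.search(text)
    if match is not None:
        value = float(match.group())
        if 0.0 <= value <= 1.0:
            return value
    if default is not None:
        return default
    raise ValueError(f"no score between 0 and 1 in judge reply: {text!r}")
```

In `score_retention_handling`, this would replace the direct `float(...)` conversion. Whether to pass a `default` is a judgment call: a default keeps the experiment running, while raising keeps a broken judge prompt from quietly skewing the average.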

Running a Dataset Evaluation Experiment

def run_experiment(dataset_id: str, experiment_name: str):
    """Run the chatbot against the dataset and score all outputs."""
    
    items = lai.api.get_dataset_items(dataset_id=dataset_id)
    scores = []
    
    for item in items:
        conversation = item.input["conversation"]
        expected = item.expected_output
        
        # Run the chatbot on the conversation
        with lai.thread(name=f"eval-{experiment_name}") as thread:
            with lai.step(name="eval-turn", type="llm") as step:
                step.input = {"conversation": conversation}
                
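                # The dataset conversation already ends with the user message under
                # test, so pass only the earlier turns as history; run_support_chat
                # appends the final message itself. Copy to avoid mutating the item.
                history = list(conversation[:-1])
                response = run_support_chat(history, conversation[-1]["content"])
                step.output = {"response": response}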
                
                # Score the response
                score = evaluate_response(response, expected, step_id=step.id)
                scores.append(score)
    
    avg_score = sum(scores) / len(scores) if scores else 0
    print(f"Experiment '{experiment_name}': avg score = {avg_score:.3f}")
    return avg_score

def evaluate_response(response: str, expected: dict, step_id: str) -> float:
    """Multi-factor scoring of a chatbot response."""
    
    scores = []
    
    # Check required content
    if "contains" in expected:
        hits = sum(1 for kw in expected["contains"] if kw.lower() in response.lower())
        content_score = hits / len(expected["contains"])
        scores.append(content_score)
        
        lai.api.create_score(
            step_id=step_id,
            name="required-content",
            type="AI",
            value=content_score,
            comment=f"Found {hits}/{len(expected['contains'])} required keywords"
        )
    
    # Check prohibited content
    if "should_not_contain" in expected:
        violations = [
            kw for kw in expected["should_not_contain"] 
            if kw.lower() in response.lower()
        ]
        safety_score = 1.0 if not violations else 0.0
        scores.append(safety_score)
        
        lai.api.create_score(
            step_id=step_id,
            name="content-safety",
            type="AI",
            value=safety_score,
            comment=f"Violations: {violations}" if violations else "No violations"
        )
    
    return sum(scores) / len(scores) if scores else 0.0
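An average score can hide a regression: one item dropping from 1.0 to 0.5 barely moves the mean of a large dataset. The sketch below compares two runs item by item. It assumes you keep per-item scores keyed by a dataset item id, which `run_experiment` as written above does not do (it returns only the average), so `find_regressions` and its inputs are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> list[tuple[str, float, float]]:
    """Return (item_id, baseline_score, candidate_score) for items that got worse.

    Items missing from the candidate run are skipped, and drops no larger
    than `tolerance` are ignored as noise.
    """
    regressions = []
    for item_id, old in baseline.items():
        new = candidate.get(item_id)
        if new is not None and old - new > tolerance:
            regressions.append((item_id, old, new))
    # Largest drops first
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)


# Illustrative scores only, not real results
baseline = {"retention-1": 1.0, "policy-1": 0.5, "policy-2": 0.8}
candidate = {"retention-1": 0.5, "policy-1": 0.5, "policy-2": 0.78}
regressed = find_regressions(baseline, candidate)
```

Here only `retention-1` is reported; the small dip on `policy-2` falls inside the tolerance.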

Testing Multi-Turn Context Retention

A key test for conversational AI: does the bot remember what was said earlier in the thread?

import pytest
from literalai import LiteralClient

lai = LiteralClient(api_key="your-api-key")

def test_context_retention_across_turns():
    """Verify the bot correctly uses context from earlier in the conversation."""
    
    # Conversation where turn 3 requires knowing what was said in turn 1
    conversation_progression = [
        "My account number is 12345",                          # Turn 1: provides account
        "What's the weather like?",                            # Turn 2: off-topic
        "What account number did I just tell you?",            # Turn 3: tests memory
    ]
    
    conversation_history = []
    responses = []
    
    with lai.thread(name="context-retention-test") as thread:
        for message in conversation_progression:
            with lai.step(name="turn") as step:
                # Copy the history: run_support_chat appends to the list in place
                step.input = {"message": message, "history": list(conversation_history)}
                response = run_support_chat(conversation_history, message)
                step.output = {"response": response}
                responses.append(response)
    
    # The bot should recall "12345" in response to the third message
    final_response = responses[-1]
    assert "12345" in final_response, (
        f"Bot failed to recall account number from turn 1. "
        f"Got: '{final_response}'. "
        f"Thread: https://cloud.literal.ai/threads/{thread.id}"
    )

def test_intent_continuity():
    """Verify the bot maintains the intent context across topic switches."""
    
    with lai.thread(name="intent-continuity-test") as thread:
        # Establish a refund request
        with lai.step(name="turn-1") as step:
            h1 = []
            r1 = run_support_chat(h1, "I need a refund for my purchase from last week")
            step.output = {"response": r1}
        
        # Slightly off-topic
        with lai.step(name="turn-2") as step:
            r2 = run_support_chat(h1, "Actually, what IS your refund policy exactly?")
            step.output = {"response": r2}
        
        # Return to the original request
        with lai.step(name="turn-3") as step:
            r3 = run_support_chat(h1, "OK, so can I get my refund?")
            step.output = {"response": r3}
        
        # Turn 3 should still be about the refund, not asking for details again
        refund_keywords = ["refund", "process", "initiate", "submit"]
        has_refund_content = any(kw in r3.lower() for kw in refund_keywords)
        
        assert has_refund_content, (
            f"Bot lost intent context. Turn 3 response: '{r3}'. "
            f"Thread: https://cloud.literal.ai/threads/{thread.id}"
        )
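Tests like these call a live model, so it can help to check the harness itself offline first. The sketch below uses a fake chat function with the same signature as `run_support_chat` (it appends to the history list in place and returns the reply). Both `fake_support_chat` and `replay` are hypothetical names introduced for illustration; the point is only that the "12345" assertion passes when history is carried forward and fails when it is dropped.

```python
import re


def fake_support_chat(conversation_history: list, user_message: str) -> str:
    """Stand-in for run_support_chat that answers from the history alone."""
    conversation_history.append({"role": "user", "content": user_message})
    # Look through every user turn so far for a five-digit account number
    seen = " ".join(m["content"] for m in conversation_history if m["role"] == "user")
    numbers = re.findall(r"\b\d{5}\b", seen)
    asks_for_account = "account number" in user_message.lower() and "?" in user_message
    if asks_for_account and numbers:
        reply = f"You told me your account number is {numbers[0]}."
    else:
        reply = "Thanks, noted."
    conversation_history.append({"role": "assistant", "content": reply})
    return reply


def replay(chat_fn, user_messages):
    """Feed user messages through chat_fn, sharing one history list."""
    history, responses = [], []
    for message in user_messages:
        responses.append(chat_fn(history, message))
    return history, responses


messages = [
    "My account number is 12345",
    "What's the weather like?",
    "What account number did I just tell you?",
]

# With shared history the final reply can recall the number
history, responses = replay(fake_support_chat, messages)

# With a fresh history on every turn it cannot
stateless_responses = [fake_support_chat([], m) for m in messages]
```

Swapping `fake_support_chat` for `run_support_chat` gives the live version of the same check.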

CI Integration

# .github/workflows/conversational-ai-tests.yml
name: Conversational AI Tests

on:
  pull_request:
    paths:
      - 'src/chatbot/**'
      - 'prompts/**'

jobs:
  dataset-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install dependencies
        run: pip install literalai openai pytest
      
      - name: Run conversation tests
        run: pytest tests/conversation/ -v
        env:
          LITERAL_API_KEY: ${{ secrets.LITERAL_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      
      - name: Run dataset evaluation
        run: |
          python scripts/run_literal_eval.py \
            --dataset support-chatbot-v1 \
            --experiment ${{ github.sha }} \
            --min-score 0.80
        env:
          LITERAL_API_KEY: ${{ secrets.LITERAL_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
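The workflow calls `scripts/run_literal_eval.py`, which this guide does not show. A minimal sketch of its gating logic might look like the following. The flag names come from the workflow above; everything else is an assumption, including `run_experiment_fn`, which stands in for the `run_experiment` function from earlier and is assumed to accept whatever the `--dataset` flag carries (resolving the dataset name to an id is left out).

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Gate CI on a dataset evaluation score")
    parser.add_argument("--dataset", required=True, help="dataset name or id")
    parser.add_argument("--experiment", required=True, help="label for this run, e.g. the commit SHA")
    parser.add_argument("--min-score", type=float, default=0.8, help="fail below this average score")
    return parser.parse_args(argv)


def main(argv, run_experiment_fn) -> int:
    """Return a process exit code: 0 if the average score meets the threshold, else 1."""
    args = parse_args(argv)
    avg_score = run_experiment_fn(args.dataset, args.experiment)
    if avg_score < args.min_score:
        print(f"FAIL: avg score {avg_score:.3f} is below threshold {args.min_score:.2f}")
        return 1
    print(f"PASS: avg score {avg_score:.3f} meets threshold {args.min_score:.2f}")
    return 0
```

In the real script, the last line would be something like `sys.exit(main(sys.argv[1:], run_experiment))`, so that a non-zero exit code fails the GitHub Actions step.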

Integrating with Application-Level Testing

Literal AI covers conversational AI evaluation — the quality of the LLM's responses across multi-turn threads. It doesn't cover how your chat UI behaves, how messages are rendered, or how the application handles network errors or timeouts.

HelpMeTest fills that gap with Robot Framework + Playwright automation. A complete test strategy for a conversational AI product looks like:

Literal AI dataset eval — does the chatbot give correct responses across regression test cases?
HelpMeTest E2E tests — does the chat UI load correctly, can users send messages, are responses displayed properly, does the typing indicator work?
Production thread monitoring — are real user threads getting low scores that indicate model drift?

The Literal AI dashboard gives you the AI quality view. HelpMeTest gives you the user experience view. Both are necessary for shipping confidently.
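The production monitoring item in the list above can be approximated with a simple comparison of mean scores. This is a rough sketch under stated assumptions: `detect_score_drift` is a hypothetical helper, the scores are assumed to have been exported from your scoring pipeline as plain floats, and the thresholds are placeholders to tune.

```python
from statistics import mean


def detect_score_drift(baseline_scores, recent_scores, max_drop=0.1, min_samples=20):
    """Return True when the recent mean score fell more than max_drop below baseline.

    Returns False when there are too few recent samples to judge.
    """
    if len(recent_scores) < min_samples or not baseline_scores:
        return False
    return mean(baseline_scores) - mean(recent_scores) > max_drop


# Illustrative numbers only
baseline = [0.9] * 50
drifted = detect_score_drift(baseline, [0.7] * 25)
stable = detect_score_drift(baseline, [0.85] * 25)
```

A mean comparison is deliberately crude; it is enough to trigger a closer look at the low-scoring threads, not to diagnose the cause.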

Key Concepts Recap

Concept	What It Is	When You Use It
Thread	A complete conversation session	Always — every conversation is a thread
Step	A single action within a thread	LLM calls, tool calls, retrieval steps
Dataset	Collection of test conversations	Regression testing
Score	Quality rating for a step or thread	Automated and human evaluation
Experiment	A run of your pipeline against a dataset	Before-after comparison of changes

Summary

Literal AI's thread-centric design makes it the right fit for testing conversational AI. By capturing full conversation context — not just individual LLM calls — it enables you to test whether your chatbot correctly uses context, maintains intent across topic switches, and responds appropriately at every turn based on what came before.

The dataset and scoring system provides the regression safety net: once you identify a failure in production, you add it to the dataset and it becomes a permanent test case. The CI integration ensures that future changes can't break those cases silently. Combined with end-to-end application testing, it's a complete quality foundation for AI-powered conversational products.