Testing LLM Structured Outputs: JSON Mode, Schemas, and Validation
Structured outputs (JSON mode, response schemas) make LLMs return parseable data instead of free text. They don't make LLMs deterministic. Testing structured outputs means validating schema compliance, semantic correctness, edge case handling, and behavior when the model can't produce valid output — none of which JSON validation alone covers.
Key Takeaways
Schema compliance is necessary but not sufficient. A response that's valid JSON matching your schema can still be semantically wrong. {"sentiment": "positive", "score": -0.9} passes schema validation but is contradictory.
Test the failure modes, not just the happy path. What does your app do when the model returns null for a required field? When it returns a valid schema with nonsense values? When it hits max tokens mid-JSON? These are the production failures.
Pydantic + instructor is a common choice for Python. The instructor library patches the OpenAI client so that validation failures trigger automatic retries, and it returns typed Python objects instead of raw JSON strings.
Run schema validation at the API boundary, not just in tests. Use JSON Schema or Pydantic validation every time you receive a structured output in production, not just during development.
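To make the first takeaway concrete, here is a minimal, library-agnostic sketch of a semantic check that catches the contradictory `{"sentiment": "positive", "score": -0.9}` response. The function name `check_sentiment_consistency` and the exact rules are illustrative assumptions, not part of any library; in a Pydantic model the same rule would typically live in a validator.

```python
def check_sentiment_consistency(data: dict) -> list[str]:
    """Return a list of semantic problems; an empty list means consistent.

    Hypothetical helper: the rules here are assumptions chosen for illustration.
    """
    sentiment = data.get("sentiment")
    score = data.get("score")

    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return ["score is missing or not numeric"]

    problems = []
    if sentiment == "positive" and score <= 0:
        problems.append(f"positive label with non-positive score {score}")
    if sentiment == "negative" and score >= 0:
        problems.append(f"negative label with non-negative score {score}")
    return problems


# Schema-valid but contradictory response from the takeaway above
contradictory = {"sentiment": "positive", "score": -0.9}
print(check_sentiment_consistency(contradictory))
```

A check like this runs after schema validation: the schema confirms the fields exist and have the right types, and the semantic check confirms they agree with each other.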
Why Structured Output Testing Is Different
Getting an LLM to return JSON seems simple — add "respond in JSON" to your prompt, parse the response. The problems:
- Models sometimes return valid JSON that fails your business logic: correct structure, wrong values
- Edge case inputs produce edge case outputs: unusual inputs can produce malformed or incomplete JSON even with strict schemas
- Model version updates break output formats: gpt-4o-mini may structure its JSON differently from gpt-4o
- Token limits truncate mid-JSON: often not caught until you hit a long document in production
Testing structured outputs means testing all of these, not just "is it valid JSON?"
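The truncation case in the list above can be detected before parsing. The sketch below assumes you have the raw message content and the `finish_reason` string from a Chat Completions response (the OpenAI API reports `"length"` when generation stops at the token limit). The names `parse_model_json` and `TruncatedOutputError` are hypothetical helpers introduced for illustration.

```python
import json


class TruncatedOutputError(ValueError):
    """Raised when the model stopped because it ran out of output tokens."""


def parse_model_json(content: str, finish_reason: str) -> dict:
    """Parse model output, distinguishing truncation from other JSON errors."""
    # Check the stop reason first: truncated JSON is not worth parsing
    if finish_reason == "length":
        raise TruncatedOutputError(
            "response hit the token limit; JSON is likely incomplete"
        )
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc.msg}") from exc
    # json.loads also accepts arrays and scalars; this sketch expects an object
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed


# Usage with values as they would come from response.choices[0]
print(parse_model_json('{"name": "Widget", "price": 9.99}', "stop"))
```

Separating the two error types lets a caller retry with a higher token limit on truncation while treating other parse failures as prompt or model problems.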
Setting Up Structured Outputs
OpenAI JSON Mode
from openai import OpenAI
import json

client = OpenAI()


def extract_product_info(description: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": """Extract product information as JSON with this structure:
{
  "name": "product name",
  "price": 0.00,
  "currency": "USD",
  "in_stock": true,
  "category": "category name"
}"""
            },
            {"role": "user", "content": description}
        ]
    )
    return json.loads(response.choices[0].message.content)

OpenAI Strict Structured Outputs (GPT-4o+)
from pydantic import BaseModel


class ProductInfo(BaseModel):
    name: str
    price: float
    currency: str
    in_stock: bool
    category: str
    tags: list[str] = []


def extract_product_strict(description: str) -> ProductInfo:
    # Reuses the OpenAI client created in the JSON mode example
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract product info from the description."},
            {"role": "user", "content": description}
        ],
        response_format=ProductInfo
    )
    return response.choices[0].message.parsed

Instructor (Recommended for Production)
import instructor
from openai import OpenAI
from pydantic import BaseModel, field_validator

client = instructor.from_openai(OpenAI())


class SentimentAnalysis(BaseModel):
    sentiment: str  # "positive", "negative", "neutral"
    score: float  # -1.0 to 1.0
    reasoning: str

    # Pydantic v2 syntax; on Pydantic v1 use @validator instead
    @field_validator("sentiment")
    @classmethod
    def sentiment_must_be_valid(cls, v):
        if v not in ["positive", "negative", "neutral"]:
            raise ValueError(f"Invalid sentiment: {v}")
        return v

    @field_validator("score")
    @classmethod
    def score_must_be_in_range(cls, v):
        if not -1.0 <= v <= 1.0:
            raise ValueError(f"Score out of range: {v}")
        return v


def analyze_sentiment(text: str) -> SentimentAnalysis:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=SentimentAnalysis,
        messages=[{"role": "user", "content": f"Analyze the sentiment: {text}"}]
    )

Testing Schema Compliance
Basic Schema Tests
import pytest
import jsonschema

PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["name", "price", "currency", "in_stock", "category"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "in_stock": {"type": "boolean"},
        "category": {"type": "string"}
    },
    "additionalProperties": False
}


def validate_product_schema(data: dict) -> None:
    jsonschema.validate(instance=data, schema=PRODUCT_SCHEMA)


@pytest.mark.llm
def test_product_extraction_matches_schema():
    result = extract_product_info(
        "Nike Air Max 90 running shoes, $129.99, available in sizes 8-13"
    )
    validate_product_schema(result)  # Raises on failure


@pytest.mark.llm
def test_product_price_is_numeric():
    result = extract_product_info("Laptop for $999")
    assert isinstance(result["price"], (int, float))
    assert result["price"] > 0


@pytest.mark.llm
def test_currency_is_uppercase_iso():
    result = extract_product_info("Widget costs 49.99 USD")
    assert result["currency"] == "USD"
    assert len(result["currency"]) == 3

Semantic Correctness Tests
Schema validation catches structure, not meaning. Add semantic assertions:
@pytest.mark.llm
def test_positive_review_has_positive_sentiment():
    result = analyze_sentiment(
        "This product is absolutely amazing! Best purchase I've ever made."
    )
    assert result.sentiment == "positive"
    assert result.score > 0.5, f"Expected score > 0.5, got {result.score}"


@pytest.mark.llm
def test_sentiment_score_matches_label():
    """Score direction must match sentiment label (a common model inconsistency)"""
    result = analyze_sentiment("This is the worst product I've ever used.")
    assert result.sentiment == "negative"
    assert result.score < 0, (
        f"Negative sentiment should have negative score, got {result.score}"
    )


@pytest.mark.llm
def test_neutral_text_produces_neutral_result():
    result = analyze_sentiment(
        "The product arrived on Tuesday. It was in a brown box."
    )
    assert result.sentiment == "neutral"
    assert -0.3 <= result.score <= 0.3

Testing Edge Cases
@pytest.mark.llm
def test_minimal_input_returns_structured_response():
    """Model should return valid structure even for minimal input"""
    result = extract_product_info("Widget")
    validate_product_schema(result)
    # Name should be populated
    assert result["name"] is not None
    assert len(result["name"]) > 0


@pytest.mark.llm
def test_non_product_input_is_handled():
    """Input that's not a product description: what does the model do?"""
    result = extract_product_info(
        "The weather today is sunny with a high of 72 degrees."
    )
    # Model should still return valid schema
    validate_product_schema(result)
    # Price should be 0 or null for non-product text
    # (depends on your prompt; test whatever behavior you specified).
    # Note that PRODUCT_SCHEMA as written rejects null for price.


@pytest.mark.llm
def test_multilingual_input():
    result = extract_product_info("Casque audio Sony WH-1000XM5, 349€, disponible")
    validate_product_schema(result)
    assert result["price"] == 349.0
    assert result["currency"] == "EUR"


@pytest.mark.llm
def test_very_long_product_description():
    """Long inputs should not truncate mid-JSON"""
    long_description = "This amazing product " + ("does many great things " * 500)
    result = extract_product_info(long_description)
    validate_product_schema(result)

Testing Instructor Retry Behavior
The instructor library retries when Pydantic validation fails. Test that this works:
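import instructor
import pytest
from openai import OpenAI
from openai.types.chat import ChatCompletion


@pytest.mark.llm
def test_instructor_retries_on_invalid_schema():
    """Instructor should retry when model returns invalid output"""
    call_count = 0
    raw_client = OpenAI()
    original_create = raw_client.chat.completions.create

    def count_calls(*args, **kwargs):
        nonlocal call_count
        call_count += 1
        return original_create(*args, **kwargs)

    # Wrap the underlying OpenAI call before instructor patches it, so every
    # retry is counted (wrapping instructor's own create would only count 1)
    raw_client.chat.completions.create = count_calls
    counting_client = instructor.from_openai(raw_client)
    # Use an adversarial input likely to cause validation issues
    result = counting_client.chat.completions.create(
        model="gpt-4o",
        response_model=SentimentAnalysis,
        messages=[{"role": "user", "content": "Analyze the sentiment: !!!"}],
    )
    # Either succeeded on first try or retried
    assert isinstance(result, SentimentAnalysis)
    # If call_count > 1, instructor retried, which is correct behavior
    assert call_count >= 1


def test_instructor_raises_after_max_retries():
    """After max retries, instructor should raise, not return garbage"""
    # Force consistently invalid output: a real ChatCompletion object whose
    # JSON content fails the SentimentAnalysis validators
    bad_completion = ChatCompletion.model_validate({
        "id": "test",
        "object": "chat.completion",
        "created": 0,
        "model": "gpt-4o",
        "choices": [{
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": '{"sentiment": "INVALID_VALUE", "score": 999}',
            },
        }],
    })
    raw_client = OpenAI(api_key="test-key")  # never contacted; create is stubbed
    raw_client.chat.completions.create = lambda *args, **kwargs: bad_completion
    # JSON mode is used on the assumption that instructor then parses message.content
    stubbed_client = instructor.from_openai(raw_client, mode=instructor.Mode.JSON)
    with pytest.raises(Exception):  # instructor.exceptions.InstructorRetryException
        stubbed_client.chat.completions.create(
            model="gpt-4o",
            response_model=SentimentAnalysis,
            max_retries=2,
            messages=[{"role": "user", "content": "test"}],
        )

Testing Anthropic Structured Outputs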
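The example below relies on prompt instructions rather than a native JSON mode. Claude generally follows structured output instructions reliably, but the response is still plain text, so parse and validate it like any other model output: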
import anthropic
import json
import pytest

client = anthropic.Anthropic()


def extract_with_claude(text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a data extraction API. Always respond with valid JSON only.
No explanation, no markdown, just the JSON object.
Schema: {"entities": [{"name": str, "type": str, "confidence": float}]}
Allowed values for "type": PERSON, ORGANIZATION, LOCATION, PRODUCT""",
        messages=[{"role": "user", "content": text}]
    )
    raw = response.content[0].text
    return json.loads(raw)  # Will raise if not valid JSON


@pytest.mark.llm
def test_claude_returns_valid_json():
    result = extract_with_claude("Apple and Microsoft announced a partnership.")
    assert "entities" in result
    assert isinstance(result["entities"], list)
    assert len(result["entities"]) >= 2


@pytest.mark.llm
def test_claude_entity_types_are_valid():
    result = extract_with_claude("Elon Musk visited Tesla's Austin factory.")
    entity_types = {e["type"] for e in result["entities"]}
    # Must match the allowed values listed in the system prompt
    valid_types = {"PERSON", "ORGANIZATION", "LOCATION", "PRODUCT"}
    assert entity_types.issubset(valid_types), f"Unexpected types: {entity_types - valid_types}"

Regression Testing for Schema Changes
When you update your output schema, you need to verify the model still produces valid outputs. Create a golden test set:
import json
from pathlib import Path

import jsonschema
import pytest

GOLDEN_INPUTS_FILE = Path("tests/fixtures/product_extraction_inputs.json")
GOLDEN_OUTPUTS_FILE = Path("tests/fixtures/product_extraction_expected.json")


@pytest.mark.llm
def test_schema_regression():
    """Run the same inputs as before and check the schema still holds"""
    inputs = json.loads(GOLDEN_INPUTS_FILE.read_text())
    for test_case in inputs:
        result = extract_product_info(test_case["input"])
        try:
            validate_product_schema(result)
        except jsonschema.ValidationError as e:
            pytest.fail(f"Schema regression for input '{test_case['id']}': {e.message}")


# Generate golden dataset
def generate_golden_dataset():
    test_inputs = [
        {"id": "basic_product", "input": "Nike shoes $99"},
        {"id": "european_price", "input": "Watch €299 EUR"},
        {"id": "out_of_stock", "input": "Sold out - vintage lamp $45"},
    ]
    # Record the current outputs, keyed by test case id, for later comparison
    expected = {item["id"]: extract_product_info(item["input"]) for item in test_inputs}
    GOLDEN_INPUTS_FILE.parent.mkdir(parents=True, exist_ok=True)
    GOLDEN_INPUTS_FILE.write_text(json.dumps(test_inputs, indent=2))
    GOLDEN_OUTPUTS_FILE.write_text(json.dumps(expected, indent=2))

CI Configuration for Structured Output Tests
# .github/workflows/llm-tests.yml
name: LLM Structured Output Tests
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 9 * * 1'  # Weekly on Monday
jobs:
  structured-output-tests:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install pytest openai instructor pydantic jsonschema
      # --junitxml writes the report file that the upload step below expects
      - run: pytest tests/llm/ -v --tb=short --junitxml=test-results.xml
        continue-on-error: true  # Don't block deploys on LLM flakiness
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: llm-test-results
          path: test-results.xml

Production Validation
Add runtime validation to catch structured output failures in production:
import json
import logging

from pydantic import ValidationError

logger = logging.getLogger(__name__)


def safe_extract_product(description: str) -> ProductInfo | None:
    """Extract with runtime validation and fallback"""
    try:
        raw = extract_product_info(description)
        return ProductInfo(**raw)
    except (json.JSONDecodeError, ValidationError) as e:
        logger.error(
            "Structured output validation failed",
            extra={
                "error": str(e),
                "input_length": len(description),
                "model": "gpt-4o"
            }
        )
        return None
    except Exception:
        logger.error("Unexpected structured output failure", exc_info=True)
        return None

Track your validation failure rate. As a rough rule of thumb, a rate above 1-2% suggests a prompt or model issue that needs investigation.
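One way to track that failure rate is a small sliding-window counter that is updated after every extraction attempt. This is a minimal sketch, not a replacement for a real metrics system; the class name `FailureRateTracker`, the window size, and the 2% default threshold are assumptions chosen to match the rule of thumb above.

```python
from collections import deque


class FailureRateTracker:
    """Track the validation failure rate over the most recent N attempts.

    Hypothetical helper for illustration; in production you would more
    likely emit a counter to your existing metrics backend.
    """

    def __init__(self, window: int = 1000):
        # deque with maxlen drops the oldest outcome automatically
        self._outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        self._outcomes.append(ok)

    @property
    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def should_alert(self, threshold: float = 0.02) -> bool:
        return self.failure_rate > threshold


# Usage: record True when validation succeeded, False when it failed
tracker = FailureRateTracker(window=100)
for _ in range(97):
    tracker.record(True)
for _ in range(3):
    tracker.record(False)
print(f"failure rate: {tracker.failure_rate:.2%}, alert: {tracker.should_alert()}")
```

In the `safe_extract_product` flow, a call such as `tracker.record(result is not None)` after each extraction would feed the tracker.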
Key Takeaways
Structured outputs reduce JSON parsing failures but don't eliminate them — and they don't make your app's logic correct. Test at three levels: schema compliance (is it valid JSON matching your schema?), semantic correctness (are the values meaningful?), and edge case behavior (what happens with empty/unusual inputs?). Use Pydantic validators to encode business rules, not just type constraints. Add runtime validation in production and track your failure rate over time.