Testing LLM Structured Outputs: JSON Mode, Schemas, and Validation

Structured outputs (JSON mode, response schemas) make LLMs return parseable data instead of free text. They don't make LLMs deterministic. Testing structured outputs means validating schema compliance, semantic correctness, edge case handling, and behavior when the model can't produce valid output — none of which JSON validation alone covers.

Key Takeaways

Schema compliance is necessary but not sufficient. A response that's valid JSON matching your schema can still be semantically wrong. {"sentiment": "positive", "score": -0.9} passes schema validation but is contradictory.

Test the failure modes, not just the happy path. What does your app do when the model returns null for a required field? When it returns a valid schema with nonsense values? When it hits max tokens mid-JSON? These are the production failures.

Pydantic + instructor is a widely used combination for Python. The instructor library patches the OpenAI client to automatically retry on validation failures, giving you typed Python objects back instead of raw JSON strings.

Run schema validation at the API boundary, not just in tests. Use JSON Schema or Pydantic validation every time you receive a structured output in production, not just during development.

Why Structured Output Testing Is Different

Getting an LLM to return JSON seems simple — add "respond in JSON" to your prompt, parse the response. The problems:

  1. Models sometimes return valid JSON that fails your business logic: correct structure, wrong values
  2. Edge case inputs produce edge case outputs: unusual inputs can lead to refusals, truncated output, or malformed JSON, even when a schema is specified
  3. Model changes break output formats: a different model or version (for example, switching between gpt-4o and gpt-4o-mini) can structure its JSON differently
  4. Token limits truncate mid-JSON: often not caught until you hit a long document in production

Testing structured outputs means testing all of these, not just "is it valid JSON?"
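
Truncation is the easiest of these failures to separate out before parsing. The following is a minimal sketch, assuming you pass in the raw response text together with the finish reason from the API response (OpenAI chat completions report a finish reason of "length" when the output hit the token limit). The names parse_structured and TruncatedOutputError are illustrative and not part of any library.

```python
import json


class TruncatedOutputError(ValueError):
    """Raised when the model stopped because it ran out of output tokens."""


def parse_structured(content: str, finish_reason: str) -> dict:
    """Parse a JSON response, reporting truncation separately from other malformed output."""
    if finish_reason == "length":
        # The token limit was hit, so the JSON is very likely cut off mid-object.
        raise TruncatedOutputError(
            f"Output truncated after {len(content)} characters"
        )
    data = json.loads(content)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object, got {type(data).__name__}")
    return data
```

Raising a dedicated exception lets callers react differently to truncation (raise the token limit, shorten the input) than to output that is simply malformed.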

Setting Up Structured Outputs

OpenAI JSON Mode

from openai import OpenAI
import json

client = OpenAI()

def extract_product_info(description: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": """Extract product information as JSON with this structure:
                {
                    "name": "product name",
                    "price": 0.00,
                    "currency": "USD",
                    "in_stock": true,
                    "category": "category name"
                }"""
            },
            {"role": "user", "content": description}
        ]
    )
    return json.loads(response.choices[0].message.content)

OpenAI Strict Structured Outputs (GPT-4o+)

from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: float
    currency: str
    in_stock: bool
    category: str
    tags: list[str] = []

def extract_product_strict(description: str) -> ProductInfo:
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract product info from the description."},
            {"role": "user", "content": description}
        ],
        response_format=ProductInfo
    )
    return response.choices[0].message.parsed

Instructor with Pydantic Validators

import instructor
from openai import OpenAI
from pydantic import BaseModel, field_validator

client = instructor.from_openai(OpenAI())

class SentimentAnalysis(BaseModel):
    sentiment: str  # "positive", "negative", "neutral"
    score: float    # -1.0 to 1.0
    reasoning: str

    # Pydantic v2 style: field_validator replaces the deprecated v1 validator
    @field_validator("sentiment")
    @classmethod
    def sentiment_must_be_valid(cls, v: str) -> str:
        if v not in ["positive", "negative", "neutral"]:
            raise ValueError(f"Invalid sentiment: {v}")
        return v

    @field_validator("score")
    @classmethod
    def score_must_be_in_range(cls, v: float) -> float:
        if not -1.0 <= v <= 1.0:
            raise ValueError(f"Score out of range: {v}")
        return v

def analyze_sentiment(text: str) -> SentimentAnalysis:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=SentimentAnalysis,
        messages=[{"role": "user", "content": f"Analyze the sentiment: {text}"}]
    )

Testing Schema Compliance

Basic Schema Tests

import pytest
import jsonschema

PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["name", "price", "currency", "in_stock", "category"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "in_stock": {"type": "boolean"},
        "category": {"type": "string"}
    },
    "additionalProperties": False
}

def validate_product_schema(data: dict) -> None:
    jsonschema.validate(instance=data, schema=PRODUCT_SCHEMA)

@pytest.mark.llm
def test_product_extraction_matches_schema():
    result = extract_product_info(
        "Nike Air Max 90 running shoes, $129.99, available in sizes 8-13"
    )
    validate_product_schema(result)  # Raises on failure

@pytest.mark.llm
def test_product_price_is_numeric():
    result = extract_product_info("Laptop for $999")
    assert isinstance(result["price"], (int, float))
    assert result["price"] > 0

@pytest.mark.llm
def test_currency_is_uppercase_iso():
    result = extract_product_info("Widget costs 49.99 USD")
    assert result["currency"] == "USD"
    assert len(result["currency"]) == 3

Semantic Correctness Tests

Schema validation catches structure, not meaning. Add semantic assertions:

@pytest.mark.llm
def test_positive_review_has_positive_sentiment():
    result = analyze_sentiment(
        "This product is absolutely amazing! Best purchase I've ever made."
    )
    assert result.sentiment == "positive"
    assert result.score > 0.5, f"Expected score > 0.5, got {result.score}"

@pytest.mark.llm
def test_sentiment_score_matches_label():
    """Score direction must match sentiment label — a common model inconsistency"""
    result = analyze_sentiment("This is the worst product I've ever used.")
    assert result.sentiment == "negative"
    assert result.score < 0, (
        f"Negative sentiment should have negative score, got {result.score}"
    )

@pytest.mark.llm
def test_neutral_text_produces_neutral_result():
    result = analyze_sentiment(
        "The product arrived on Tuesday. It was in a brown box."
    )
    assert result.sentiment == "neutral"
    assert -0.3 <= result.score <= 0.3
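
The label and score assertions above can be folded into one reusable cross-field check. This is a minimal sketch: the function name check_sentiment_consistency is hypothetical, and the 0.3 neutral band is an assumption taken from the neutral test above rather than a recommended value.

```python
def check_sentiment_consistency(
    sentiment: str, score: float, neutral_band: float = 0.3
) -> list[str]:
    """Return a list of problems; an empty list means label and score agree."""
    problems = []
    if sentiment == "positive" and score <= 0:
        problems.append(f"positive label with non-positive score {score}")
    elif sentiment == "negative" and score >= 0:
        problems.append(f"negative label with non-negative score {score}")
    elif sentiment == "neutral" and abs(score) > neutral_band:
        problems.append(f"neutral label with score {score} outside the neutral band")
    return problems
```

A test can then assert that check_sentiment_consistency(result.sentiment, result.score) returns an empty list, which catches contradictory output such as a positive label with a score of -0.9.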

Testing Edge Cases

@pytest.mark.llm
def test_minimal_input_returns_structured_response():
    """Model should return valid structure even for minimal input"""
    result = extract_product_info("Widget")
    validate_product_schema(result)
    # Name should be populated
    assert result["name"] is not None
    assert len(result["name"]) > 0

@pytest.mark.llm
def test_non_product_input_is_handled():
    """Input that's not a product description — what does the model do?"""
    result = extract_product_info(
        "The weather today is sunny with a high of 72 degrees."
    )
    # Model should still return valid schema
    validate_product_schema(result)
    # Price should be 0 for non-product text; null would fail PRODUCT_SCHEMA,
    # which requires a number (adjust the schema and this test to whatever
    # behavior your prompt specifies)

@pytest.mark.llm
def test_multilingual_input():
    result = extract_product_info("Casque audio Sony WH-1000XM5, 349€, disponible")
    validate_product_schema(result)
    assert result["price"] == 349.0
    assert result["currency"] == "EUR"

@pytest.mark.llm
def test_very_long_product_description():
    """Long inputs should not truncate mid-JSON"""
    long_description = "This amazing product " + ("does many great things " * 500)
    result = extract_product_info(long_description)
    validate_product_schema(result)
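
For extraction tasks, a cheap guard against plausible-looking but invented values is to check that the extracted price actually appears in the input. The sketch below is one way to do that, with hypothetical helper names (numbers_in_text, price_is_grounded). It treats both "." and "," as decimal separators and does not handle thousands separators such as "1,299", so treat it as a starting point only.

```python
import re


def numbers_in_text(text: str) -> set[float]:
    """Collect numeric values found in the text, e.g. '$129.99' yields 129.99."""
    values = set()
    for match in re.findall(r"\d+(?:[.,]\d+)?", text):
        # Treat a comma as a decimal separator (common in European prices).
        values.add(float(match.replace(",", ".")))
    return values


def price_is_grounded(description: str, price: float, tolerance: float = 0.005) -> bool:
    """True if the extracted price matches some number present in the description."""
    return any(abs(price - value) <= tolerance for value in numbers_in_text(description))
```

In a test this becomes assert price_is_grounded(description, result["price"]), which fails when the model returns a schema-valid price that never appeared in the source text.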

Testing Instructor Retry Behavior

The instructor library retries when Pydantic validation fails. Test that this works:

import instructor
from unittest.mock import patch, MagicMock

@pytest.mark.llm
def test_instructor_retries_on_invalid_schema():
    """Instructor should still return a valid object for awkward input"""
    call_count = 0
    # `client` is the instructor-patched client created with instructor.from_openai
    original_create = client.chat.completions.create

    def count_calls(*args, **kwargs):
        nonlocal call_count
        call_count += 1
        return original_create(*args, **kwargs)

    with patch.object(client.chat.completions, "create", side_effect=count_calls):
        # Use an adversarial input likely to cause validation issues
        result = analyze_sentiment("!!!")

    # Any retries happen inside instructor, below this wrapper, so the wrapper
    # sees exactly one call whether or not the model needed a second attempt
    assert call_count == 1
    assert isinstance(result, SentimentAnalysis)

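from openai import OpenAI
from openai.types.chat import ChatCompletion

def test_instructor_raises_after_max_retries():
    """After max retries, instructor should raise, not return garbage"""
    # A real ChatCompletion whose content always fails SentimentAnalysis validation
    bad_completion = ChatCompletion.model_validate({
        "id": "chatcmpl-test",
        "object": "chat.completion",
        "created": 0,
        "model": "gpt-4o",
        "choices": [{
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": '{"sentiment": "INVALID_VALUE", "score": 999}',
            },
        }],
    })

    # Stub the network call on a real client so instructor accepts the client type
    openai_client = OpenAI(api_key="test-key")
    openai_client.chat.completions.create = MagicMock(return_value=bad_completion)
    # JSON mode makes instructor parse message.content rather than tool calls
    mock_client = instructor.from_openai(openai_client, mode=instructor.Mode.JSON)

    with pytest.raises(Exception):  # instructor.exceptions.InstructorRetryException
        mock_client.chat.completions.create(
            model="gpt-4o",
            response_model=SentimentAnalysis,
            max_retries=2,
            messages=[{"role": "user", "content": "test"}]
        )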
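
It helps to know what a validate-and-retry loop looks like when deciding what to assert. The sketch below is a conceptual illustration only and is not instructor's implementation. The function generate_with_retries and its generate and validate callables are hypothetical stand-ins for the model call and the Pydantic validation step.

```python
import json
from typing import Callable, Optional


def generate_with_retries(
    generate: Callable[[Optional[str]], str],
    validate: Callable[[dict], None],
    max_attempts: int = 2,
) -> dict:
    """Call generate until its output parses and validates, feeding errors back in."""
    feedback = None
    last_error = None
    for _ in range(max_attempts):
        raw = generate(feedback)
        try:
            data = json.loads(raw)
            validate(data)  # expected to raise ValueError on bad values
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            # The next attempt receives the error so the model can correct itself.
            feedback = f"Previous output was invalid: {exc}"
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error
```

The two behaviors worth testing fall out of this shape: a bad first attempt followed by a good one should succeed, and consistently bad output should raise instead of returning unvalidated data.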

Testing Anthropic Structured Outputs

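This example relies on prompt instructions rather than a provider-enforced JSON mode. Claude generally follows structured output instructions reliably, but the response still has to be parsed and validated: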

import anthropic
import json

client = anthropic.Anthropic()

def extract_with_claude(text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are a data extraction API. Always respond with valid JSON only.
        No explanation, no markdown, just the JSON object.
        Schema: {"entities": [{"name": str, "type": str, "confidence": float}]}
        "type" must be one of: PERSON, ORGANIZATION, LOCATION, PRODUCT.""",
        messages=[{"role": "user", "content": text}]
    )
    
    raw = response.content[0].text
    return json.loads(raw)  # Will raise if not valid JSON

@pytest.mark.llm
def test_claude_returns_valid_json():
    result = extract_with_claude("Apple and Microsoft announced a partnership.")
    assert "entities" in result
    assert isinstance(result["entities"], list)
    assert len(result["entities"]) >= 2

@pytest.mark.llm
def test_claude_entity_types_are_valid():
    result = extract_with_claude("Elon Musk visited Tesla's Austin factory.")
    entity_types = {e["type"] for e in result["entities"]}
    valid_types = {"PERSON", "ORGANIZATION", "LOCATION", "PRODUCT"}
    assert entity_types.issubset(valid_types), f"Unexpected types: {entity_types - valid_types}"
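
When the format is only requested in the prompt, a model may occasionally wrap the JSON in a markdown code fence despite the instruction. A small pre-parse step makes json.loads more forgiving. This is a sketch under that assumption; strip_code_fence and parse_json_lenient are hypothetical helper names.

```python
import json

FENCE = "`" * 3  # markdown code fence marker


def strip_code_fence(raw: str) -> str:
    """Remove a surrounding markdown code fence, if present."""
    text = raw.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        lines = text.splitlines()
        # Drop the opening fence line (it may carry a language tag) and the closing fence.
        text = "\n".join(lines[1:-1])
    return text


def parse_json_lenient(raw: str) -> dict:
    """Parse JSON that may or may not be wrapped in a code fence."""
    return json.loads(strip_code_fence(raw))
```

Logging how often the fence had to be stripped is a useful signal that the prompt instruction is not being followed consistently.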

Regression Testing for Schema Changes

When you update your output schema, you need to verify the model still produces valid outputs. Create a golden test set:

import json
from pathlib import Path

import jsonschema
import pytest

GOLDEN_INPUTS_FILE = Path("tests/fixtures/product_extraction_inputs.json")
GOLDEN_OUTPUTS_FILE = Path("tests/fixtures/product_extraction_expected.json")

@pytest.mark.llm
def test_schema_regression():
    """Run the same inputs as before — check schema still holds"""
    inputs = json.loads(GOLDEN_INPUTS_FILE.read_text())
    
    for test_case in inputs:
        result = extract_product_info(test_case["input"])
        
        try:
            validate_product_schema(result)
        except jsonschema.ValidationError as e:
            pytest.fail(f"Schema regression for input '{test_case['id']}': {e.message}")

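# Generate the golden dataset: inputs in one file, recorded outputs in the other
def generate_golden_dataset():
    test_inputs = [
        {"id": "basic_product", "input": "Nike shoes $99"},
        {"id": "european_price", "input": "Watch €299 EUR"},
        {"id": "out_of_stock", "input": "Sold out - vintage lamp $45"},
    ]

    # Record the current model output for each input, keyed by test case id
    expected = {
        item["id"]: extract_product_info(item["input"]) for item in test_inputs
    }

    GOLDEN_INPUTS_FILE.write_text(json.dumps(test_inputs, indent=2))
    GOLDEN_OUTPUTS_FILE.write_text(json.dumps(expected, indent=2))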
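
Recorded outputs are most useful when compared field by field, because free-text fields such as name or category may legitimately vary between runs. The sketch below compares only fields assumed to be stable. The function name diff_against_golden and the default field list are illustrative choices, not something the fixtures above require.

```python
def diff_against_golden(
    expected: dict,
    actual: dict,
    exact_fields: tuple[str, ...] = ("price", "currency", "in_stock"),
) -> dict:
    """Return {field: (expected, actual)} for each stable field whose value changed."""
    return {
        field: (expected.get(field), actual.get(field))
        for field in exact_fields
        if expected.get(field) != actual.get(field)
    }
```

A regression test can then assert that the diff is empty for every golden case and print the returned mapping when it is not.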

CI Configuration for Structured Output Tests

# .github/workflows/llm-tests.yml
name: LLM Structured Output Tests

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 9 * * 1'  # Weekly on Monday

jobs:
  structured-output-tests:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install pytest pytest-json-report openai instructor pydantic jsonschema
      - run: pytest tests/llm/ -v --tb=short --json-report --json-report-file=test-results.json
        continue-on-error: true  # Don't block deploys on LLM flakiness
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: llm-test-results
          path: test-results.json
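
The tests in this article use a custom llm marker, which pytest warns about unless it is registered. A conftest.py along these lines registers the marker and skips live-API tests when no key is available, for example on a contributor's machine. It is a sketch that assumes OPENAI_API_KEY is the only credential the marked tests need.

```python
# conftest.py
import os

import pytest


def pytest_configure(config):
    # Register the custom marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "llm: test calls a live LLM API")


def pytest_collection_modifyitems(config, items):
    # With a key present, run everything as normal.
    if os.environ.get("OPENAI_API_KEY"):
        return
    skip_llm = pytest.mark.skip(reason="OPENAI_API_KEY not set")
    for item in items:
        if "llm" in item.keywords:
            item.add_marker(skip_llm)
```

With the marker registered, pytest -m "not llm" runs only the offline tests and pytest -m llm runs only the live ones.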

Production Validation

Add runtime validation to catch structured output failures in production:

import json
import logging

from pydantic import ValidationError

logger = logging.getLogger(__name__)

def safe_extract_product(description: str) -> ProductInfo | None:
    """Extract with runtime validation and fallback"""
    try:
        raw = extract_product_info(description)
        return ProductInfo(**raw)
    except (json.JSONDecodeError, ValidationError) as e:
        logger.error(
            "Structured output validation failed",
            extra={
                "error": str(e),
                "input_length": len(description),
                "model": "gpt-4o"
            }
        )
        return None
    except Exception as e:
        logger.error("Unexpected structured output failure", exc_info=True)
        return None

Track your validation failure rate. As a rule of thumb, a sustained rate above 1-2% suggests a prompt or model issue that needs investigation.
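
One lightweight way to track that rate is a rolling window over recent calls. The class below is a minimal in-process sketch; FailureRateTracker is a hypothetical name, and the default window size and 2% threshold are assumptions to tune. In a real deployment the same numbers would more likely be emitted to a metrics system.

```python
from collections import deque


class FailureRateTracker:
    """Rolling validation failure rate over the most recent `window` calls."""

    def __init__(self, window: int = 1000, threshold: float = 0.02):
        self._results = deque(maxlen=window)  # True = valid, False = failed
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self._results.append(ok)

    @property
    def failure_rate(self) -> float:
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def should_alert(self) -> bool:
        return self.failure_rate > self.threshold
```

Calling record(result is not None) after each safe_extract_product call and checking should_alert() periodically is enough to notice when a prompt change or model update pushes the rate up.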

Key Takeaways

Structured outputs reduce JSON parsing failures but don't eliminate them — and they don't make your app's logic correct. Test at three levels: schema compliance (is it valid JSON matching your schema?), semantic correctness (are the values meaningful?), and edge case behavior (what happens with empty/unusual inputs?). Use Pydantic validators to encode business rules, not just type constraints. Add runtime validation in production and track your failure rate over time.