AI Testing

LLM Output Schema Validation: Enforcing Structured JSON from AI Models

HelpMeTest

22 May 2026 — 7 min read

The most common source of production failures in LLM applications isn't a bad answer; it's a bad format. Your code expects {"sentiment": "positive", "score": 0.87} and the model returns "The sentiment is positive with a confidence of 87%." JSON.parse throws, and unless that error is handled, your application crashes.

Structured output validation is the unsexy but critical layer between your LLM and the rest of your application. This guide covers the tools and patterns to make LLM output reliable.

Why LLMs Struggle With Structured Output

LLMs are trained to generate natural language, not structured data. Even with explicit instructions, they can fail in several ways:

Markdown wrapping: ```json\n{"key": "value"}\n``` instead of raw JSON
Trailing text: Valid JSON followed by an explanation
Number formatting: "score": 0.87 becomes "score": "0.87" (string, not float)
Missing fields: Omitting optional-but-expected fields
Extra fields: Adding keys not in your schema
Hallucinated enum values: "status": "completed" when valid values are ["done", "pending"]
Nested structure errors: Putting arrays where objects are expected

The frequency of these errors depends on model capability and prompt quality. Even frontier models fail 2-10% of the time on complex schemas without constrained generation.
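To see which of these failure modes you are actually hitting, it can help to label raw responses before attempting any repair. The following is a minimal sketch using only the standard library; the function name classify_output and its labels are illustrative assumptions, not part of any library.

```python
import json


def classify_output(raw: str) -> str:
    """Return a coarse label for how a raw model response relates to JSON."""
    text = raw.strip()
    try:
        json.loads(text)
        return "valid_json"
    except json.JSONDecodeError:
        pass

    if text.startswith("```"):
        return "markdown_wrapped"

    if text.startswith(("{", "[")):
        try:
            # raw_decode parses one leading JSON value and tolerates what follows
            json.JSONDecoder().raw_decode(text)
        except json.JSONDecodeError:
            return "malformed_json"
        # json.loads failed but a leading value parsed, so extra text follows it
        return "trailing_text"

    return "prose"


if __name__ == "__main__":
    samples = [
        '{"sentiment": "positive", "score": 0.87}',
        '```json\n{"key": "value"}\n```',
        '{"key": "value"}\n\nHere is the extracted data.',
        "The sentiment is positive with a confidence of 87%.",
    ]
    for sample in samples:
        print(classify_output(sample))
```

This only catches syntactic problems. Wrong types, missing fields, and invalid enum values parse as valid JSON and need the schema checks described in the later levels.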

Level 1: Prompt Engineering for Structure

Before implementing validation infrastructure, optimize your prompt for reliable output.

Be explicit about format:

EXTRACTION_PROMPT = """Extract the following information from the support ticket.

Return ONLY valid JSON matching this exact schema. No markdown, no explanation, no trailing text.

Schema:
{
  "category": string,           // One of: "billing", "technical", "account", "general"
  "priority": string,           // One of: "low", "medium", "high", "critical"
  "summary": string,            // 1-2 sentence summary
  "extracted_data": {
    "product_mentioned": string | null,
    "error_code": string | null,
    "account_id": string | null
  },
  "requires_escalation": boolean
}

Ticket:
{ticket_content}

JSON response:"""

Use few-shot examples:

EXTRACTION_PROMPT_FEW_SHOT = """Extract structured data from support tickets.

Example input:
"My account #ACC-12345 shows an error code ERR-502 when I try to upgrade my billing plan."

Example output:
{"category": "billing", "priority": "medium", "summary": "Account upgrade failing with ERR-502 error", "extracted_data": {"product_mentioned": null, "error_code": "ERR-502", "account_id": "ACC-12345"}, "requires_escalation": false}

Now extract from this ticket:
{ticket_content}

JSON:"""

Few-shot examples improve format compliance by 20-40% compared to schema-only prompts.
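One practical wrinkle with templates like these: they mix literal JSON braces with a {ticket_content} placeholder, so str.format() would try to interpret the schema braces as replacement fields. A minimal sketch of one way around that, assuming a hypothetical render_prompt helper and a shortened stand-in template:

```python
# Shortened stand-in for the prompts above; the real templates are longer.
PROMPT_TEMPLATE = """Return ONLY valid JSON matching this schema:
{"category": string, "requires_escalation": boolean}

Ticket:
{ticket_content}

JSON response:"""


def render_prompt(template: str, ticket_content: str) -> str:
    """Substitute the single placeholder literally, leaving other braces alone."""
    return template.replace("{ticket_content}", ticket_content)


prompt = render_prompt(PROMPT_TEMPLATE, "Card declined on ACC-1")

if __name__ == "__main__":
    print(prompt)
```

The alternative is to double every literal brace in the template ({{ and }}) and keep using str.format(), which works but makes the schema harder to read.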

Level 2: JSON Schema Validation

Validate extracted JSON against a schema before using it:

import json
import jsonschema
from jsonschema import validate, ValidationError

RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["category", "priority", "summary", "extracted_data", "requires_escalation"],
    "properties": {
        "category": {
            "type": "string",
            "enum": ["billing", "technical", "account", "general"]
        },
        "priority": {
            "type": "string",
            "enum": ["low", "medium", "high", "critical"]
        },
        "summary": {
            "type": "string",
            "minLength": 10,
            "maxLength": 500
        },
        "extracted_data": {
            "type": "object",
            "required": ["product_mentioned", "error_code", "account_id"],
            "properties": {
                "product_mentioned": {"type": ["string", "null"]},
                "error_code": {"type": ["string", "null"]},
                "account_id": {"type": ["string", "null"]}
            }
        },
        "requires_escalation": {"type": "boolean"}
    },
    "additionalProperties": False
}

def extract_and_validate(llm_response: str) -> dict:
    # Clean common LLM formatting issues
    cleaned = clean_llm_json(llm_response)
    
    # Parse JSON
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM returned invalid JSON: {e}\nRaw response: {llm_response[:500]}")
    
    # Validate against schema
    try:
        validate(instance=data, schema=RESPONSE_SCHEMA)
    except ValidationError as e:
        raise ValueError(f"LLM response failed schema validation: {e.message}")
    
    return data

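def clean_llm_json(text: str) -> str:
    """Strip common LLM formatting artifacts from JSON responses."""
    text = text.strip()
    
    # Remove markdown code fences
    if text.startswith('```'):
        lines = text.split('\n')
        # Drop the opening fence line (```json or ```)
        if len(lines) > 1:
            lines = lines[1:]
        # Drop the closing fence only if it is really there
        if lines and lines[-1].strip() == '```':
            lines = lines[:-1]
        text = '\n'.join(lines)
    
    # Find whichever opens first, an object or an array, so that a
    # top-level array of objects is not truncated to its inner braces
    candidates = [i for i in (text.find('{'), text.find('[')) if i != -1]
    if candidates:
        start = min(candidates)
        closer = '}' if text[start] == '{' else ']'
        end = text.rfind(closer)
        if end > start:
            text = text[start:end+1]
    
    return text.strip()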
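String slicing between the first and last bracket can still be fooled, for example when the trailing explanation itself contains a brace. An alternative sketch lets the standard library's decoder find where the first JSON value ends; extract_first_json is a hypothetical helper name, and the approach assumes the first parseable object or array in the response is the one you want.

```python
import json


def extract_first_json(text: str):
    """Return the first JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one value starting at index and ignores the rest
            value, _end = decoder.raw_decode(text, index)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from here; try the next bracket
    raise ValueError("No JSON object or array found in response")


if __name__ == "__main__":
    print(extract_first_json('```json\n{"key": "value"}\n```'))
    print(extract_first_json('Result: [{"a": 1}, {"b": 2}] (two items)'))
```

Unlike clean_llm_json, this returns the parsed value rather than a string, so it replaces the json.loads step as well.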

Level 3: Pydantic Models

Pydantic provides schema validation, type coercion, and clean Python objects in one package:

from pydantic import BaseModel, field_validator, model_validator
from typing import Literal
from enum import Enum

class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class ExtractedData(BaseModel):
    product_mentioned: str | None = None
    error_code: str | None = None
    account_id: str | None = None

class TicketExtraction(BaseModel):
    category: Literal["billing", "technical", "account", "general"]
    priority: Priority
    summary: str
    extracted_data: ExtractedData
    requires_escalation: bool
    
    @field_validator('summary')
    @classmethod
    def summary_not_empty(cls, v):
        if len(v.strip()) < 10:
            raise ValueError('Summary too short — must be at least 10 characters')
        return v.strip()
    
    @model_validator(mode='after')
    def escalation_requires_high_priority(self):
        if self.requires_escalation and self.priority not in (Priority.high, Priority.critical):
            raise ValueError('Escalated tickets must be high or critical priority')
        return self

def parse_ticket_extraction(llm_response: str) -> TicketExtraction:
    cleaned = clean_llm_json(llm_response)
    data = json.loads(cleaned)
    return TicketExtraction.model_validate(data)

# Usage
try:
    result = parse_ticket_extraction(llm_response)
    print(f"Priority: {result.priority.value}")  # .value is the plain string on every Python version
    print(f"Needs escalation: {result.requires_escalation}")
except (json.JSONDecodeError, ValueError) as e:
    # Handle validation failure
    handle_parse_failure(llm_response, e)

Level 4: Constrained Generation

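For maximum reliability, constrain the output at generation time instead of repairing it afterwards. Strict constrained generation forces the model to sample only tokens that conform to your schema; the library and API options below approximate this to different degrees.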
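To make "at the token sampling level" concrete, here is a toy sketch of the core idea: before picking the next token, every token the schema does not allow at this position has its score masked out. The token strings and scores are invented for illustration; real implementations work on full vocabularies and derive the allowed set from a grammar or schema.

```python
import math


def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token among those the schema allows next."""
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("No allowed token has a score")
    return best


# Invented scores: unconstrained, the model would prefer "Sure" or "urgent"
TOY_LOGITS = {"Sure": 4.0, '"urgent"': 3.1, '"high"': 2.4, '"low"': 0.2}
PRIORITY_TOKENS = {'"low"', '"medium"', '"high"', '"critical"'}

if __name__ == "__main__":
    print(constrained_pick(TOY_LOGITS, PRIORITY_TOKENS))  # "high"
```

Because disallowed tokens can never be selected, an invalid enum value or a conversational preamble is ruled out by construction rather than caught after the fact.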

Using Instructor

Instructor wraps the Anthropic API to return Pydantic models directly:

from typing import Literal

import instructor
import anthropic
from pydantic import BaseModel

# Patch the Anthropic client
client = instructor.from_anthropic(anthropic.Anthropic())

class TicketExtraction(BaseModel):
    category: Literal["billing", "technical", "account", "general"]
    priority: Literal["low", "medium", "high", "critical"]
    summary: str
    requires_escalation: bool

def extract_ticket(ticket_content: str) -> TicketExtraction:
    result = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Extract structured data from this support ticket:\n\n{ticket_content}"
        }],
        response_model=TicketExtraction,  # Instructor handles the rest
    )
    return result  # Already a TicketExtraction instance

ticket = extract_ticket("My billing failed with error ERR-402 on account ACC-789")
print(f"Category: {ticket.category}")  # e.g. billing
print(f"Priority: {ticket.priority}")  # e.g. medium (model-dependent)

Instructor handles schema injection and response parsing automatically, and can re-ask the model after a validation failure when you pass max_retries.

Using Outlines (Open-Source Models)

For self-hosted or open-source models, Outlines constrains generation at the logit level:

from typing import Literal

import outlines
import outlines.models as models
from pydantic import BaseModel

# Note: this uses the pre-1.0 Outlines interface (outlines.generate.json);
# check the API of the version you have installed.
model = models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

class TicketExtraction(BaseModel):
    category: Literal["billing", "technical", "account", "general"]
    priority: Literal["low", "medium", "high", "critical"]
    requires_escalation: bool

# Constrained generation: the model cannot emit JSON that violates the schema
generator = outlines.generate.json(model, TicketExtraction)

result = generator(
    "Extract data from: 'My billing failed on account ACC-789'",
)
print(result)  # A TicketExtraction instance

Claude's Tool Use as Structured Output

Claude's tool use (function calling) forces structured output naturally:

import anthropic
import json

client = anthropic.Anthropic()

tools = [{
    "name": "extract_ticket_data",
    "description": "Extract structured data from a support ticket",
    "input_schema": {
        "type": "object",
        "required": ["category", "priority", "summary", "requires_escalation"],
        "properties": {
            "category": {
                "type": "string",
                "enum": ["billing", "technical", "account", "general"],
                "description": "Ticket category"
            },
            "priority": {
                "type": "string",
                "enum": ["low", "medium", "high", "critical"]
            },
            "summary": {
                "type": "string",
                "description": "1-2 sentence summary of the issue"
            },
            "requires_escalation": {
                "type": "boolean"
            }
        }
    }
}]

def extract_with_tools(ticket_content: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        tools=tools,
        tool_choice={"type": "any"},  # Force tool use
        messages=[{
            "role": "user",
            "content": f"Extract data from this ticket:\n\n{ticket_content}"
        }]
    )
    
    # Extract tool use block
    for block in response.content:
        if block.type == "tool_use":
            return block.input  # Already parsed JSON matching the schema
    
    raise ValueError("Model did not use the extraction tool")

result = extract_with_tools("Billing failed with ERR-402")
print(result)  # e.g. {"category": "billing", "priority": "medium", ...}

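Tool use with tool_choice set to {"type": "any"} forces Claude to respond with a tool call, so the output always arrives as a parsed JSON object rather than free text. Still validate block.input against your schema before using it, because the model can omit a field or pick a value outside an enum. For Claude-based applications, this is the most reliable of the options shown here.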
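The tool input arrives as a parsed dict, but it is safer to check it than to assume it satisfies every enum and required field. Below is a minimal standard-library sketch of such a check for the extract_ticket_data schema above; check_tool_input is a hypothetical helper, and in a real application the JSON Schema or Pydantic validation from the earlier levels would do this job more thoroughly.

```python
ALLOWED_VALUES = {
    "category": {"billing", "technical", "account", "general"},
    "priority": {"low", "medium", "high", "critical"},
}


def check_tool_input(data: dict) -> list[str]:
    """Return a list of problems found in a tool-call input; empty means OK."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], str) or data[field] not in allowed:
            problems.append(f"{field}: unexpected value {data[field]!r}")
    if not isinstance(data.get("summary"), str):
        problems.append("summary: expected a string")
    if not isinstance(data.get("requires_escalation"), bool):
        problems.append("requires_escalation: expected a boolean")
    return problems


if __name__ == "__main__":
    sample = {
        "category": "billing",
        "priority": "urgent",  # not in the enum
        "summary": "Billing failed with ERR-402",
        "requires_escalation": False,
    }
    print(check_tool_input(sample))
```

Returning a list of problems instead of raising on the first one makes the result easy to log or to feed back to the model in a retry.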

Level 5: Retry with Error Feedback

When validation fails, retry with the error message as context:

import json
from typing import TypeVar, Type

import anthropic
from pydantic import BaseModel

T = TypeVar('T', bound=BaseModel)

async def extract_with_retry(
    prompt: str,
    response_model: Type[T],
    model: str = "claude-opus-4-6",
    max_retries: int = 3
) -> T:
    # AsyncAnthropic returns awaitables, so the call does not block the event loop
    client = anthropic.AsyncAnthropic()
    
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    
    for attempt in range(max_retries):
        response = await client.messages.create(
            model=model,
            max_tokens=1024,
            messages=messages
        )
        
        raw_text = response.content[0].text
        
        try:
            cleaned = clean_llm_json(raw_text)
            data = json.loads(cleaned)
            return response_model.model_validate(data)
        
        except (json.JSONDecodeError, ValueError) as e:
            last_error = str(e)
            
            if attempt < max_retries - 1:
                # Add error feedback to conversation
                messages.append({"role": "assistant", "content": raw_text})
                messages.append({
                    "role": "user", 
                    "content": f"Your response failed validation:\n{last_error}\n\n"
                               f"Please fix and return valid JSON only."
                })
    
    raise ValueError(f"Failed after {max_retries} attempts. Last error: {last_error}")

Retry with error feedback typically resolves 80-90% of initial failures on the first retry.
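The feedback loop itself can be exercised without any API calls by injecting the model as a plain callable. The sketch below is an assumption-laden simplification: extract_with_feedback, stub_model, and validate_priority are hypothetical names, the stub replays canned responses, and validation is a single hand-written check rather than a Pydantic model.

```python
import json
from typing import Callable


def extract_with_feedback(
    call_model: Callable[[list[dict]], str],
    prompt: str,
    validate: Callable[[dict], None],
    max_retries: int = 3,
) -> dict:
    """Call the model, validate, and re-ask with the error on failure."""
    messages = [{"role": "user", "content": prompt}]
    last_error = ""
    for _attempt in range(max_retries):
        raw = call_model(messages)
        try:
            data = json.loads(raw)
            validate(data)
            return data
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = str(exc)
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your response failed validation:\n{last_error}\n\n"
                           "Please fix and return valid JSON only.",
            })
    raise ValueError(f"Failed after {max_retries} attempts. Last error: {last_error}")


def validate_priority(data: dict) -> None:
    allowed = {"low", "medium", "high", "critical"}
    if data.get("priority") not in allowed:
        raise ValueError(f"priority must be one of {sorted(allowed)}")


# Stub model: first reply has an invalid enum value, second is valid
canned_replies = ['{"priority": "urgent"}', '{"priority": "high"}']
message_counts = []


def stub_model(messages: list[dict]) -> str:
    message_counts.append(len(messages))  # record conversation length per call
    return canned_replies.pop(0)


result = extract_with_feedback(stub_model, "Extract the priority.", validate_priority)

if __name__ == "__main__":
    print(result, message_counts)
```

The second call sees three messages (the original prompt, the failed answer, and the error feedback), which is the behaviour the retry logic depends on and a useful thing to assert in a unit test.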

Testing Your Validation Layer

Build a test suite for your extraction pipeline:

import json

import pytest
from your_app.extraction import extract_ticket, clean_llm_json, TicketExtraction

# Normal cases
def test_billing_ticket():
    result = extract_ticket(
        "My credit card charge of $99 failed this month. Account ID: ACC-12345"
    )
    assert result.category == "billing"
    assert result.extracted_data.account_id == "ACC-12345"
    assert isinstance(result.requires_escalation, bool)

def test_technical_ticket():
    result = extract_ticket("Getting ERR-502 when I try to save my settings")
    assert result.category == "technical"
    assert result.extracted_data.error_code == "ERR-502"

# Edge cases — common LLM failure modes
def test_handles_markdown_wrapped_json():
    """Ensure clean_llm_json strips markdown code blocks."""
    raw = '```json\n{"key": "value"}\n```'
    cleaned = clean_llm_json(raw)
    assert json.loads(cleaned) == {"key": "value"}

def test_handles_trailing_explanation():
    """Ensure parser ignores text after JSON."""
    raw = '{"key": "value"}\n\nThis JSON contains the extracted information.'
    cleaned = clean_llm_json(raw)
    assert json.loads(cleaned) == {"key": "value"}

# Adversarial inputs
def test_handles_empty_ticket():
    result = extract_ticket("")
    assert result.category in ["billing", "technical", "account", "general"]
    # Should still return a valid structure, even if summary is generic

def test_handles_very_long_ticket():
    long_ticket = "error " * 2000  # Very long input
    result = extract_ticket(long_ticket)
    assert isinstance(result, TicketExtraction)
    assert len(result.summary) <= 500  # Schema constraint holds

def test_invalid_enum_value_triggers_retry(mocker):
    """If model returns invalid enum, retry should fix it."""
    mock_responses = [
        '{"category": "payment", "priority": "urgent", ...}',  # Invalid enum values
        '{"category": "billing", "priority": "high", ...}'      # Valid on retry
    ]
    # ... mock setup

Production Metrics

Track these metrics for your extraction pipeline:

from prometheus_client import Counter, Histogram

extraction_attempts = Counter(
    'llm_extraction_attempts_total',
    'Total extraction attempts',
    ['model', 'endpoint']
)
extraction_failures = Counter(
    'llm_extraction_failures_total',
    'Extraction failures after all retries',
    ['model', 'endpoint', 'failure_type']
)
retry_count = Histogram(
    'llm_extraction_retries',
    'Number of retries per extraction',
    buckets=[0, 1, 2, 3]
)

Alert when:

Failure rate > 2% (after retries)
Mean retry count > 0.5 (model is consistently struggling with schema)
Specific field validation errors spike (indicates schema drift or model regression)
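The first two thresholds are simple ratios over a time window. In practice they would usually be expressed as alerting rules in your monitoring system; the sketch below only illustrates the arithmetic, using a hypothetical WindowStats record and evaluate_alerts function with the thresholds from the list above as defaults.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    attempts: int        # extraction attempts in the window
    failures: int        # attempts that still failed after all retries
    total_retries: int   # retries summed over all attempts


def evaluate_alerts(
    stats: WindowStats,
    max_failure_rate: float = 0.02,
    max_mean_retries: float = 0.5,
) -> list[str]:
    """Return the names of the alert conditions that fire for this window."""
    if stats.attempts == 0:
        return []  # no traffic, nothing to evaluate
    fired = []
    if stats.failures / stats.attempts > max_failure_rate:
        fired.append("failure_rate")
    if stats.total_retries / stats.attempts > max_mean_retries:
        fired.append("mean_retries")
    return fired


if __name__ == "__main__":
    # 30 failures in 1000 attempts is 3%, above the 2% threshold
    print(evaluate_alerts(WindowStats(attempts=1000, failures=30, total_retries=200)))
```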

Summary

Approach	Reliability	Complexity	Best For
Prompt engineering	85-95%	Low	Simple schemas
JSON Schema validation + retry	95-99%	Medium	Most applications
Pydantic + retry	96-99%	Medium	Python applications
Tool use (Claude)	99%+	Low	Claude-based apps
Constrained generation	99.9%+	High	Open-source models, critical paths

Start with tool use if you're on Claude, Instructor if you need model flexibility, or JSON Schema validation + retry for the simplest setup. Add constrained generation only if you have reliability requirements that simpler approaches can't meet.

The key insight: don't fight LLM non-determinism — constrain it. The model doesn't need to choose how to format the output; your code should make format errors structurally impossible.