LLM Output Schema Validation: Enforcing Structured JSON from AI Models
The most common source of production failures in LLM applications isn't a bad answer; it's a bad format. Your code expects {"sentiment": "positive", "score": 0.87} and the model returns "The sentiment is positive with a confidence of 87%." The JSON parser throws, and your application crashes.
Structured output validation is the unsexy but critical layer between your LLM and the rest of your application. This guide covers the tools and patterns to make LLM output reliable.
Why LLMs Struggle With Structured Output
LLMs are trained to generate natural language, not structured data. Even with explicit instructions, they can fail in several ways:
- Markdown wrapping: ```json\n{"key": "value"}\n``` instead of raw JSON
- Trailing text: Valid JSON followed by an explanation
- Number formatting: "score": 0.87 becomes "score": "0.87" (string, not float)
- Missing fields: Omitting optional-but-expected fields
- Extra fields: Adding keys not in your schema
- Hallucinated enum values: "status": "completed" when valid values are ["done", "pending"]
- Nested structure errors: Putting arrays where objects are expected
The frequency of these errors depends on model capability and prompt quality. Even frontier models fail 2-10% of the time on complex schemas without constrained generation.
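The failure modes listed above can be told apart mechanically, which is useful when logging why a response was rejected. The sketch below is illustrative only: `diagnose_output`, its labels, and the `expected_types` argument are hypothetical names, not part of any library, and it only inspects top-level keys.

```python
import json


def diagnose_output(raw: str, expected_types: dict[str, type]) -> list[str]:
    """Return labels for the format problems found in a raw LLM response.

    `expected_types` maps each expected top-level key to its Python type.
    """
    text = raw.strip()
    problems = []

    if text.startswith("```"):
        problems.append("markdown_wrapping")

    # Work on the outermost {...} span so the remaining checks can still run.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return problems + ["no_json_object"]
    # Anything left after the object (other than a closing fence) is trailing text.
    if text[end + 1:].strip().strip("`"):
        problems.append("trailing_text")

    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return problems + ["invalid_json"]

    for key, expected in expected_types.items():
        if key not in data:
            problems.append(f"missing_field:{key}")
        elif not isinstance(data[key], expected):
            problems.append(f"wrong_type:{key}")
    for key in data:
        if key not in expected_types:
            problems.append(f"extra_field:{key}")
    return problems


SENTIMENT_TYPES = {"sentiment": str, "score": float}
wrapped = '```json\n{"sentiment": "positive", "score": "0.87"}\n```'
print(diagnose_output(wrapped, SENTIMENT_TYPES))
```

Counting these labels per model and prompt version gives a rough picture of which failure mode dominates before you decide how much validation machinery to build.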
Level 1: Prompt Engineering for Structure
Before implementing validation infrastructure, optimize your prompt for reliable output.
Be explicit about format:
EXTRACTION_PROMPT = """Extract the following information from the support ticket.
Return ONLY valid JSON matching this exact schema. No markdown, no explanation, no trailing text.
Schema:
{
  "category": string, // One of: "billing", "technical", "account", "general"
  "priority": string, // One of: "low", "medium", "high", "critical"
  "summary": string, // 1-2 sentence summary
  "extracted_data": {
    "product_mentioned": string | null,
    "error_code": string | null,
    "account_id": string | null
  },
  "requires_escalation": boolean
}
Ticket:
{ticket_content}
JSON response:"""
Use few-shot examples:
EXTRACTION_PROMPT_FEW_SHOT = """Extract structured data from support tickets.
Example input:
"My account #ACC-12345 shows an error code ERR-502 when I try to upgrade my billing plan."
Example output:
{"category": "billing", "priority": "medium", "summary": "Account upgrade failing with ERR-502 error", "extracted_data": {"product_mentioned": null, "error_code": "ERR-502", "account_id": "ACC-12345"}, "requires_escalation": false}
Now extract from this ticket:
{ticket_content}
JSON:"""
Few-shot examples improve format compliance by 20-40% compared to schema-only prompts.
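One practical detail when filling in prompt templates like these: they contain literal JSON braces next to the {ticket_content} placeholder, so str.format will try to interpret the schema as format fields. A minimal sketch of one way around this, assuming a plain string replacement is acceptable; `render_prompt`, `format_fails`, and the shortened `PROMPT_TEMPLATE` are hypothetical names used only for illustration:

```python
PROMPT_TEMPLATE = """Return ONLY valid JSON matching this schema.
Schema:
{"category": string, "requires_escalation": boolean}
Ticket:
{ticket_content}
JSON response:"""


def render_prompt(template: str, ticket_content: str) -> str:
    """Substitute only the {ticket_content} placeholder, leaving other braces alone."""
    return template.replace("{ticket_content}", ticket_content)


def format_fails(template: str) -> bool:
    """Report whether str.format rejects the template because of literal braces."""
    try:
        template.format(ticket_content="x")
    except (KeyError, ValueError, IndexError):
        return True
    return False


rendered = render_prompt(PROMPT_TEMPLATE, "Card declined {twice} on ACC-789")
print(format_fails(PROMPT_TEMPLATE))
print(rendered)
```

Replacing a single known placeholder also means braces inside the ticket text itself pass through untouched.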
Level 2: JSON Schema Validation
Validate extracted JSON against a schema before using it:
import json
from jsonschema import validate, ValidationError

RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["category", "priority", "summary", "extracted_data", "requires_escalation"],
    "properties": {
        "category": {
            "type": "string",
            "enum": ["billing", "technical", "account", "general"]
        },
        "priority": {
            "type": "string",
            "enum": ["low", "medium", "high", "critical"]
        },
        "summary": {
            "type": "string",
            "minLength": 10,
            "maxLength": 500
        },
        "extracted_data": {
            "type": "object",
            "required": ["product_mentioned", "error_code", "account_id"],
            "properties": {
                "product_mentioned": {"type": ["string", "null"]},
                "error_code": {"type": ["string", "null"]},
                "account_id": {"type": ["string", "null"]}
            }
        },
        "requires_escalation": {"type": "boolean"}
    },
    "additionalProperties": False
}

def extract_and_validate(llm_response: str) -> dict:
    # Clean common LLM formatting issues
    cleaned = clean_llm_json(llm_response)
    # Parse JSON
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM returned invalid JSON: {e}\nRaw response: {llm_response[:500]}") from e
    # Validate against schema
    try:
        validate(instance=data, schema=RESPONSE_SCHEMA)
    except ValidationError as e:
        raise ValueError(f"LLM response failed schema validation: {e.message}") from e
    return data

def clean_llm_json(text: str) -> str:
    """Strip common LLM formatting artifacts from JSON responses."""
    text = text.strip()
    # Remove markdown code fences
    if text.startswith('```'):
        lines = text.split('\n')[1:]  # Drop the opening fence line (```json or ```)
        if lines and lines[-1].strip() == '```':
            lines = lines[:-1]  # Drop the closing fence only if it is the last line
        text = '\n'.join(lines)
    # Find the outermost JSON object or array, whichever opens first,
    # so that an array of objects is not truncated to its first object
    starts = [i for i in (text.find('{'), text.find('[')) if i != -1]
    if starts:
        start = min(starts)
        closer = '}' if text[start] == '{' else ']'
        end = text.rfind(closer)
        if end > start:
            text = text[start:end + 1]
    return text.strip()
Level 3: Pydantic Models
Pydantic provides schema validation, type coercion, and clean Python objects in one package:
import json
from enum import Enum
from typing import Literal

from pydantic import BaseModel, field_validator, model_validator

class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class ExtractedData(BaseModel):
    product_mentioned: str | None = None
    error_code: str | None = None
    account_id: str | None = None

class TicketExtraction(BaseModel):
    category: Literal["billing", "technical", "account", "general"]
    priority: Priority
    summary: str
    extracted_data: ExtractedData
    requires_escalation: bool

    @field_validator('summary')
    @classmethod
    def summary_not_empty(cls, v):
        if len(v.strip()) < 10:
            raise ValueError('Summary too short: must be at least 10 characters')
        return v.strip()

    @model_validator(mode='after')
    def escalation_requires_high_priority(self):
        # Cross-field rule: escalation only makes sense for high or critical tickets
        if self.requires_escalation and self.priority not in (Priority.high, Priority.critical):
            raise ValueError('Escalated tickets must be high or critical priority')
        return self

def parse_ticket_extraction(llm_response: str) -> TicketExtraction:
    cleaned = clean_llm_json(llm_response)  # Defined in the Level 2 example
    data = json.loads(cleaned)
    return TicketExtraction.model_validate(data)

# Usage
try:
    result = parse_ticket_extraction(llm_response)
    # Use .value so the output is "high" rather than "Priority.high" on every Python version
    print(f"Priority: {result.priority.value}")
    print(f"Needs escalation: {result.requires_escalation}")
except (json.JSONDecodeError, ValueError) as e:
    # Pydantic's ValidationError is a ValueError subclass, so it is handled here too.
    # handle_parse_failure is your own error handler, defined elsewhere.
    handle_parse_failure(llm_response, e)
Level 4: Constrained Generation
For maximum reliability, use constrained generation — forcing the model to output tokens that conform to your schema at the token sampling level.
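To make the idea of token-level constraints concrete, here is a toy sketch in plain Python. It assumes a hypothetical vocabulary of whole-word tokens and made-up logits; real libraries compile the schema into a state machine over the tokenizer's vocabulary, which this example does not attempt. `constrained_pick` is an illustrative name, not a library function.

```python
import math


def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
    """Greedy decoding step that ignores every token outside `allowed`."""
    # Mask disallowed tokens by setting their score to negative infinity.
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token is present in the vocabulary")
    return best


# Hypothetical scores for the token that follows `"priority": "`.
logits = {"urgent": 3.1, "high": 2.4, "low": 0.2, "banana": -1.0}
allowed_priorities = {"low", "medium", "high", "critical"}

unconstrained = max(logits, key=logits.get)
constrained = constrained_pick(logits, allowed_priorities)
print(unconstrained, constrained)
```

Here the model's preferred token is "urgent", which is not a valid enum value; after masking, the best remaining candidate is "high". The invalid value is never sampled, so there is nothing to repair afterwards.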
Using Instructor
Instructor wraps the Anthropic client so that calls return Pydantic models directly. It enforces the schema through tool calling, validation, and retries rather than through token-level constraints:
import instructor
import anthropic
from typing import Literal
from pydantic import BaseModel

# Patch the Anthropic client
client = instructor.from_anthropic(anthropic.Anthropic())

class TicketExtraction(BaseModel):
    category: Literal["billing", "technical", "account", "general"]
    priority: Literal["low", "medium", "high", "critical"]
    summary: str
    requires_escalation: bool

def extract_ticket(ticket_content: str) -> TicketExtraction:
    result = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Extract structured data from this support ticket:\n\n{ticket_content}"
        }],
        response_model=TicketExtraction,  # Instructor handles the rest
    )
    return result  # Already a TicketExtraction instance

ticket = extract_ticket("My billing failed with error ERR-402 on account ACC-789")
print(f"Category: {ticket.category}")  # e.g. billing
print(f"Priority: {ticket.priority}")  # e.g. medium (the exact value depends on the model)
Instructor handles retry logic, schema injection, and response parsing automatically.
Using Outlines (Open-Source Models)
For self-hosted or open-source models, Outlines constrains generation at the logit level:
import outlines
import outlines.models as models
from typing import Literal
from pydantic import BaseModel

# Written against the Outlines 0.x API; newer releases may expose different
# entry points, so check the version you have installed.
model = models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

class TicketExtraction(BaseModel):
    category: Literal["billing", "technical", "account", "general"]
    priority: Literal["low", "medium", "high", "critical"]
    requires_escalation: bool

# Constrained generation: the model cannot emit JSON that violates the schema
generator = outlines.generate.json(model, TicketExtraction)
result = generator(
    "Extract data from: 'My billing failed on account ACC-789'",
)
print(result)  # Guaranteed valid TicketExtraction
Claude's Tool Use as Structured Output
Claude's tool use (function calling) forces structured output naturally:
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "extract_ticket_data",
    "description": "Extract structured data from a support ticket",
    "input_schema": {
        "type": "object",
        "required": ["category", "priority", "summary", "requires_escalation"],
        "properties": {
            "category": {
                "type": "string",
                "enum": ["billing", "technical", "account", "general"],
                "description": "Ticket category"
            },
            "priority": {
                "type": "string",
                "enum": ["low", "medium", "high", "critical"]
            },
            "summary": {
                "type": "string",
                "description": "1-2 sentence summary of the issue"
            },
            "requires_escalation": {
                "type": "boolean"
            }
        }
    }
}]

def extract_with_tools(ticket_content: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        tools=tools,
        tool_choice={"type": "any"},  # Force tool use
        messages=[{
            "role": "user",
            "content": f"Extract data from this ticket:\n\n{ticket_content}"
        }]
    )
    # Extract tool use block
    for block in response.content:
        if block.type == "tool_use":
            return block.input  # Already parsed into a dict; normally matches the schema
    raise ValueError("Model did not use the extraction tool")

result = extract_with_tools("Billing failed with ERR-402")
print(result)  # {"category": "billing", "priority": "medium", ...}
Setting tool_choice to {"type": "any"} forces Claude to call one of the provided tools, so the result always arrives as parsed, structured input rather than free text. The input normally matches the schema, but it is still worth validating before use. This is the most reliable option for Claude-based applications.
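Because tool input arrives as an ordinary dict, a cheap defensive check before handing it to the rest of the application costs little. The sketch below is a deliberately small stand-in that covers only required keys, enums, and a few primitive types for a flat object schema; `check_tool_input` and `TICKET_SCHEMA` are hypothetical names, and in practice the jsonschema `validate` call from Level 2 is the fuller option.

```python
def check_tool_input(payload: dict, schema: dict) -> list[str]:
    """Check required keys, enum membership, and basic types for a flat object schema."""
    type_map = {"string": str, "boolean": bool, "object": dict}
    errors = []
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required field: {key}")
    for key, value in payload.items():
        spec = schema.get("properties", {}).get(key)
        if spec is None:
            errors.append(f"unexpected field: {key}")
            continue
        declared = spec.get("type")
        # Only single, known type names are checked in this sketch.
        expected = type_map.get(declared) if isinstance(declared, str) else None
        if expected is not None and not isinstance(value, expected):
            errors.append(f"wrong type for {key}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"invalid value for {key}: {value!r}")
    return errors


# Trimmed copy of the tool's input_schema, for illustration.
TICKET_SCHEMA = {
    "type": "object",
    "required": ["category", "priority", "summary", "requires_escalation"],
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "account", "general"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "summary": {"type": "string"},
        "requires_escalation": {"type": "boolean"},
    },
}

bad = {"category": "payment", "priority": "high", "requires_escalation": "yes"}
print(check_tool_input(bad, TICKET_SCHEMA))
```

An empty list means the payload passed; anything else can feed the same retry path used for free-text responses.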
Level 5: Retry with Error Feedback
When validation fails, retry with the error message as context:
import json
from typing import Type, TypeVar

import anthropic
from pydantic import BaseModel

T = TypeVar('T', bound=BaseModel)

async def extract_with_retry(
    prompt: str,
    response_model: Type[T],
    model: str = "claude-opus-4-6",
    max_retries: int = 3
) -> T:
    # Use the async client so the awaited call does not block the event loop
    client = anthropic.AsyncAnthropic()
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    for attempt in range(max_retries):
        response = await client.messages.create(
            model=model,
            max_tokens=1024,
            messages=messages
        )
        raw_text = response.content[0].text
        try:
            cleaned = clean_llm_json(raw_text)  # Defined in the Level 2 example
            data = json.loads(cleaned)
            return response_model.model_validate(data)
        except (json.JSONDecodeError, ValueError) as e:
            # Pydantic's ValidationError subclasses ValueError, so it is caught here too
            last_error = str(e)
            if attempt < max_retries - 1:
                # Add error feedback to conversation
                messages.append({"role": "assistant", "content": raw_text})
                messages.append({
                    "role": "user",
                    "content": f"Your response failed validation:\n{last_error}\n\n"
                               f"Please fix and return valid JSON only."
                })
    raise ValueError(f"Failed after {max_retries} attempts. Last error: {last_error}")
Retry with error feedback typically resolves 80-90% of initial failures on the first retry.
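The feedback loop can be exercised without any network access by swapping the API call for a scripted callable. This is a minimal sketch using only the standard library: `retry_with_feedback`, the `call_model` parameter, and the `required_keys` check are hypothetical simplifications standing in for the real client and Pydantic model.

```python
import json
from typing import Callable


def retry_with_feedback(
    call_model: Callable[[list[dict]], str],
    prompt: str,
    required_keys: set[str],
    max_retries: int = 3,
) -> tuple[dict, list[dict]]:
    """Call the model until its reply parses as a JSON object with the required keys.

    Returns the parsed data together with the final message history.
    """
    messages = [{"role": "user", "content": prompt}]
    last_error = ""
    for _ in range(max_retries):
        raw = call_model(messages)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - data.keys()
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data, messages
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = str(exc)
            # Show the model its own reply plus the validation error.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your response failed validation:\n{last_error}\n\n"
                           "Please fix and return valid JSON only.",
            })
    raise ValueError(f"Failed after {max_retries} attempts. Last error: {last_error}")


# Scripted stand-in for a real model: the first reply is prose, the second is valid JSON.
replies = iter(["The category is billing.", '{"category": "billing", "priority": "high"}'])
data, history = retry_with_feedback(
    lambda msgs: next(replies), "Extract the ticket.", {"category", "priority"}
)
print(data)
print(len(history))
```

Keeping the model call behind a plain callable like this also makes the retry path easy to unit test.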
Testing Your Validation Layer
Build a test suite for your extraction pipeline:
import json

import pytest
from your_app.extraction import extract_ticket, clean_llm_json, TicketExtraction

# Normal cases
def test_billing_ticket():
    result = extract_ticket(
        "My credit card charge of $99 failed this month. Account ID: ACC-12345"
    )
    assert result.category == "billing"
    assert result.extracted_data.account_id == "ACC-12345"
    assert isinstance(result.requires_escalation, bool)

def test_technical_ticket():
    result = extract_ticket("Getting ERR-502 when I try to save my settings")
    assert result.category == "technical"
    assert result.extracted_data.error_code == "ERR-502"

# Edge cases: common LLM failure modes
def test_handles_markdown_wrapped_json():
    """Ensure clean_llm_json strips markdown code blocks."""
    raw = '```json\n{"key": "value"}\n```'
    cleaned = clean_llm_json(raw)
    assert json.loads(cleaned) == {"key": "value"}

def test_handles_trailing_explanation():
    """Ensure parser ignores text after JSON."""
    raw = '{"key": "value"}\n\nThis JSON contains the extracted information.'
    cleaned = clean_llm_json(raw)
    assert json.loads(cleaned) == {"key": "value"}

# Adversarial inputs
def test_handles_empty_ticket():
    result = extract_ticket("")
    # Should still return a valid structure, even if summary is generic
    assert result.category in ["billing", "technical", "account", "general"]

def test_handles_very_long_ticket():
    long_ticket = "error " * 2000  # Very long input
    result = extract_ticket(long_ticket)
    assert isinstance(result, TicketExtraction)
    assert len(result.summary) <= 500  # Schema constraint holds

def test_invalid_enum_value_triggers_retry(mocker):
    """If model returns invalid enum, retry should fix it."""
    mock_responses = [
        '{"category": "payment", "priority": "urgent", ...}',  # Invalid enum values
        '{"category": "billing", "priority": "high", ...}'  # Valid on retry
    ]
    # ... mock setup
Production Metrics
Track these metrics for your extraction pipeline:
from prometheus_client import Counter, Histogram

extraction_attempts = Counter(
    'llm_extraction_attempts_total',
    'Total extraction attempts',
    ['model', 'endpoint']
)
extraction_failures = Counter(
    'llm_extraction_failures_total',
    'Extraction failures after all retries',
    ['model', 'endpoint', 'failure_type']
)
retry_count = Histogram(
    'llm_extraction_retries',
    'Number of retries per extraction',
    buckets=[0, 1, 2, 3]
)
Alert when:
- Failure rate > 2% (after retries)
- Mean retry count > 0.5 (model is consistently struggling with schema)
- Specific field validation errors spike (indicates schema drift or model regression)
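The first two alert conditions come down to simple ratios over a time window. The sketch below shows only that arithmetic; in a real deployment these would more likely be expressed as alerting rules in your monitoring system. `ExtractionStats` and `alerts` are hypothetical names, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class ExtractionStats:
    attempts: int       # total extraction requests in the window
    failures: int       # requests that still failed after all retries
    total_retries: int  # retries summed over all requests


def alerts(
    stats: ExtractionStats,
    max_failure_rate: float = 0.02,
    max_mean_retries: float = 0.5,
) -> list[str]:
    """Return the names of the alert conditions that currently fire."""
    if stats.attempts == 0:
        return []  # no traffic, nothing to evaluate
    fired = []
    if stats.failures / stats.attempts > max_failure_rate:
        fired.append("failure_rate")
    if stats.total_retries / stats.attempts > max_mean_retries:
        fired.append("mean_retries")
    return fired


print(alerts(ExtractionStats(attempts=1000, failures=30, total_retries=600)))
```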
Summary
| Approach | Reliability | Complexity | Best For |
|---|---|---|---|
| Prompt engineering | 85-95% | Low | Simple schemas |
| JSON Schema validation + retry | 95-99% | Medium | Most applications |
| Pydantic + retry | 96-99% | Medium | Python applications |
| Tool use (Claude) | 99%+ | Low | Claude-based apps |
| Constrained generation | 99.9%+ | High | Open-source models, critical paths |
Start with tool use if you're on Claude, Instructor if you need model flexibility, or JSON Schema validation + retry for the simplest setup. Add constrained generation only if you have reliability requirements that simpler approaches can't meet.
The key insight: don't fight LLM non-determinism — constrain it. The model doesn't need to choose how to format the output; your code should make format errors structurally impossible.