Synthetic Test Data Generation with LLMs: Edge Cases at Scale
Real test data is scarce, sensitive, and hard to maintain. LLMs can generate realistic synthetic test data at scale — covering edge cases that real datasets miss and eliminating PII concerns. This guide covers practical techniques for generating test data with LLMs: schema-based generation, edge case discovery, data augmentation, and validating that synthetic data matches real distribution properties.
Key Takeaways
Describe the schema, not the examples. Give the LLM your data schema and constraints, then let it generate examples — not the other way around. Seeding with real examples biases the LLM toward what you already have.
Prompt for failure modes, not happy paths. Ask the LLM to generate data that would break your system. This produces more valuable test data than asking for "typical examples."
Validate synthetic data against real data statistically. Distribution match tests catch when synthetic data drifts from realistic patterns — before those unrealistic inputs corrupt your test results.
Generate edge cases programmatically, not just with LLMs. Boundary values, encoding edge cases, and constraint violations are better enumerated systematically than generated by LLM prompting.
Track synthetic data provenance. Tag every synthetic sample with its generation prompt and model version so you can reproduce and audit it. Synthetic data that can't be explained is a liability.
Why Synthetic Test Data
Test data generation faces three constraints:
- Scarcity — you don't have enough real data covering the edge cases you need
- Privacy — real user data contains PII that can't safely be used in test environments
- Coverage — the long tail of rare inputs that cause bugs is underrepresented in real data
LLMs address all three. A well-prompted model can generate thousands of realistic synthetic records, covering rare cases, with no real-user PII.
The risk: synthetic data can be unrealistic in subtle ways — matching the surface form of real data but missing distributional properties that matter for testing. This guide addresses how to generate well and validate that it's good enough.
Schema-Based Generation
The most reliable approach: give the LLM a schema and generate from it:
from openai import OpenAI
from pydantic import BaseModel, ValidationError
from typing import Literal
import json

client = OpenAI()

class UserProfile(BaseModel):
    """Schema for user profile test data."""
    user_id: str
    email: str
    name: str
    plan: Literal["free", "pro", "enterprise"]
    test_count: int  # 0-10 for free; higher caps for paid plans (see prompt)
    is_verified: bool
    country_code: str  # ISO 3166-1 alpha-2

GENERATION_PROMPT = """
Generate {n} realistic user profile records.
Schema:
{schema}
Requirements:
- user_id: format "usr_" + 12 random alphanumeric characters
- email: realistic-looking; use domains like gmail.com, outlook.com, company.io; never real company domains
- name: realistic names from diverse cultures (not all Western names)
- plan: distribute as 70% free, 25% pro, 5% enterprise
- test_count: 0-10 for free users, 0-50 for pro, 0-500 for enterprise
- is_verified: 80% true for pro/enterprise, 60% for free
- country_code: diverse; include US, GB, DE, IN, BR, JP, NG, AU, CA, FR
Return only a JSON object of the form {{"profiles": [...]}}, no explanation.
"""

def generate_user_profiles(n: int = 100) -> list[UserProfile]:
    # Pydantic v2 API; on v1 use UserProfile.schema() instead
    schema_str = json.dumps(UserProfile.model_json_schema(), indent=2)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": GENERATION_PROMPT.format(n=n, schema=schema_str)
        }],
        temperature=0.8,  # Some randomness for diversity
        # JSON mode yields a top-level object, hence the "profiles" wrapper key in the prompt
        response_format={"type": "json_object"}
    )
    raw = json.loads(response.choices[0].message.content)
    profiles_data = raw if isinstance(raw, list) else raw.get("profiles", raw.get("data", []))
    validated = []
    for item in profiles_data:
        try:
            validated.append(UserProfile(**item))
        except (ValidationError, TypeError) as e:
            # Drop records that fail schema validation instead of aborting the batch
            print(f"Skipping invalid record: {e}")
    return validated

# Generate and save
profiles = generate_user_profiles(200)
print(f"Generated {len(profiles)} valid profiles")
with open("tests/fixtures/user_profiles.json", "w") as f:
    # Pydantic v2 API; on v1 use p.dict() instead
    json.dump([p.model_dump() for p in profiles], f, indent=2)

Structured Output for Reliable Parsing
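JSON mode typically hands back a single top-level object, so a list of records usually arrives wrapped under some key, and that key is not always the one the prompt asked for. The sketch below shows one defensive way to unwrap such a payload. The names `unwrap_records` and `preferred_keys` are illustrative assumptions, not part of any SDK.

```python
import json
from typing import Any


def unwrap_records(
    payload: Any,
    preferred_keys: tuple[str, ...] = ("profiles", "data", "items"),
) -> list[dict]:
    """Return the list of record dicts from a parsed JSON payload.

    Accepts a bare list, or an object that wraps the list under some key.
    Non-dict entries are dropped.
    """
    if isinstance(payload, list):
        return [r for r in payload if isinstance(r, dict)]
    if not isinstance(payload, dict):
        raise ValueError(f"Expected list or object, got {type(payload).__name__}")
    # Try the keys we asked for first, in order
    for key in preferred_keys:
        value = payload.get(key)
        if isinstance(value, list):
            return [r for r in value if isinstance(r, dict)]
    # Fall back to the only list-valued field, if there is exactly one
    list_values = [v for v in payload.values() if isinstance(v, list)]
    if len(list_values) == 1:
        return [r for r in list_values[0] if isinstance(r, dict)]
    raise ValueError("Could not find a unique list of records in the payload")


wrapped = json.loads('{"users": [{"user_id": "usr_abc"}, {"user_id": "usr_def"}]}')
print(len(unwrap_records(wrapped)))  # prints 2
```

Raising on an ambiguous payload is a deliberate choice here: silently returning an empty list would make a malformed response look like a successful run that produced no data.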
Use JSON mode or structured outputs to avoid parsing failures:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TestTransaction(BaseModel):
    id: str
    amount_cents: int
    currency: str
    status: str
    merchant: str
    card_last4: str
    timestamp: str

class TransactionBatch(BaseModel):
    transactions: list[TestTransaction]

def generate_transactions_typed(
    n: int,
    scenario: str = "normal"
) -> list[TestTransaction]:
    """Generate transactions with full type safety via structured outputs."""
    scenario_prompts = {
        "normal": "Generate typical e-commerce transactions",
        "edge_cases": "Generate transactions with unusual characteristics: very high amounts, tiny amounts ($0.01), foreign currencies, long merchant names, failed statuses",
        "fraud_patterns": "Generate transaction sequences that look like fraud patterns: rapid sequential charges, geographic impossibilities, round-number charges"
    }
    prompt = f"""
    {scenario_prompts[scenario]}.
    Requirements:
    - amount_cents: integer (avoid neat round numbers unless edge_cases)
    - currency: mostly USD, some EUR/GBP/JPY/BRL
    - status: completed (70%), pending (20%), failed (10%)
    - merchant: realistic merchant names (not real company names)
    - card_last4: 4 digit string "0000"-"9999"
    - timestamp: ISO 8601, within last 30 days
    Generate exactly {n} transactions.
    """
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format=TransactionBatch
    )
    return result.choices[0].message.parsed.transactions

Edge Case Discovery
Ask the LLM to find failure modes:
# Literal braces are doubled because this template is filled with str.format()
EDGE_CASE_PROMPT = """
You are testing a {system_description}.
Think about:
1. Input boundary conditions (empty, null, maximum length, minimum values)
2. Unusual but valid inputs that might trigger bugs
3. Inputs that violate assumptions developers commonly make
4. Internationalization edge cases (RTL text, emoji, unusual Unicode, very long names in some cultures)
5. Semantic edge cases (inputs that are grammatically valid but semantically unusual)
For the following function/feature: {feature_description}
Generate {n} test inputs that cover these edge cases. For each, explain why it might expose a bug.
Return a JSON object of the form: {{"cases": [{{"input": "...", "category": "...", "why_problematic": "..."}}]}}
"""

def discover_edge_cases(
    system_description: str,
    feature_description: str,
    n: int = 30
) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": EDGE_CASE_PROMPT.format(
                system_description=system_description,
                feature_description=feature_description,
                n=n
            )
        }],
        temperature=0.9,  # High temperature for more creative edge cases
        # JSON mode yields a top-level object, hence the "cases" wrapper key in the prompt
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result if isinstance(result, list) else result.get("cases", [])

# Example usage
edge_cases = discover_edge_cases(
    system_description="A customer support chatbot that handles billing inquiries",
    feature_description="A function that extracts the customer's account ID from free-text messages",
    n=25
)
for case in edge_cases[:5]:
    print(f"Input: {case['input']!r}")
    print(f"Category: {case['category']}")
    print(f"Why problematic: {case['why_problematic']}\n")

Programmatic Edge Case Generation
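One systematic technique for constraint violations is single-field mutation: start from one known-valid record and break exactly one field at a time, so a failing test points at a single cause. The following is a minimal sketch under assumptions. The base record mirrors the user profile fields used earlier in this guide, and both `INVALID_VALUES` and `single_field_mutations` are hypothetical names with illustrative, non-exhaustive values.

```python
from copy import deepcopy

# A record that satisfies every constraint (values are made up)
VALID_PROFILE = {
    "user_id": "usr_a1b2c3d4e5f6",
    "email": "test.user@example.com",
    "name": "Test User",
    "plan": "free",
    "test_count": 3,
    "is_verified": True,
    "country_code": "US",
}

# Values that should be rejected, or at least handled deliberately, per field
INVALID_VALUES = {
    "user_id": ["", "usr_", None],
    "email": ["", "no-at-sign", "a@", None],
    "plan": ["FREE", "premium", "", None],
    "test_count": [-1, 11, "3", None],  # 11 exceeds the free-plan cap of 10
    "country_code": ["", "USA", "us", None],
}


def single_field_mutations(base: dict, invalid_values: dict) -> list[dict]:
    """Return copies of `base`, each with exactly one field set to an invalid value."""
    cases = []
    for field, bad_values in invalid_values.items():
        for bad in bad_values:
            record = deepcopy(base)  # never mutate the shared base record
            record[field] = bad
            cases.append({"mutated_field": field, "record": record})
    return cases


cases = single_field_mutations(VALID_PROFILE, INVALID_VALUES)
print(len(cases))  # prints 19
```

Each case carries the name of the mutated field, which makes a parametrized test's failure message self-explanatory.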
Some edge cases are better enumerated than generated:
def generate_boundary_strings(max_length: int = 255) -> list[str]:
    """Generate systematic string boundary cases."""
    return [
        "",  # Empty
        " ",  # Single space
        " " * max_length,  # All spaces
        "a",  # Single char
        "a" * (max_length - 1),  # Max length - 1
        "a" * max_length,  # Max length
        "a" * (max_length + 1),  # Over max length
        "A" * max_length,  # All uppercase
        "1" * max_length,  # All digits
        "\n\t\r",  # Control characters
        "Hello\x00World",  # Null byte
        "🎉" * 20,  # Emoji (multi-byte)
        "مرحبا",  # RTL Arabic
        "こんにちは",  # Japanese
        "Ñoño",  # Latin extended
        "<script>alert(1)</script>",  # XSS attempt
        "'; DROP TABLE users; --",  # SQL injection attempt
        "../../../etc/passwd",  # Path traversal
        "A" * 10000,  # Very long string
    ]

def generate_numeric_boundaries(
    min_val: int | float = 0,
    max_val: int | float = 100
) -> list[int | float]:
    """Generate boundary values for numeric fields."""
    return [
        min_val,
        min_val - 1,  # Just below minimum
        min_val + 1,  # Just above minimum
        max_val,
        max_val + 1,  # Just above maximum
        max_val - 1,  # Just below maximum
        0,
        -1,
        1,
        -0.0,  # Negative zero (floats)
        float('inf'),  # Infinity
        float('-inf'),  # Negative infinity
        float('nan'),  # NaN
    ]

Data Augmentation
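Not every variation needs a model call. Character-level typos, for instance, can be injected deterministically with a seeded random generator, which keeps the augmented set reproducible and free. This is a rough sketch rather than a realistic keyboard-error model; `inject_typo` and `typo_variants` are hypothetical helper names, and the `"typo_augmented"` source tag is an assumption.

```python
import random


def inject_typo(text: str, rng: random.Random) -> str:
    """Apply one character-level edit: swap neighbours, drop a char, or duplicate a char."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)  # i + 1 is always a valid index
    kind = rng.choice(["swap", "drop", "duplicate"])
    if kind == "swap":
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    if kind == "drop":
        return text[:i] + text[i + 1:]
    return text[:i] + text[i] + text[i:]  # duplicate the character at position i


def typo_variants(example: dict, n: int, seed: int = 0) -> list[dict]:
    """Return n typo'd copies of a labeled example; the label is left unchanged."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    return [
        {
            "input": inject_typo(example["input"], rng),
            "label": example["label"],
            "source": "typo_augmented",
        }
        for _ in range(n)
    ]


original = {"input": "cancel my subscription", "label": "cancel"}
for variant in typo_variants(original, n=3, seed=42):
    print(variant["input"])
```

A single small edit rarely changes the meaning of a sentence, but it can on very short inputs, so spot-check the labels of short examples.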
Augment existing datasets to increase coverage:
# Literal braces are doubled because this template is filled with str.format()
AUGMENTATION_PROMPT = """
Given this original example:
Input: {original_input}
Label: {original_label}
Generate {n} paraphrased variations that:
1. Preserve the same meaning and label
2. Use different vocabulary and sentence structure
3. Vary in formality (some more casual, some more formal)
4. Include some with typos or informal spelling (10% of variations)
5. Vary length (some shorter, some longer)
Return a JSON object of the form: {{"examples": [{{"input": "...", "label": "{original_label}"}}]}}
"""

def augment_dataset(
    examples: list[dict],
    target_size: int,
    augmentation_factor: int | None = None
) -> list[dict]:
    """Augment a labeled dataset by paraphrasing existing examples."""
    if not examples:
        return []  # Nothing to paraphrase; also avoids dividing by zero below
    if augmentation_factor is None:
        augmentation_factor = max(1, target_size // len(examples))
    augmented = list(examples)  # Keep originals
    for example in examples:
        if len(augmented) >= target_size:
            break
        n_to_generate = min(augmentation_factor, target_size - len(augmented))
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # Cheaper for bulk augmentation
            messages=[{
                "role": "user",
                "content": AUGMENTATION_PROMPT.format(
                    original_input=example["input"],
                    original_label=example["label"],
                    n=n_to_generate
                )
            }],
            temperature=0.8,
            # JSON mode yields a top-level object, hence the "examples" wrapper key in the prompt
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        new_examples = result if isinstance(result, list) else result.get("examples", [])
        for new_example in new_examples:
            # Tag each paraphrase so it can be traced back to its source example
            augmented.append({
                **new_example,
                "source": "augmented",
                "original_id": example.get("id")
            })
    return augmented[:target_size]

Validating Synthetic Data Quality
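The two-sample Kolmogorov-Smirnov statistic used for the distribution checks in this section is simply the largest vertical gap between the two empirical CDFs: 0 means the samples are distributed identically, 1 means they do not overlap at all. A small standard-library sketch makes that concrete; `ks_statistic` is an illustrative helper for understanding the number, not a replacement for `scipy.stats.ks_2samp`, which also returns a p-value. The sample lengths are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Two-sample KS statistic: the maximum gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("Both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # Empirical CDFs are step functions, so the largest gap occurs at a data point
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Word counts: real messages vary widely, synthetic ones cluster around 10
real_lengths = [3, 5, 8, 12, 20]
synthetic_lengths = [9, 10, 10, 11, 11]
print(round(ks_statistic(real_lengths, synthetic_lengths), 3))  # prints 0.6
```

Here 60% of the real messages are 8 words or shorter while none of the synthetic ones are, which is the "too uniform" failure a length check is meant to catch.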
Synthetic data that doesn't match real distribution properties can produce misleading test results:
import numpy as np
from collections import Counter
from scipy.stats import ks_2samp, chi2_contingency
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def validate_synthetic_vs_real(
    synthetic: list[str],
    real: list[str],
    max_ks_statistic: float = 0.15
) -> dict:
    """
    Check if synthetic text data has similar distributional properties to real data.
    Uses semantic embeddings + length distribution comparison.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')
    # Compare length distributions (in words)
    real_lengths = [len(s.split()) for s in real]
    synthetic_lengths = [len(s.split()) for s in synthetic]
    ks_stat, ks_p = ks_2samp(real_lengths, synthetic_lengths)
    # Compare semantic distributions
    real_embeddings = model.encode(real[:200])  # Sample for speed
    synthetic_embeddings = model.encode(synthetic[:200])
    # Measure how far each text sits from the centroid of the real data
    real_centroid = real_embeddings.mean(axis=0)
    real_sim_to_centroid = cosine_similarity([real_centroid], real_embeddings)[0]
    synthetic_sim_to_centroid = cosine_similarity([real_centroid], synthetic_embeddings)[0]
    sem_ks_stat, sem_ks_p = ks_2samp(real_sim_to_centroid, synthetic_sim_to_centroid)
    quality_score = 1.0 - max(ks_stat, sem_ks_stat)
    warnings = []
    if ks_stat > max_ks_statistic:
        warnings.append(f"Length distribution differs (KS={ks_stat:.3f}). Synthetic text may be too uniform.")
    if sem_ks_stat > max_ks_statistic:
        warnings.append(f"Semantic distribution differs (KS={sem_ks_stat:.3f}). Synthetic text may lack topic diversity.")
    return {
        "quality_score": quality_score,
        "length_ks_statistic": ks_stat,
        "semantic_ks_statistic": sem_ks_stat,
        "warnings": warnings,
        "verdict": "acceptable" if not warnings else "needs_review"
    }

def validate_categorical_distribution(
    synthetic: list[dict],
    real: list[dict],
    categorical_fields: list[str]
) -> dict:
    """Check that categorical fields have similar distributions to real data."""
    results = {}
    for field in categorical_fields:
        real_counts = Counter(item.get(field) for item in real)
        synthetic_counts = Counter(item.get(field) for item in synthetic)
        # Fix one category order so both rows of the table line up
        all_categories = sorted(set(real_counts) | set(synthetic_counts), key=str)
        real_freq = [real_counts.get(c, 0) for c in all_categories]
        synthetic_freq = [synthetic_counts.get(c, 0) for c in all_categories]
        # Chi-square test of homogeneity on the 2 x k table of raw counts.
        # chi2_contingency derives the expected counts itself, so pass observed counts only.
        contingency = np.array([synthetic_freq, real_freq])
        chi2, p_value, *_ = chi2_contingency(contingency)
        results[field] = {
            "chi2": chi2,
            "p_value": p_value,
            # True means no significant difference was detected at the 5% level
            "distributions_match": p_value > 0.05,
            "real_distribution": dict(real_counts.most_common(5)),
            "synthetic_distribution": dict(synthetic_counts.most_common(5))
        }
    return results

Tracking Provenance
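A content fingerprint, computed as a truncated SHA-256 over key-sorted JSON in the same way as the record class in this section, also makes deduplication cheap, because two records with the same fields hash identically regardless of key order. A minimal standalone sketch, where `fingerprint` and `deduplicate` are illustrative function names:

```python
import hashlib
import json


def fingerprint(data: dict) -> str:
    """Stable 16-hex-character hash of a record's content."""
    content = json.dumps(data, sort_keys=True)  # sort_keys removes key-order differences
    return hashlib.sha256(content.encode()).hexdigest()[:16]


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct record, preserving order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        fp = fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            unique.append(record)
    return unique


a = {"plan": "free", "country_code": "US"}
b = {"country_code": "US", "plan": "free"}  # same content, different key order
c = {"plan": "pro", "country_code": "US"}
print(fingerprint(a) == fingerprint(b))  # prints True
print(len(deduplicate([a, b, c])))       # prints 2
```

Truncating to 16 hex characters keeps the fingerprint short at the cost of a small collision risk, which is usually acceptable for fixture-sized datasets.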
Every synthetic record needs a paper trail:
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SyntheticDataRecord:
    data: dict
    generation_prompt: str
    model: str
    model_version: str
    temperature: float
    seed: int | None
    generated_at: str
    schema_version: str

    @property
    def fingerprint(self) -> str:
        """Stable hash for deduplication."""
        content = json.dumps(self.data, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:16]

def generate_with_provenance(
    prompt: str,
    schema: type,
    n: int,
    model: str = "gpt-4o",
    temperature: float = 0.8
) -> list[SyntheticDataRecord]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        response_format={"type": "json_object"}
    )
    raw_items = json.loads(response.choices[0].message.content)
    # JSON mode wraps the list in an object; take the first value as the record list
    items = raw_items if isinstance(raw_items, list) else list(raw_items.values())[0]
    records = []
    for item in items:
        records.append(SyntheticDataRecord(
            data=item,
            generation_prompt=prompt,
            model=model,
            model_version=response.model,  # Exact model identifier reported by the API
            temperature=temperature,
            seed=None,
            generated_at=datetime.now().isoformat(),
            schema_version=schema.__name__ + "_v1"
        ))
    return records

def save_dataset_with_provenance(
    records: list[SyntheticDataRecord],
    output_file: str
):
    # One JSON object per line (JSONL), with everything needed to reproduce the record
    with open(output_file, "w") as f:
        for record in records:
            line = {
                "data": record.data,
                "provenance": {
                    "fingerprint": record.fingerprint,
                    "generation_prompt": record.generation_prompt,
                    "model": record.model,
                    "model_version": record.model_version,
                    "temperature": record.temperature,
                    "seed": record.seed,
                    "generated_at": record.generated_at,
                    "schema_version": record.schema_version
                }
            }
            f.write(json.dumps(line) + "\n")

CI Integration
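The freshness test in this section reads `tests/fixtures/user_profiles_metadata.json`, so something has to write that file when the fixture is generated. One possible sketch, assuming a sidecar named after the fixture with a `_metadata.json` suffix and holding `generated_at`; `write_fixture_metadata` and the extra `model` and `record_count` fields are assumptions for illustration.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def write_fixture_metadata(fixture_path: str, model: str, record_count: int) -> Path:
    """Write <fixture stem>_metadata.json next to the fixture and return its path."""
    fixture = Path(fixture_path)
    metadata_path = fixture.with_name(f"{fixture.stem}_metadata.json")
    metadata = {
        # Naive local time, so it can be subtracted from datetime.now() directly
        "generated_at": datetime.now().isoformat(),
        "model": model,
        "record_count": record_count,
    }
    metadata_path.write_text(json.dumps(metadata, indent=2))
    return metadata_path


# Demo in a temporary directory so nothing in the repository is touched
with tempfile.TemporaryDirectory() as tmp:
    fixture_file = Path(tmp) / "user_profiles.json"
    fixture_file.write_text("[]")
    meta_path = write_fixture_metadata(str(fixture_file), model="gpt-4o", record_count=0)
    loaded = json.loads(meta_path.read_text())
    age_days = (datetime.now() - datetime.fromisoformat(loaded["generated_at"])).days
    print(meta_path.name, age_days)  # prints: user_profiles_metadata.json 0
```

Calling this right after the fixture is saved keeps the timestamp and the data in step.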
# tests/test_synthetic_data_quality.py
import json
import warnings
from collections import Counter
from datetime import datetime
from pathlib import Path

import pytest

from myapp.schemas import UserProfile  # hypothetical import path; point this at your schema module

class TestSyntheticDataQuality:
    @pytest.fixture(scope="class")
    def synthetic_profiles(self):
        with open("tests/fixtures/user_profiles.json") as f:
            return json.load(f)

    def test_schema_compliance(self, synthetic_profiles):
        """Every record must pass schema validation."""
        invalid = []
        for i, profile in enumerate(synthetic_profiles):
            try:
                UserProfile(**profile)
            except Exception as e:
                invalid.append({"index": i, "error": str(e)})
        assert not invalid, f"{len(invalid)} records fail schema validation:\n" + json.dumps(invalid[:3], indent=2)

    def test_plan_distribution(self, synthetic_profiles):
        """Plan distribution should be roughly 70/25/5."""
        counts = Counter(p["plan"] for p in synthetic_profiles)
        total = len(synthetic_profiles)
        free_pct = counts["free"] / total
        assert 0.60 <= free_pct <= 0.80, f"Free plan % out of range: {free_pct:.0%}"

    def test_no_real_pii(self, synthetic_profiles):
        """No email addresses on well-known real company domains."""
        blocked_domains = {"google.com", "apple.com", "microsoft.com", "amazon.com"}
        for profile in synthetic_profiles:
            domain = profile["email"].split("@")[-1]
            assert domain not in blocked_domains, f"Real company domain found: {domain}"

    def test_country_diversity(self, synthetic_profiles):
        """Should have at least 5 different country codes."""
        countries = set(p["country_code"] for p in synthetic_profiles)
        assert len(countries) >= 5, f"Insufficient country diversity: {countries}"

    def test_data_freshness(self, synthetic_profiles):
        """Synthetic data should be regenerated if older than 90 days."""
        metadata_file = Path("tests/fixtures/user_profiles_metadata.json")
        if not metadata_file.exists():
            pytest.skip("No metadata file; skipping freshness check")
        metadata = json.loads(metadata_file.read_text())
        generated_at = datetime.fromisoformat(metadata["generated_at"])
        age_days = (datetime.now() - generated_at).days
        if age_days > 90:
            # Emit a warning rather than fail; pytest.warns() only asserts that a warning was raised
            warnings.warn(f"Synthetic data is {age_days} days old. Consider regenerating.", UserWarning)

Monitoring Synthetic Data Health with HelpMeTest
Synthetic datasets age — as your application evolves, the coverage they provide can become stale. HelpMeTest can run weekly checks to validate that your synthetic fixtures still reflect current schema requirements:
*** Test Cases ***
Weekly Synthetic Data Health Check
    [Documentation]    Verify synthetic test fixtures are valid and fresh
    ${result}=    Run Schema Validation    tests/fixtures/user_profiles.json
    Should Be True    ${result.valid_count} > 190
    ...    msg=Too many invalid records: ${result.invalid_count} failures
    ${age_days}=    Get Fixture Age Days    tests/fixtures/user_profiles.json
    Should Be True    ${age_days} < 90
    ...    msg=Synthetic data is ${age_days} days old; regeneration recommended

Summary
Effective synthetic test data generation:
- Schema-based generation — give the LLM your schema and constraints, not examples to copy
- Edge case discovery — prompt for failure modes, not happy paths
- Programmatic boundaries — enumerate boundary values systematically, not via LLM
- Augmentation — paraphrase existing labeled examples to expand coverage
- Validation — test that synthetic distribution matches real data statistically
- Provenance tracking — every synthetic record knows how it was generated
The payoff: comprehensive test coverage without real user data, covering the edge cases that real datasets underrepresent.