AI Testing

Synthetic Test Data Generation with LLMs: Edge Cases at Scale

HelpMeTest

19 May 2026 — 9 min read

Real test data is scarce, sensitive, and hard to maintain. LLMs can generate realistic synthetic test data at scale — covering edge cases that real datasets miss and eliminating PII concerns. This guide covers practical techniques for generating test data with LLMs: schema-based generation, edge case discovery, data augmentation, and validating that synthetic data matches real distribution properties.

Key Takeaways

Describe the schema, not the examples. Give the LLM your data schema and constraints, then let it generate examples — not the other way around. Seeding with real examples biases the LLM toward what you already have.

Prompt for failure modes, not happy paths. Ask the LLM to generate data that would break your system. This produces more valuable test data than asking for "typical examples."

Validate synthetic data against real data statistically. Distribution match tests catch when synthetic data drifts from realistic patterns — before those unrealistic inputs corrupt your test results.

Generate edge cases programmatically, not just with LLMs. Boundary values, encoding edge cases, and constraint violations are better enumerated systematically than generated by LLM prompting.

Track synthetic data provenance. Tag every synthetic sample with its generation prompt and model version so you can reproduce and audit it. Synthetic data that can't be explained is a liability.

Why Synthetic Test Data

Test data generation faces three constraints:

Scarcity — you don't have enough real data covering the edge cases you need
Privacy — real user data contains PII that can't safely be used in test environments
Coverage — the long tail of rare inputs that cause bugs is underrepresented in real data

LLMs address all three. A well-prompted model can generate thousands of realistic synthetic records, covering rare cases, with no real-user PII.

The risk: synthetic data can be unrealistic in subtle ways, matching the surface form of real data while missing distributional properties that matter for testing. This guide covers how to generate synthetic data well and how to validate that it is good enough.

Schema-Based Generation

The most reliable approach: give the LLM a schema and generate from it:

from openai import OpenAI
from pydantic import BaseModel
from typing import Literal
import json

client = OpenAI()

class UserProfile(BaseModel):
    """Schema for user profile test data."""
    user_id: str
    email: str
    name: str
    plan: Literal["free", "pro", "enterprise"]
    test_count: int  # 0-10 for free, unlimited for paid
    is_verified: bool
    country_code: str  # ISO 3166-1 alpha-2

GENERATION_PROMPT = """
Generate {n} realistic user profile records as a JSON object with a single key "profiles" whose value is an array of records.

Schema:
{schema}

Requirements:
- user_id: format "usr_" + 12 random alphanumeric characters
- email: realistic-looking local parts on reserved domains such as example.com, example.org, or example.net; never real provider or company domains
- name: realistic names from diverse cultures (not all Western names)
- plan: distribute as 70% free, 25% pro, 5% enterprise
- test_count: 0-10 for free users, 0-50 for pro, 0-500 for enterprise
- is_verified: 80% true for pro/enterprise, 60% for free
- country_code: diverse; include US, GB, DE, IN, BR, JP, NG, AU, CA, FR

Return only the JSON object, no explanation.
"""

def generate_user_profiles(n: int = 100) -> list[UserProfile]:
    schema_str = json.dumps(UserProfile.model_json_schema(), indent=2)  # Pydantic v2; use .schema() on v1
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": GENERATION_PROMPT.format(n=n, schema=schema_str)
        }],
        temperature=0.8,  # Some randomness for diversity
        response_format={"type": "json_object"}
    )
    
    raw = json.loads(response.choices[0].message.content)
    profiles_data = raw if isinstance(raw, list) else raw.get("profiles", raw.get("data", []))
    
    validated = []
    for item in profiles_data:
        try:
            validated.append(UserProfile(**item))
        except Exception as e:
            print(f"Skipping invalid record: {e}")
    
    return validated

# Generate and save
profiles = generate_user_profiles(200)
print(f"Generated {len(profiles)} valid profiles")

with open("tests/fixtures/user_profiles.json", "w") as f:
    json.dump([p.model_dump() for p in profiles], f, indent=2)  # Pydantic v2; use .dict() on v1
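
The Pydantic model above checks types only, so a free-plan record with `test_count: 400` would still pass. Below is a minimal sketch of a post-generation check for the cross-field rules stated in the prompt. The names `check_profile` and `PLAN_LIMITS` are illustrative, and the limits are assumed to be the ranges the prompt requests.

```python
import re

# Upper bounds on test_count per plan, taken from the generation prompt above.
PLAN_LIMITS = {"free": 10, "pro": 50, "enterprise": 500}
USER_ID_RE = re.compile(r"usr_[A-Za-z0-9]{12}")
COUNTRY_RE = re.compile(r"[A-Z]{2}")


def check_profile(profile: dict) -> list[str]:
    """Return a list of constraint violations for one generated profile."""
    problems = []
    if not USER_ID_RE.fullmatch(str(profile.get("user_id", ""))):
        problems.append("user_id does not match usr_ + 12 alphanumerics")
    if not COUNTRY_RE.fullmatch(str(profile.get("country_code", ""))):
        problems.append("country_code is not two uppercase letters")

    plan = profile.get("plan")
    count = profile.get("test_count")
    limit = PLAN_LIMITS.get(plan)
    if limit is None:
        problems.append(f"unknown plan: {plan!r}")
    elif isinstance(count, bool) or not isinstance(count, int) or not 0 <= count <= limit:
        problems.append(f"test_count {count!r} outside 0-{limit} for plan {plan}")
    return problems
```

Running this over every record before writing the fixture file gives a quick count of how often the model ignored a stated constraint, which is worth knowing before trusting the batch.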

Structured Output for Reliable Parsing

Use JSON mode or structured outputs to avoid parsing failures:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TestTransaction(BaseModel):
    id: str
    amount_cents: int
    currency: str
    status: str
    merchant: str
    card_last4: str
    timestamp: str

class TransactionBatch(BaseModel):
    transactions: list[TestTransaction]

def generate_transactions_typed(
    n: int,
    scenario: str = "normal"
) -> list[TestTransaction]:
    """Generate transactions with full type safety via structured outputs."""
    
    scenario_prompts = {
        "normal": "Generate typical e-commerce transactions",
        "edge_cases": "Generate transactions with unusual characteristics: very high amounts, tiny amounts ($0.01), foreign currencies, long merchant names, failed statuses",
        "fraud_patterns": "Generate transaction sequences that look like fraud patterns: rapid sequential charges, geographic impossibilities, round-number charges"
    }
    
    prompt = f"""
{scenario_prompts[scenario]}.

Requirements:
- amount_cents: integer (avoid neat round numbers unless the scenario asks for them)
- currency: mostly USD, some EUR/GBP/JPY/BRL
- status: completed (70%), pending (20%), failed (10%)
- merchant: realistic merchant names (not real company names)
- card_last4: 4 digit string "0000"-"9999"  
- timestamp: ISO 8601, within last 30 days

Generate exactly {n} transactions.
"""
    
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format=TransactionBatch
    )
    
    return result.choices[0].message.parsed.transactions
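
Structured outputs guarantee the shape of the response, not the count: a request for "exactly {n}" items can still come back short or with repeated IDs. One way to handle that is a top-up loop, sketched here. `generate_batch` is a hypothetical callable standing in for a function like `generate_transactions_typed`, and the sketch assumes records are plain dicts with an `id` key.

```python
from typing import Callable


def collect_exactly(
    n: int,
    generate_batch: Callable[[int], list[dict]],
    key: str = "id",
    max_calls: int = 5,
) -> list[dict]:
    """Call generate_batch until n unique records are collected or max_calls is reached."""
    seen: set = set()
    collected: list[dict] = []
    for _ in range(max_calls):
        missing = n - len(collected)
        if missing <= 0:
            break
        # Ask only for what is still missing, then drop duplicates by key.
        for record in generate_batch(missing):
            if record[key] not in seen and len(collected) < n:
                seen.add(record[key])
                collected.append(record)
    return collected
```

The `max_calls` cap keeps a misbehaving prompt from looping forever; if the result is still short, the caller can decide whether a smaller batch is acceptable.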

Edge Case Discovery

Ask the LLM to find failure modes:

EDGE_CASE_PROMPT = """
You are testing a {system_description}.

Think about:
1. Input boundary conditions (empty, null, maximum length, minimum values)
2. Unusual but valid inputs that might trigger bugs
3. Inputs that violate assumptions developers commonly make
4. Internationalization edge cases (RTL text, emoji, unusual Unicode, very long names in some cultures)
5. Semantic edge cases (inputs that are grammatically valid but semantically unusual)

For the following function/feature: {feature_description}

Generate {n} test inputs that cover these edge cases. For each, explain why it might expose a bug.

Return as a JSON object: {{"cases": [{{"input": "...", "category": "...", "why_problematic": "..."}}]}}
"""

def discover_edge_cases(
    system_description: str,
    feature_description: str,
    n: int = 30
) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": EDGE_CASE_PROMPT.format(
                system_description=system_description,
                feature_description=feature_description,
                n=n
            )
        }],
        temperature=0.9,  # High temperature for more creative edge cases
        response_format={"type": "json_object"}
    )
    
    result = json.loads(response.choices[0].message.content)
    return result if isinstance(result, list) else result.get("cases", [])

# Example usage
edge_cases = discover_edge_cases(
    system_description="A customer support chatbot that handles billing inquiries",
    feature_description="A function that extracts the customer's account ID from free-text messages",
    n=25
)

for case in edge_cases[:5]:
    print(f"Input: {case['input']!r}")
    print(f"Category: {case['category']}")
    print(f"Why problematic: {case['why_problematic']}\n")
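
Discovered edge cases only pay off once they are run against the code under test. The sketch below feeds each input to a function and records which ones raise. `run_edge_cases` is an illustrative helper, and `extract_account_id` is a deliberately naive toy function, not the real extractor.

```python
from typing import Any, Callable


def run_edge_cases(func: Callable[[str], Any], cases: list[dict]) -> list[dict]:
    """Feed each discovered input to func and record the ones that raise."""
    failures = []
    for case in cases:
        try:
            func(case["input"])
        except Exception as exc:  # broad on purpose: any crash is a finding
            failures.append({
                "input": case["input"],
                "category": case.get("category", "uncategorized"),
                "error": f"{type(exc).__name__}: {exc}",
            })
    return failures


def extract_account_id(message: str) -> str:
    # Toy function under test (hypothetical): assumes an "ACC-" token is always present.
    return message.split("ACC-")[1].split()[0]


cases = [
    {"input": "My account is ACC-12345, please help", "category": "typical"},
    {"input": "", "category": "empty"},
    {"input": "ACC-", "category": "truncated"},
]
failures = run_edge_cases(extract_account_id, cases)
for failure in failures:
    print(failure["category"], "->", failure["error"])
```

A crash is only the bluntest signal. Inputs that return a wrong value silently need expected outputs, which the LLM's `why_problematic` notes can help a reviewer write by hand.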

Programmatic Edge Case Generation

Some edge cases are better enumerated than generated:

def generate_boundary_strings(max_length: int = 255) -> list[str]:
    """Generate systematic string boundary cases."""
    return [
        "",                              # Empty
        " ",                             # Single space
        " " * max_length,               # All spaces
        "a",                             # Single char
        "a" * (max_length - 1),         # Max length - 1
        "a" * max_length,               # Max length
        "a" * (max_length + 1),         # Over max length
        "A" * max_length,               # All uppercase
        "1" * max_length,               # All digits
        "\n\t\r",                        # Control characters
        "Hello\x00World",               # Null byte
        "🎉" * 20,                       # Emoji (multi-byte)
        "مرحبا",                          # RTL Arabic
        "こんにちは",                        # Japanese
        "Ñoño",                          # Latin extended
        "<script>alert(1)</script>",    # XSS attempt
        "'; DROP TABLE users; --",      # SQL injection attempt
        "../../../etc/passwd",          # Path traversal
        "A" * 10000,                    # Very long string
    ]

def generate_numeric_boundaries(
    min_val: int | float = 0,
    max_val: int | float = 100
) -> list[int | float]:
    """Generate boundary values for numeric fields."""
    return [
        min_val,
        min_val - 1,        # Just below minimum
        min_val + 1,        # Just above minimum
        max_val,
        max_val + 1,        # Just above maximum
        max_val - 1,        # Just below maximum
        0,
        -1,
        1,
        -0.0,               # Negative zero (floats)
        float('inf'),       # Infinity
        float('-inf'),      # Negative infinity
        float('nan'),       # NaN
    ]
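
Boundary lists become more useful when combined across fields, since many bugs need two unusual values at once. Here is a small sketch that builds one record per combination using a Cartesian product. `boundary_records` and `is_finite_number` are illustrative names, and the full product is an assumption that only suits a handful of fields, because the record count multiplies quickly.

```python
import itertools
import math


def boundary_records(field_values: dict[str, list]) -> list[dict]:
    """Build one record per combination of per-field boundary values."""
    names = list(field_values)
    return [
        dict(zip(names, combo))
        for combo in itertools.product(*(field_values[name] for name in names))
    ]


def is_finite_number(value) -> bool:
    """True for ints and finite floats; False for NaN, infinities, bools, and non-numbers."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


records = boundary_records({
    "name": ["", "a" * 255, "a" * 256],
    "quantity": [0, -1, 100, 101, float("nan")],
})
print(len(records))  # 3 names x 5 quantities = 15 records
```

`is_finite_number` is there because NaN compares unequal to itself, so assertions like `result == expected` quietly fail on records that contain it; splitting finite and non-finite cases up front avoids confusing test failures.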

Data Augmentation

Augment existing datasets to increase coverage:

AUGMENTATION_PROMPT = """
Given this original example:

Input: {original_input}
Label: {original_label}

Generate {n} paraphrased variations that:
1. Preserve the same meaning and label
2. Use different vocabulary and sentence structure
3. Vary in formality (some more casual, some more formal)
4. Include some with typos or informal spelling (10% of variations)
5. Vary length (some shorter, some longer)

Return as a JSON object: {{"examples": [{{"input": "...", "label": "{original_label}"}}]}}
"""

def augment_dataset(
    examples: list[dict],
    target_size: int,
    augmentation_factor: int | None = None
) -> list[dict]:
    """Augment a labeled dataset by paraphrasing existing examples."""
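    if not examples:
        return []
    if augmentation_factor is None:
        # Variations needed per original to reach target_size, rounded up
        needed = max(0, target_size - len(examples))
        augmentation_factor = max(1, -(-needed // len(examples)))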
    
    augmented = list(examples)  # Keep originals
    
    for example in examples:
        if len(augmented) >= target_size:
            break
        
        n_to_generate = min(augmentation_factor, target_size - len(augmented))
        
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # Cheaper for bulk augmentation
            messages=[{
                "role": "user",
                "content": AUGMENTATION_PROMPT.format(
                    original_input=example["input"],
                    original_label=example["label"],
                    n=n_to_generate
                )
            }],
            temperature=0.8,
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        new_examples = result if isinstance(result, list) else result.get("examples", [])
        
        for new_example in new_examples:
            augmented.append({
                **new_example,
                "source": "augmented",
                "original_id": example.get("id")
            })
    
    return augmented[:target_size]
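
Paraphrase output is worth filtering before it joins the dataset: some variations are near copies of the original, and some quietly change the label. The sketch below uses `difflib.SequenceMatcher` from the standard library as a rough character-level similarity measure. `filter_augmentations` and the 0.9 cutoff are assumptions to tune, not recommended values.

```python
from difflib import SequenceMatcher


def filter_augmentations(
    original: dict,
    candidates: list[dict],
    max_similarity: float = 0.9,
) -> list[dict]:
    """Drop candidates whose label changed or that nearly duplicate the original or each other."""
    kept: list[dict] = []
    for cand in candidates:
        if cand.get("label") != original["label"]:
            continue  # the model changed the label
        text = cand.get("input", "").strip().lower()
        if not text:
            continue  # empty paraphrase
        references = [original["input"]] + [k["input"] for k in kept]
        if any(
            SequenceMatcher(None, text, ref.strip().lower()).ratio() > max_similarity
            for ref in references
        ):
            continue  # near-duplicate adds no coverage
        kept.append(cand)
    return kept
```

Character similarity misses paraphrases that drift in meaning while keeping the label field intact, so a sample of the kept variations still deserves a human read.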

Validating Synthetic Data Quality

Synthetic data that doesn't match real distribution properties can produce misleading test results:

import numpy as np
from scipy.stats import ks_2samp, chi2_contingency
from sentence_transformers import SentenceTransformer

def validate_synthetic_vs_real(
    synthetic: list[str],
    real: list[str],
    max_ks_statistic: float = 0.15
) -> dict:
    """
    Check if synthetic text data has similar distributional properties to real data.
    Uses semantic embeddings + length distribution comparison.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # Compare length distributions
    real_lengths = [len(s.split()) for s in real]
    synthetic_lengths = [len(s.split()) for s in synthetic]
    
    ks_stat, ks_p = ks_2samp(real_lengths, synthetic_lengths)
    
    # Compare semantic distributions
    real_embeddings = model.encode(real[:200])  # First 200 items only, for speed; shuffle first if the lists are ordered
    synthetic_embeddings = model.encode(synthetic[:200])
    
    from sklearn.metrics.pairwise import cosine_similarity
    
    real_centroid = real_embeddings.mean(axis=0)
    
    real_sim_to_centroid = cosine_similarity([real_centroid], real_embeddings)[0]
    synthetic_sim_to_centroid = cosine_similarity([real_centroid], synthetic_embeddings)[0]
    
    sem_ks_stat, sem_ks_p = ks_2samp(real_sim_to_centroid, synthetic_sim_to_centroid)
    
    quality_score = 1.0 - max(ks_stat, sem_ks_stat)
    
    warnings = []
    if ks_stat > max_ks_statistic:
        warnings.append(f"Length distribution differs (KS={ks_stat:.3f}). Synthetic text may be too uniform.")
    if sem_ks_stat > max_ks_statistic:
        warnings.append(f"Semantic distribution differs (KS={sem_ks_stat:.3f}). Synthetic text may lack topic diversity.")
    
    return {
        "quality_score": quality_score,
        "length_ks_statistic": ks_stat,
        "semantic_ks_statistic": sem_ks_stat,
        "warnings": warnings,
        "verdict": "acceptable" if not warnings else "needs_review"
    }

def validate_categorical_distribution(
    synthetic: list[dict],
    real: list[dict],
    categorical_fields: list[str]
) -> dict:
    """Check that categorical fields have similar distributions to real data."""
    results = {}
    
    for field in categorical_fields:
        from collections import Counter
        
        real_counts = Counter(item.get(field) for item in real)
        synthetic_counts = Counter(item.get(field) for item in synthetic)
        
        # Get all categories
        all_categories = set(real_counts) | set(synthetic_counts)
        
        real_freq = [real_counts.get(c, 0) for c in all_categories]
        synthetic_freq = [synthetic_counts.get(c, 0) for c in all_categories]
        
        # Chi-square test of homogeneity on the raw counts (rows: synthetic, real).
        # chi2_contingency expects observed counts, so the real counts are not rescaled.
        table = np.array([synthetic_freq, real_freq])
        
        chi2, p_value, *_ = chi2_contingency(table)
        
        results[field] = {
            "chi2": chi2,
            "p_value": p_value,
            "distributions_match": p_value > 0.05,
            "real_distribution": dict(real_counts.most_common(5)),
            "synthetic_distribution": dict(synthetic_counts.most_common(5))
        }
    
    return results
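
To make the `max_ks_statistic` threshold concrete: the two-sample KS statistic is the largest vertical gap between the two empirical CDFs, D = max over x of |F_real(x) - F_synthetic(x)|. The pure-Python sketch below computes that definition directly. The name `ks_statistic` is illustrative; in practice `ks_2samp` does this and also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


real_lengths = [3, 5, 8, 12, 20]        # word counts: short and long messages
synthetic_lengths = [9, 10, 10, 11, 11]  # word counts: suspiciously uniform
d = ks_statistic(real_lengths, synthetic_lengths)
print(f"D = {d:.2f}")
```

In this toy sample, 60% of real messages have 8 words or fewer while no synthetic message does, so D is 0.6, well above the 0.15 default. That is the "too uniform" pattern the length warning is meant to catch.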

Tracking Provenance

Every synthetic record needs a paper trail:

import hashlib
from dataclasses import dataclass, asdict

@dataclass
class SyntheticDataRecord:
    data: dict
    generation_prompt: str
    model: str
    model_version: str
    temperature: float
    seed: int | None
    generated_at: str
    schema_version: str
    
    @property
    def fingerprint(self) -> str:
        """Stable hash for deduplication."""
        content = json.dumps(self.data, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:16]

def generate_with_provenance(
    prompt: str,
    schema: type,
    n: int,
    model: str = "gpt-4o",
    temperature: float = 0.8
) -> list[SyntheticDataRecord]:
    from datetime import datetime
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        response_format={"type": "json_object"}
    )
    
    raw_items = json.loads(response.choices[0].message.content)
    items = raw_items if isinstance(raw_items, list) else list(raw_items.values())[0]
    
    records = []
    for item in items:
        records.append(SyntheticDataRecord(
            data=item,
            generation_prompt=prompt,
            model=model,
            model_version=response.model,
            temperature=temperature,
            seed=None,
            generated_at=datetime.now().isoformat(),
            schema_version=schema.__name__ + "_v1"
        ))
    
    return records

def save_dataset_with_provenance(
    records: list[SyntheticDataRecord],
    output_file: str
):
    with open(output_file, "w") as f:
        for record in records:
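            line = {
                "data": record.data,
                "provenance": {
                    "fingerprint": record.fingerprint,
                    "generation_prompt": record.generation_prompt,
                    "model": record.model,
                    "model_version": record.model_version,
                    "temperature": record.temperature,
                    "seed": record.seed,
                    "generated_at": record.generated_at,
                    "schema_version": record.schema_version
                }
            }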
            f.write(json.dumps(line) + "\n")
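
The fingerprint earns its keep when several generation runs are merged into one file. Here is a minimal sketch of reading the JSONL back and keeping the first record per fingerprint, assuming the one-object-per-line layout written above. `dedupe_jsonl` is an illustrative name, and the hash is recomputed from `data` rather than trusted from the stored field.

```python
import hashlib
import json
from typing import Iterable


def fingerprint(data: dict) -> str:
    """Same hash as SyntheticDataRecord.fingerprint: sorted-key JSON, SHA-256, first 16 hex chars."""
    content = json.dumps(data, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()[:16]


def dedupe_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines and keep the first record for each data fingerprint."""
    seen: set[str] = set()
    unique = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        fp = fingerprint(record["data"])
        if fp not in seen:
            seen.add(fp)
            unique.append(record)
    return unique
```

Because an open file is an iterable of lines, `dedupe_jsonl(open(path))` works as is. Sorting keys before hashing means two records with the same fields in a different order count as duplicates.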

CI Integration

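# tests/test_synthetic_data_quality.py
import json
from datetime import datetime
from pathlib import Path

import pytest

# Assumption: UserProfile lives in a module of your own; adjust this import path.
from tests.schemas import UserProfile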

class TestSyntheticDataQuality:
    
    @pytest.fixture(scope="class")
    def synthetic_profiles(self):
        with open("tests/fixtures/user_profiles.json") as f:
            return json.load(f)
    
    def test_schema_compliance(self, synthetic_profiles):
        """Every record must pass schema validation."""
        invalid = []
        for i, profile in enumerate(synthetic_profiles):
            try:
                UserProfile(**profile)
            except Exception as e:
                invalid.append({"index": i, "error": str(e)})
        
        assert not invalid, f"{len(invalid)} records fail schema validation:\n" + json.dumps(invalid[:3], indent=2)
    
    def test_plan_distribution(self, synthetic_profiles):
        """Plan distribution should be roughly 70/25/5."""
        from collections import Counter
        counts = Counter(p["plan"] for p in synthetic_profiles)
        total = len(synthetic_profiles)
        
        free_pct = counts["free"] / total
        assert 0.60 <= free_pct <= 0.80, f"Free plan % out of range: {free_pct:.0%}"
    
    def test_no_real_pii(self, synthetic_profiles):
        """Emails must not use blocked real company domains (names are not checked here)."""
        blocked_domains = {"google.com", "apple.com", "microsoft.com", "amazon.com"}
        
        for profile in synthetic_profiles:
            domain = profile["email"].split("@")[-1]
            assert domain not in blocked_domains, f"Real company domain found: {domain}"
    
    def test_country_diversity(self, synthetic_profiles):
        """Should have at least 5 different country codes."""
        countries = set(p["country_code"] for p in synthetic_profiles)
        assert len(countries) >= 5, f"Insufficient country diversity: {countries}"
    
    def test_data_freshness(self, synthetic_profiles):
        """Synthetic data should be regenerated if older than 90 days."""
        metadata_file = Path("tests/fixtures/user_profiles_metadata.json")
        if not metadata_file.exists():
            pytest.skip("No metadata file — skipping freshness check")
        
        metadata = json.loads(metadata_file.read_text())
        generated_at = datetime.fromisoformat(metadata["generated_at"])
        age_days = (datetime.now() - generated_at).days
        
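        if age_days > 90:
            # pytest.warns() asserts that code raises a warning; to emit one, use warnings.warn()
            import warnings
            warnings.warn(f"Synthetic data is {age_days} days old. Consider regenerating.", UserWarning)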
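
The freshness test reads `user_profiles_metadata.json`, which nothing earlier in this guide writes. One possible way to produce it alongside the fixture is sketched below. `build_fixture_metadata` and `write_fixture` are illustrative names, and any key other than `generated_at` is an assumption about what is worth recording.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_fixture_metadata(record_count: int, model: str, now: Optional[datetime] = None) -> dict:
    """Describe a fixture file; generated_at is the key the freshness test reads."""
    # Naive local time on purpose: the freshness test subtracts it from datetime.now().
    now = now or datetime.now()
    return {
        "generated_at": now.isoformat(),
        "record_count": record_count,
        "model": model,
    }


def write_fixture(records: list[dict], fixture_path: Path, model: str) -> Path:
    """Write the fixture plus a sibling <name>_metadata.json file; return the metadata path."""
    fixture_path.write_text(json.dumps(records, indent=2))
    metadata_path = fixture_path.with_name(f"{fixture_path.stem}_metadata.json")
    metadata = build_fixture_metadata(len(records), model)
    metadata_path.write_text(json.dumps(metadata, indent=2))
    return metadata_path
```

Writing both files in one function keeps them from drifting apart: a regenerated fixture always gets a new timestamp.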

Monitoring Synthetic Data Health with HelpMeTest

Synthetic datasets age — as your application evolves, the coverage they provide can become stale. HelpMeTest can run weekly checks to validate that your synthetic fixtures still reflect current schema requirements:

*** Test Cases ***
Weekly Synthetic Data Health Check
    [Documentation]    Verify synthetic test fixtures are valid and fresh
    ${result}=    Run Schema Validation    tests/fixtures/user_profiles.json
    Should Be True    ${result.valid_count} > 190    
    ...    msg=Too many invalid records: ${result.invalid_count} failures
    
    ${age_days}=    Get Fixture Age Days    tests/fixtures/user_profiles.json
    Should Be True    ${age_days} < 90    
    ...    msg=Synthetic data is ${age_days} days old — regeneration recommended
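
`Run Schema Validation` and `Get Fixture Age Days` are not defined in the snippet above. In Robot Framework-style syntax, a custom keyword can be a plain Python function, with `get_fixture_age_days` matching the keyword `Get Fixture Age Days`. The sketch below shows what such keywords might look like under that assumption. Whether HelpMeTest loads custom Python libraries this way is not confirmed here, and the validity rule (all schema fields present, known plan) is deliberately simplified.

```python
import json
import time
from dataclasses import dataclass
from pathlib import Path

# Field names from the UserProfile schema earlier in this guide.
REQUIRED_FIELDS = {
    "user_id", "email", "name", "plan", "test_count", "is_verified", "country_code",
}
VALID_PLANS = {"free", "pro", "enterprise"}


@dataclass
class ValidationResult:
    valid_count: int
    invalid_count: int


def run_schema_validation(path: str) -> ValidationResult:
    """Count records that have every required field and a known plan."""
    records = json.loads(Path(path).read_text())
    valid = sum(
        1
        for record in records
        if isinstance(record, dict)
        and REQUIRED_FIELDS.issubset(record)
        and record["plan"] in VALID_PLANS
    )
    return ValidationResult(valid_count=valid, invalid_count=len(records) - valid)


def get_fixture_age_days(path: str) -> int:
    """Whole days since the fixture file was last modified."""
    age_seconds = max(0.0, time.time() - Path(path).stat().st_mtime)
    return int(age_seconds // 86400)
```

File modification time is a stand-in for generation time here; a checkout or copy resets it, so reading `generated_at` from a metadata file is the more reliable source when one exists.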

Summary

Effective synthetic test data generation:

Schema-based generation — give the LLM your schema and constraints, not examples to copy
Edge case discovery — prompt for failure modes, not happy paths
Programmatic boundaries — enumerate boundary values systematically, not via LLM
Augmentation — paraphrase existing labeled examples to expand coverage
Validation — test that synthetic distribution matches real data statistically
Provenance tracking — every synthetic record knows how it was generated

The payoff: comprehensive test coverage without real user data, covering the edge cases that real datasets underrepresent.