Generating Test Data with AI: A Practical Guide

AI can generate realistic, varied test data faster than manual creation or factory libraries. The key is prompting for diversity and edge cases, not just valid examples. This guide covers practical patterns for using LLMs to generate test data across formats — JSON fixtures, database seeds, CSV files, and domain-specific data like user profiles, addresses, and financial records.

Key Takeaways

Ask for edge cases explicitly. LLMs default to clean, typical examples. You must specifically request boundary values, invalid data, Unicode strings, and unusual combinations.

Generate data in batches with variation. A single "generate a user object" prompt produces one user. "Generate 20 users with varied nationalities, ages (including minors and seniors), and unusual name formats" produces a useful test dataset.

Use schema-guided generation. Paste your data schema (JSON schema, TypeScript interface, database table) into the prompt. This ensures the generated data matches your actual data model.

Validate generated data before using it. LLMs sometimes generate data that looks valid but violates real constraints (malformed email addresses, birthdates in the future, phone numbers with the wrong digit count). Always run generated data through your validators.

AI-generated data is not production data. An LLM can mimic the structure of your data, but not its real distributions. When you need anonymized production data, use dedicated anonymization tools instead.

Why Test Data Is the Unglamorous Blocker

Test data problems kill test suites. Not code problems — data problems.

A developer writes a comprehensive integration test for an order processing system. The test needs 10 orders: some with multiple items, some with discounts, some with different shipping addresses, some in different states (pending, shipped, cancelled). Creating that data manually takes an hour. The developer writes the test for one order and calls it done.

This is not laziness. It's a rational response to an unrewarding task. Creating diverse, representative test data is slow, tedious, and feels disconnected from the actual problem being solved.

AI changes this calculus. With a well-structured prompt, you can generate 50 varied, realistic test objects in 30 seconds. The bottleneck is reviewing and validating the output, not generating it.

The Fundamentals of AI Test Data Generation

Prompt Structure

The most effective test data prompts follow this pattern:

  1. Schema definition — paste your data model
  2. Volume — how many records
  3. Variation requirements — what aspects should vary
  4. Edge cases — what unusual cases to include
  5. Output format — JSON array, SQL inserts, Python dict, etc.

A minimal prompt:

Generate 15 user objects as a JSON array matching this TypeScript interface:

interface User {
  id: string;        // UUID format
  email: string;
  name: string;
  age: number;
  country: string;   // ISO 3166-1 alpha-2 code
  createdAt: string; // ISO 8601 date
  tier: 'free' | 'pro' | 'enterprise';
}

Requirements:
- Vary countries (include US, UK, DE, JP, BR, NG, IN, AU)
- Ages should range from 18 to 80, including some at the boundary (18, 19)
- Mix of tiers: roughly 60% free, 30% pro, 10% enterprise
- Some non-Latin names (Japanese, Arabic, Hindi)
- createdAt dates spread over the last 3 years

Output: valid JSON array only, no explanations.

The schema removes guesswork. The variation requirements prevent the LLM from defaulting to "John Smith, age 30, US, free tier" for every record.
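
The five-part pattern above can also be assembled in code, which keeps prompts consistent across a team. The following is a minimal sketch, not a library API: the `DataPrompt` class and its field names are hypothetical, and the schema string is shortened for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataPrompt:
    """The five parts of a test data prompt (hypothetical helper)."""
    schema: str
    count: int
    record_name: str = "record"
    variations: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
    output_format: str = "valid JSON array only, no explanations"

    def render(self) -> str:
        # Schema and volume come first so the model sees the data shape up front
        lines = [
            f"Generate {self.count} {self.record_name} objects matching this schema:",
            "",
            self.schema.strip(),
            "",
        ]
        if self.variations:
            lines.append("Requirements:")
            lines += [f"- {item}" for item in self.variations]
            lines.append("")
        if self.edge_cases:
            lines.append("Edge cases to include:")
            lines += [f"- {item}" for item in self.edge_cases]
            lines.append("")
        lines.append(f"Output: {self.output_format}.")
        return "\n".join(lines)


prompt = DataPrompt(
    schema="interface User { id: string; email: string; age: number; }",
    count=15,
    record_name="user",
    variations=["Vary countries (include US, UK, DE, JP, BR, NG, IN, AU)"],
    edge_cases=["Ages at the boundary (18, 19)"],
).render()
print(prompt)
```

Keeping the variation and edge-case lists as data makes it easy to append whatever a review of the first output shows was missing.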

Schema-Guided Generation

Always paste your actual schema. LLMs that generate data without schema context produce structurally plausible but often incompatible objects — wrong field names, wrong types, missing required fields.

For database tables, paste the CREATE TABLE statement:

-- Paste this
CREATE TABLE orders (
  id UUID PRIMARY KEY,
  user_id UUID NOT NULL,
  status VARCHAR(20) NOT NULL CHECK (status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled')),
  total_amount DECIMAL(10,2) NOT NULL,
  currency CHAR(3) NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  shipped_at TIMESTAMPTZ,
  items JSONB NOT NULL
);

-- Generate 20 INSERT statements with varied statuses and amounts.
-- shipped_at should be NULL for pending/processing orders and populated for shipped/delivered.
-- items should be a JSON array of {product_id, quantity, unit_price} objects.
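
The rules stated in the comments above (status values, when shipped_at may be NULL, the shape of items) are worth checking mechanically once the rows exist. This sketch assumes the generated orders are available as Python dicts, for example because the same data was requested as JSON or read back from the test database; the function name `order_problems` is hypothetical.

```python
ALLOWED_STATUSES = {"pending", "processing", "shipped", "delivered", "cancelled"}
NEEDS_SHIPPED_AT = {"shipped", "delivered"}
NO_SHIPPED_AT = {"pending", "processing"}
ITEM_KEYS = {"product_id", "quantity", "unit_price"}


def order_problems(order: dict) -> list[str]:
    """Return human-readable rule violations for one generated order."""
    problems = []
    status = order.get("status")
    shipped_at = order.get("shipped_at")

    if status not in ALLOWED_STATUSES:
        problems.append(f"unknown status {status!r}")
    if status in NEEDS_SHIPPED_AT and shipped_at is None:
        problems.append(f"{status} order is missing shipped_at")
    if status in NO_SHIPPED_AT and shipped_at is not None:
        problems.append(f"{status} order should not have shipped_at")

    # items is NOT NULL in the schema; each entry needs the three requested keys
    items = order.get("items")
    if not isinstance(items, list):
        problems.append("items is not a JSON array")
    else:
        for position, item in enumerate(items):
            missing = ITEM_KEYS - set(item)
            if missing:
                problems.append(f"item {position} is missing {sorted(missing)}")
    return problems
```

The prompt says nothing about shipped_at for cancelled orders, so the sketch leaves that case unchecked rather than guessing a rule.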

For JSON APIs, paste the OpenAPI schema:

# Paste the schema component
UserCreateRequest:
  type: object
  required: [email, name, password]
  properties:
    email:
      type: string
      format: email
    name:
      type: string
      minLength: 1
      maxLength: 100
    password:
      type: string
      minLength: 8
    referralCode:
      type: string
      pattern: '^[A-Z0-9]{8}$'

LLMs are much more likely to respect format constraints (email format, the pattern for referralCode) when they are explicit in the schema, but the output still needs validating.

Generating Edge Cases

The default LLM output avoids edge cases. You must request them explicitly.

Common Edge Case Categories

String fields:

  • Empty string (if allowed by schema)
  • Very long strings (near the maxLength limit)
  • Strings with special characters: apostrophes, quotes, ampersands
  • Non-Latin characters: CJK, Arabic, Cyrillic, emoji
  • Strings that look like code: <script>, '; DROP TABLE, ${variable}
  • Whitespace variations: leading/trailing spaces, tab characters, multiple spaces

Numeric fields:

  • Zero
  • Negative numbers (if the field allows them)
  • Very large values (near overflow)
  • Floating point boundary values

Date fields:

  • Past dates vs future dates
  • Dates in different timezones
  • Edge of DST transitions
  • Year 2000 and 19 January 2038 (where 32-bit signed Unix timestamps overflow)

Relationships:

  • Objects with no related records (empty collections)
  • Objects with the maximum allowed related records
  • Circular references (if your data model allows them)

Prompt for edge cases explicitly:

Generate 30 product objects as JSON. Include the following edge cases:
- 3 products with names containing apostrophes (e.g., "Children's Book")
- 2 products with names in Japanese
- 2 products with price exactly 0.00 (free products)
- 2 products with very long descriptions (400+ characters)
- 1 product with an empty string description (allowed by schema)
- 3 products with prices at common boundary values: 0.01, 0.99, 9.99
- The remaining products should be varied typical cases
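
Models do not always deliver the exact counts requested, so it helps to confirm that the edge cases actually made it into the output. The sketch below counts each case from the prompt above; it assumes the products carry `name`, `price`, and `description` fields, and the function names are hypothetical.

```python
from decimal import Decimal

BOUNDARY_PRICES = {Decimal("0.01"), Decimal("0.99"), Decimal("9.99")}


def has_japanese(text: str) -> bool:
    # Hiragana, Katakana, and the main CJK ideograph block.
    # Kanji alone cannot distinguish Japanese from Chinese, so treat this as a rough check.
    return any("\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff" for ch in text)


def price_of(product: dict) -> Decimal:
    # Going through str() avoids float artifacts such as 0.1 + 0.2 != 0.3
    return Decimal(str(product["price"]))


def edge_case_coverage(products: list[dict]) -> dict[str, tuple[int, int]]:
    """Map each requested edge case to (found, requested) counts."""
    checks = {
        "name with apostrophe": (lambda p: "'" in p["name"] or "\u2019" in p["name"], 3),
        "Japanese name": (lambda p: has_japanese(p["name"]), 2),
        "price is 0.00": (lambda p: price_of(p) == 0, 2),
        "description 400+ chars": (lambda p: len(p["description"]) >= 400, 2),
        "empty description": (lambda p: p["description"] == "", 1),
        "boundary price": (lambda p: price_of(p) in BOUNDARY_PRICES, 3),
    }
    return {
        label: (sum(1 for p in products if matches(p)), wanted)
        for label, (matches, wanted) in checks.items()
    }


def missing_edge_cases(products: list[dict]) -> list[str]:
    """Labels of edge cases that appear fewer times than requested."""
    return [
        label
        for label, (found, wanted) in edge_case_coverage(products).items()
        if found < wanted
    ]
```

The list returned by `missing_edge_cases` can be pasted straight into a follow-up prompt asking for the cases that were skipped.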

Domain-Specific Data Generation

Different domains require domain-specific data that general prompts won't produce correctly.

Financial Data

Generate 25 transaction records for a payment processing system. Include:
- Varying amounts from $0.01 to $50,000.00
- Multiple currencies: USD, EUR, GBP, JPY (note JPY has no decimal places)
- Mix of transaction types: charge, refund, chargeback, dispute
- Realistic card brands: Visa, Mastercard, Amex, Discover
- Some failed transactions with error codes: insufficient_funds, card_declined, expired_card
- Timestamps spread across the last 6 months

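Format: JSON array. Express every amount as an integer in the currency's minor unit (cents for USD, whole yen for JPY), not as a float.

Asking for integer minor units prevents the classic floating-point money bug that LLMs often introduce, where amounts such as 0.10 and 0.20 cannot be represented exactly.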
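
On the test side, integer amounts need converting to and from display values without passing through floats. A minimal sketch using the standard library's Decimal follows; the exponent table covers only the four currencies in the prompt above, and the helper names are hypothetical.

```python
from decimal import Decimal

# Decimal places per currency (ISO 4217 minor units) for the currencies in the prompt
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "GBP": 2, "JPY": 0}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as '19.99' into integer minor units."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    scaled = Decimal(amount).scaleb(exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has too many decimal places for {currency}")
    return int(scaled)


def format_minor_units(value: int, currency: str) -> str:
    """Render integer minor units back into a display string."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    return f"{Decimal(value).scaleb(-exponent):.{exponent}f} {currency}"


print(to_minor_units("50000.00", "USD"))
print(to_minor_units("500", "JPY"))
print(format_minor_units(1999, "USD"))
```

Rejecting amounts like "500.5" for JPY catches records where the model ignored the note about decimal places.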

Address Data

Generate 20 shipping addresses for a US/UK e-commerce system. Include:
- Mix of US and UK addresses (12 US, 8 UK)
- Vary US state codes across multiple regions
- UK postcodes in valid format (e.g., SW1A 2AA, M1 1AE, EH1 1YZ)
- Some apartment/suite numbers
- 2 addresses with only ASCII characters but unusual formatting (all caps, mixed case)
- Valid ZIP codes for the US states used (don't mix state and ZIP code)

Format: JSON array with fields: line1, line2 (nullable), city, state_or_county, postal_code, country_code.

The "don't mix state and ZIP code" instruction prevents a common LLM error where it generates a valid-looking ZIP code that belongs to a different state.
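
Postal codes are easy to check after generation. The sketch below validates the format of UK postcodes and US ZIP codes and, for a few states only, whether the ZIP prefix plausibly belongs to the state. The prefix table is illustrative and deliberately incomplete; a real check would need a full ZIP-to-state dataset. The function name is hypothetical, and accepting both "GB" and "UK" as country codes is an assumption about what the model returns.

```python
import re

# General UK postcode shape: outward code, a space, then digit + two letters
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$")
US_ZIP = re.compile(r"^[0-9]{5}(-[0-9]{4})?$")

# Approximate 3-digit ZIP prefix ranges for three states (illustrative, incomplete)
ZIP_PREFIX_RANGES = {"NY": (100, 149), "IL": (600, 629), "CA": (900, 961)}


def address_problems(address: dict) -> list[str]:
    """Return format and consistency problems for one generated address."""
    problems = []
    code = address["postal_code"]
    country = address["country_code"]

    if country in {"GB", "UK"}:
        if not UK_POSTCODE.match(code.upper()):
            problems.append(f"malformed UK postcode {code!r}")
    elif country == "US":
        if not US_ZIP.match(code):
            problems.append(f"malformed ZIP code {code!r}")
        else:
            bounds = ZIP_PREFIX_RANGES.get(address["state_or_county"])
            if bounds and not bounds[0] <= int(code[:3]) <= bounds[1]:
                problems.append(
                    f"ZIP {code} is outside the range for {address['state_or_county']}"
                )
    else:
        problems.append(f"unexpected country code {country!r}")
    return problems
```

A format match does not prove that a postcode is actually in use, only that it has a valid shape.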

User Profile Data

Generate 40 user profiles for a social platform. Requirements:
- 8 different nationalities with culturally appropriate names
- Ages: 18–75, with 5 users at exactly 18 (boundary)
- 6 users with verified accounts, 34 unverified
- Bio field: empty strings (10 users), short bios (15 users), medium-length bios (10 users), long bios near the 200-character limit (5 users)
- 4 users with display names that contain numbers
- 3 users with display names that contain hyphens or underscores
- Profile picture: 70% with URLs (valid HTTPS image URLs), 30% null

Format: JSON array. Use null for missing optional fields such as the profile picture, not empty strings; an empty bio stays an empty string.

Loading Generated Data into Tests

JavaScript/TypeScript with Jest

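// fixtures/users.json is generated by AI and committed to the repo.
// Importing JSON in TypeScript requires "resolveJsonModule": true in tsconfig.json.
import users from './fixtures/users.json';
// Assumed location of the function under test; adjust to your project layout.
import { validateUserAge } from '../src/userService';

describe('UserService', () => {
  describe('validateUserAge', () => {
    // Use generated data for parametrized tests
    const adultUsers = users.filter(u => u.age >= 18);
    const minorUsers = users.filter(u => u.age < 18);

    test.each(adultUsers)('accepts adult user: $name (age $age)', (user) => {
      expect(validateUserAge(user)).toBe(true);
    });

    // Only runs if you generated minor users for edge cases;
    // test.each throws when given an empty table.
    if (minorUsers.length > 0) {
      test.each(minorUsers)('rejects minor user: $name (age $age)', (user) => {
        expect(validateUserAge(user)).toBe(false);
      });
    }
  });
});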

Python with pytest

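import json
from pathlib import Path

import pytest

# Resolve the fixture relative to this test file so the tests run from any working directory
FIXTURE_PATH = Path(__file__).parent / 'fixtures' / 'orders.json'

with FIXTURE_PATH.open(encoding='utf-8') as f:
    test_orders = json.load(f)

pending_orders = [o for o in test_orders if o['status'] == 'pending']
shipped_orders = [o for o in test_orders if o['status'] == 'shipped']

# One test case per generated order; pytest skips the test if a list is empty
@pytest.mark.parametrize('order', pending_orders)
def test_pending_orders_have_no_shipped_at(order):
    assert order['shipped_at'] is None

@pytest.mark.parametrize('order', shipped_orders)
def test_shipped_orders_have_shipped_at(order):
    assert order['shipped_at'] is not None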

Database Seeding

For integration tests that need database-backed data:

-- AI-generated seed file for integration tests
-- Run before test suite: psql test_db < seeds/users.sql

INSERT INTO users (id, email, name, age, country, created_at, tier) VALUES
('550e8400-e29b-41d4-a716-446655440000', 'alice@example.com', 'Alice Johnson', 28, 'US', '2024-01-15T10:00:00Z', 'pro'),
('550e8400-e29b-41d4-a716-446655440001', 'yamamoto@example.jp', '山本 太郎', 42, 'JP', '2023-06-01T08:30:00Z', 'free'),
-- ... 48 more rows

Generate this as a complete SQL file with a prompt like: "Generate 50 SQL INSERT statements for the users table (schema attached). Include a TRUNCATE statement first for clean state. Vary the data as specified."
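
If the data already exists as a JSON fixture, loading it with parameterized inserts avoids hand-written SQL and the quoting problems that names with apostrophes cause. The sketch below uses SQLite from the Python standard library as a stand-in for the PostgreSQL database in the example above; the table definition is simplified, the third row is a hypothetical addition, and `seed_users` is a hypothetical helper.

```python
import json
import sqlite3

# Two rows from the seed file above plus one hypothetical row with an apostrophe
USERS_JSON = """
[
  {"id": "550e8400-e29b-41d4-a716-446655440000", "email": "alice@example.com",
   "name": "Alice Johnson", "age": 28, "country": "US",
   "created_at": "2024-01-15T10:00:00Z", "tier": "pro"},
  {"id": "550e8400-e29b-41d4-a716-446655440001", "email": "yamamoto@example.jp",
   "name": "山本 太郎", "age": 42, "country": "JP",
   "created_at": "2023-06-01T08:30:00Z", "tier": "free"},
  {"id": "550e8400-e29b-41d4-a716-446655440002", "email": "obrien@example.com",
   "name": "Siobhán O'Brien", "age": 18, "country": "IE",
   "created_at": "2022-11-30T23:59:59Z", "tier": "enterprise"}
]
"""


def seed_users(conn: sqlite3.Connection, users: list[dict]) -> None:
    """Reset the users table and load the fixture rows."""
    conn.execute("DELETE FROM users")  # clean state, the role TRUNCATE plays in PostgreSQL
    # Named placeholders let the driver handle quoting and non-Latin text
    conn.executemany(
        "INSERT INTO users (id, email, name, age, country, created_at, tier) "
        "VALUES (:id, :email, :name, :age, :country, :created_at, :tier)",
        users,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL, name TEXT NOT NULL, "
    "age INTEGER NOT NULL, country TEXT NOT NULL, created_at TEXT NOT NULL, "
    "tier TEXT NOT NULL CHECK (tier IN ('free', 'pro', 'enterprise')))"
)
seed_users(conn, json.loads(USERS_JSON))
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

Calling `seed_users` from a test setup hook gives every run the same starting state.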

Validating Generated Data

Never use AI-generated test data without validation. Run it through your application's validators before loading it into tests.

import json
import sys

from jsonschema import FormatChecker, ValidationError, validate

with open('fixtures/generated-users.json') as f:
    users = json.load(f)

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["id", "email", "name", "age", "country"],
        "properties": {
            "id": {"type": "string", "format": "uuid"},
            "email": {"type": "string", "format": "email"},
            "age": {"type": "integer", "minimum": 18, "maximum": 120}
        }
    }
}

try:
    # "format" keywords are only enforced when a format checker is passed.
    # The built-in email check is lenient, so keep your application's validators too.
    validate(users, schema, format_checker=FormatChecker())
    print(f"✓ {len(users)} users validated successfully")
except ValidationError as e:
    print(f"✗ Validation failed: {e.message}")
    print(f"  at: {e.json_path}")
    sys.exit(1)  # fail loudly so bad fixtures never reach the test suite

Common validation failures in AI-generated data:

  • Email addresses with invalid format (user @example.com, user@)
  • Phone numbers with wrong digit count for the country
  • Dates in mixed formats (some ISO 8601, some MM/DD/YYYY)
  • UUIDs that aren't valid v4 UUIDs
  • Enum values outside the allowed set

Run validation in a preprocessing step that fails loudly with the specific record that failed. Debugging test failures caused by bad fixture data is expensive.
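
One way to get a report that names every failing record is to check records one at a time and collect the errors by index. The sketch below uses only the standard library and targets the failure types listed above; it assumes the field names from the User interface earlier in this guide (`id`, `email`, `createdAt`, `tier`), and the email pattern is a deliberately simple sanity check, not a full address validator.

```python
import re
import uuid
from datetime import datetime

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simple sanity check only
TIERS = {"free", "pro", "enterprise"}


def is_uuid4(value) -> bool:
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # Require the canonical hyphenated form and version 4
    return parsed.version == 4 and str(parsed) == value.lower()


def is_iso8601(value) -> bool:
    try:
        # Older Python versions do not accept a trailing "Z", so normalize it
        datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (ValueError, AttributeError):
        return False
    return True


def record_errors(user: dict) -> list[str]:
    """Return the problems found in a single user record."""
    errors = []
    if not is_uuid4(user.get("id")):
        errors.append(f"id is not a v4 UUID: {user.get('id')!r}")
    email = user.get("email")
    if not (isinstance(email, str) and EMAIL_RE.match(email)):
        errors.append(f"email is malformed: {email!r}")
    if not is_iso8601(user.get("createdAt")):
        errors.append(f"createdAt is not ISO 8601: {user.get('createdAt')!r}")
    if user.get("tier") not in TIERS:
        errors.append(f"tier is not an allowed value: {user.get('tier')!r}")
    return errors


def validate_or_raise(users: list[dict]) -> None:
    """Raise one error that lists every invalid record by index."""
    failures = {i: errs for i, u in enumerate(users) if (errs := record_errors(u))}
    if failures:
        report = "\n".join(f"record {i}: {'; '.join(errs)}" for i, errs in failures.items())
        raise ValueError(f"{len(failures)} invalid record(s):\n{report}")
```

Reporting all failures at once means a single follow-up prompt can ask the model to fix every bad record.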

Iteration Pattern

Start simple and add specificity:

  1. First prompt: Schema + "generate 20 varied examples"
  2. Review output: What's missing? What's wrong?
  3. Second prompt: "Add these specific edge cases: [list what was missing]"
  4. Validate: Run through schema validator
  5. Fix: Ask AI to fix specific records that failed validation
  6. Commit: Store as static fixtures in the repo

This iteration typically takes 5–15 minutes; creating the same dataset manually can take 1–3 hours.

What AI Test Data Generation Doesn't Replace

Production data anonymization. If you need data that reflects real production patterns (distribution of plan tiers, geographic distribution of users, average order values), generate it from production data using anonymization tools, not LLMs. LLMs don't know your production distribution.

Performance and load test data. Generating 10,000 records via an LLM is slow and expensive. For large-scale data needs, use a seeder library (Faker.js, Python Faker, Java DataFaker) with a fixed random seed so that every run produces the same dataset.

Stateful test data. AI generates individual objects well. It generates data with complex interdependencies (a user with 5 orders, each with line items, each with inventory adjustments) less reliably. For complex relational fixtures, write a seed script.

Security test data. Data specifically designed to test for injection vulnerabilities, XSS, or other security issues should come from security testing tools and resources (OWASP test strings), not general-purpose LLMs.
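
For the large-volume and relational cases above, a small seeded script is usually enough. The sketch below uses the standard library's random module as a stand-in for a seeder library such as Faker; the field names, the tier weights, and the `make_dataset` function are assumptions for illustration.

```python
import random
import uuid

STATUSES = ["pending", "processing", "shipped", "delivered", "cancelled"]
TIERS = ["free", "pro", "enterprise"]


def make_dataset(n_users: int, seed: int = 1234, max_orders: int = 5):
    """Build users and their orders; the same seed always yields the same data."""
    rng = random.Random(seed)

    def new_id() -> str:
        # Derive UUIDs from the seeded generator so IDs are reproducible too
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))

    users, orders = [], []
    for n in range(n_users):
        user = {
            "id": new_id(),
            "email": f"user{n}@example.com",
            "tier": rng.choices(TIERS, weights=[60, 30, 10])[0],
        }
        users.append(user)
        # Some users get no orders at all, which covers the empty-collection case
        for _ in range(rng.randint(0, max_orders)):
            items = [
                {
                    "product_id": new_id(),
                    "quantity": rng.randint(1, 5),
                    "unit_price_cents": rng.randint(1, 50_000),
                }
                for _ in range(rng.randint(1, 4))
            ]
            orders.append({
                "id": new_id(),
                "user_id": user["id"],  # the foreign key always points at a real user
                "status": rng.choice(STATUSES),
                "items": items,
                "total_cents": sum(i["quantity"] * i["unit_price_cents"] for i in items),
            })
    return users, orders


users, orders = make_dataset(10_000)
print(len(users), len(orders))
```

Because relationships are built in code, every order references an existing user and every total matches its line items, which is the part LLM output gets wrong most often.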

Summary

AI test data generation removes the main excuse for not having diverse, comprehensive test fixtures: the time cost. With a well-structured prompt, you can generate realistic, varied datasets in minutes rather than hours.

The workflow is: schema + variation requirements → generate → validate → fix → commit. Keep the fixtures in the repo as static files that any test can load. Review them when the schema changes.

The time you save on data generation should go into writing better assertions — the tests that actually catch bugs.
