Synthetic Test Data Generation: Tools and Techniques
Synthetic test data generation solves one of the hardest problems in software testing: how do you create realistic data that exercises your application without using real user information?
This guide covers the tools and techniques that QA teams use to generate test data that's realistic enough to catch real bugs while remaining safe, fast, and repeatable.
What Is Synthetic Test Data?
Synthetic test data is artificially generated data that mimics the structure and statistical characteristics of real data without containing any actual user information.
It's not random garbage — good synthetic data respects the constraints of your domain:
- Email addresses look like email addresses (not abc123@xyz)
- Phone numbers match regional formats
- Dates fall within plausible ranges
- Related records are internally consistent (an order's user ID refers to a real user)
The goal is data that exercises real code paths without touching real PII.
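The constraints listed above can be checked mechanically before a generated dataset is used. The following is a minimal sketch, assuming users and orders are plain dictionaries with the hypothetical keys `id`, `email`, `phone`, `birthdate`, and `user_id`; the regular expressions are deliberately simple stand-ins, not complete validators.

```python
import re
from datetime import date

# Deliberately simple patterns; real validation rules are domain-specific.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")
US_PHONE_RE = re.compile(r"^\+1 \d{3}-\d{3}-\d{4}$")


def find_violations(users, orders, today=None):
    """Return a human-readable description of every broken constraint."""
    today = today or date.today()
    problems = []
    user_ids = {u["id"] for u in users}
    for u in users:
        if not EMAIL_RE.match(u["email"]):
            problems.append(f"user {u['id']}: malformed email {u['email']!r}")
        if not US_PHONE_RE.match(u["phone"]):
            problems.append(f"user {u['id']}: malformed phone {u['phone']!r}")
        if not date(1900, 1, 1) <= u["birthdate"] <= today:
            problems.append(f"user {u['id']}: implausible birthdate {u['birthdate']}")
    for o in orders:
        # Referential consistency: every order must point at a generated user.
        if o["user_id"] not in user_ids:
            problems.append(f"order {o['id']}: unknown user {o['user_id']}")
    return problems
```

Running a check like this once per generated dataset, and failing fast when the list is non-empty, keeps a broken generator from producing confusing downstream test failures.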
Why Not Use Production Data?
Teams often start by copying production data into their test environments. This seems pragmatic but creates compounding problems:
- Compliance risk: even "anonymized" production data can be re-identified. Under GDPR and HIPAA, and in SOC 2 audits, test environments that hold production data are in scope.
- Volume: production databases are often too large for fast test runs. A 500GB production dump is impractical in CI.
- Coupling: your tests become dependent on specific production records that change over time.
- Staleness: data copied six months ago doesn't reflect current business rules or validation constraints.
Synthetic data avoids all of these.
Core Technique: Faker Libraries
Faker libraries are the foundation of synthetic data generation. They provide realistic values for common data types:
```typescript
import { faker } from '@faker-js/faker';

const user = {
  name: faker.person.fullName(), // "Marilyn Flatley"
  email: faker.internet.email(), // "marilyn.flatley@example.com"
  phone: faker.phone.number(), // "+1 555-867-5309"
  address: faker.location.streetAddress(), // "742 Evergreen Terrace"
  birthdate: faker.date.birthdate(), // 1987-04-23T00:00:00.000Z
};
```

Popular Faker libraries by language:
| Language | Library | GitHub stars (approx.) |
|---|---|---|
| JavaScript/TypeScript | @faker-js/faker | 13k+ |
| Python | Faker | 18k+ |
| Ruby | faker gem | 11k+ |
| Java | JavaFaker | 4k+ |
| Go | gofakeit | 4k+ |
| PHP | FakerPHP/Faker | 12k+ |
Faker is best for generating individual field values. For generating entire object graphs with relationships, you need something more.
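Generated values are only repeatable if the random source is seeded. Python's Faker exposes `Faker.seed()` and @faker-js/faker exposes `faker.seed()` for this. The sketch below shows the same idea with only the standard library so it runs anywhere; the name lists and the `make_user` / `make_users` helpers are illustrative assumptions, not part of any Faker API.

```python
import random

FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton"]


def make_user(rng: random.Random) -> dict:
    """Build one fake user from the supplied random source."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first}.{last}@example.com".lower(),
        "phone": f"+1 555-{rng.randint(0, 999):03d}-{rng.randint(0, 9999):04d}",
    }


def make_users(seed: int, count: int) -> list[dict]:
    # A private generator keeps results independent of global random state.
    rng = random.Random(seed)
    return [make_user(rng) for _ in range(count)]
```

Calling `make_users(42, 5)` twice yields identical lists, so a failing test can be reproduced by logging the seed it ran with.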
Factories: Structured Synthetic Data
A factory combines Faker values with your domain model to produce valid, internally consistent objects:
```typescript
// Using fishery (TypeScript)
import { Factory } from 'fishery';
import { faker } from '@faker-js/faker';

const userFactory = Factory.define<User>(() => ({
  id: faker.string.uuid(),
  name: faker.person.fullName(),
  email: faker.internet.email(),
  role: 'user',
  createdAt: faker.date.past(),
}));

const orderFactory = Factory.define<Order>(({ associations }) => ({
  id: faker.string.uuid(),
  // Use the supplied user if there is one; otherwise build a throwaway user
  userId: associations.user?.id ?? userFactory.build().id,
  // fractionDigits replaces the deprecated precision option (@faker-js/faker v8.4+)
  total: faker.number.float({ min: 10, max: 1000, fractionDigits: 2 }),
  status: 'pending',
}));

// Usage
const admin = userFactory.build({ role: 'admin' });
const order = orderFactory.build({}, { associations: { user: admin } });
```

Factories handle:
- Default values for fields you don't care about in a specific test
- Relationship management (order belongs to user)
- Trait variants (admin user, suspended user, verified user)
Tool: Mockaroo
Mockaroo is a web-based tool for generating large datasets in CSV, JSON, SQL, or Excel format. You define a schema (column names + data types) and download up to 1,000 rows free.
Best for:
- One-time data dumps for performance testing
- Seeding databases with realistic baseline data
- Teams that need data without writing code
Limitations:
- Not integrated into your test suite directly
- Data is static once downloaded
- Free tier limits (1,000 rows per file and only a small daily API request quota)
Tool: Mimesis
Mimesis is a Python library similar to Faker, with an emphasis on generation speed and broad locale coverage:

```python
from mimesis import Person, Address
from mimesis.locales import Locale

person = Person(Locale.EN)
address = Address(Locale.EN)

print(person.full_name())  # "Emma Johnson"
print(address.city())  # "San Francisco"
print(person.email())  # "emma.johnson@example.com"
```

Mimesis is often cited as roughly 10x faster than Faker for large-scale generation, which matters when seeding databases with millions of rows. The exact gap depends on which providers you call, so benchmark on your own schema.
Tool: Gretel.ai
Gretel is an AI-powered synthetic data platform. Instead of configuring schemas manually, you give it a sample of your real data (or a schema), and it generates statistically similar synthetic data that preserves the distributions and correlations of the original.
What makes it different:
- Learns statistical distributions from real data (without storing the real data)
- Preserves correlations between columns (if age and income correlate in real data, they correlate in synthetic data)
- Built-in privacy guarantees (differential privacy, re-identification risk scores)
- Works on structured data, time series, and text
Best for: Compliance-heavy industries (healthcare, finance) where you need to prove the synthetic data is safe.
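To make "preserves correlations between columns" concrete, here is a toy sketch that fits two numeric columns and samples new pairs with the same means, spreads, and Pearson correlation. It assumes the columns are roughly jointly normal and is in no way Gretel's method, which uses learned generative models and privacy mechanisms this example lacks. The function names `pearson`, `fit`, and `sample` are illustrative.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(xs, ys):
    """Summarise two columns by mean, standard deviation, and correlation."""
    return {
        "mx": statistics.fmean(xs), "sx": statistics.stdev(xs),
        "my": statistics.fmean(ys), "sy": statistics.stdev(ys),
        "r": pearson(xs, ys),
    }


def sample(params, n, rng):
    """Draw n synthetic (x, y) pairs from a bivariate normal with the fitted parameters."""
    r = params["r"]
    residual = math.sqrt(max(0.0, 1 - r * r))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mixing z1 into y is what carries the correlation across.
        y = params["my"] + params["sy"] * (r * z1 + residual * z2)
        rows.append((x, y))
    return rows
```

Only the five summary statistics leave `fit`, so no original row appears in the output. That alone is not a privacy guarantee, which is where the platform's formal mechanisms come in.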
Tool: Tonic.ai
Tonic is an enterprise synthetic data platform positioned specifically as a "production data de-risker." It connects to your production database, learns its structure and relationships, and generates a synthetic replica that's safe for non-production environments.
Key features:
- Preserves referential integrity across tables automatically
- Subsetting: generate a small sample (for example, 1%) that keeps related rows together, so the subset still exercises the same code paths
- Consistent anonymization (the same real email always maps to the same fake email)
- Works with PostgreSQL, MySQL, SQL Server, MongoDB, and more
Best for: Large-scale enterprise applications with complex relational schemas.
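Consistent anonymization, where the same real email always maps to the same fake email, can be illustrated with a keyed hash. This is a minimal sketch of the general idea and not Tonic's implementation; `SECRET_KEY` and `pseudonymize_email` are hypothetical names, and a real deployment would load the key from a secret store.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from a secret store, never from source control.
SECRET_KEY = b"rotate-me-outside-version-control"


def pseudonymize_email(real_email: str, key: bytes = SECRET_KEY) -> str:
    """Map a real address to a stable fake one on the reserved example.com domain."""
    normalized = real_email.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:16]}@example.com"
```

Because the mapping is deterministic for a given key, joins across tables that share an email column still line up after anonymization. Using HMAC rather than a bare hash means the mapping cannot be rebuilt by someone who lacks the key.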
Technique: Domain-Specific Generators
For domains with specialized data formats, generic Faker values aren't enough. You need generators that understand your domain's rules:
Healthcare:
```python
import random

from faker import Faker

faker = Faker()

# Minimal FHIR-style Patient resource; validate against your FHIR profile before relying on it
patient = {
    "resourceType": "Patient",
    "id": faker.uuid4(),
    "identifier": [{"system": "http://hospital.smarthealthit.org", "value": faker.numerify("######")}],
    "name": [{"use": "official", "family": faker.last_name(), "given": [faker.first_name()]}],
    "gender": random.choice(["male", "female", "other", "unknown"]),
    "birthDate": faker.date_of_birth(minimum_age=0, maximum_age=120).isoformat(),
}
```

Financial:
- Valid credit card numbers (Luhn algorithm check)
- Realistic transaction amounts that match merchant categories
- Plausible account balances
E-commerce:
- SKUs that match your product taxonomy
- Orders where quantities and prices produce plausible totals
- Shipping addresses in regions where you actually ship
Generic Faker gives you the fields. Domain knowledge gives you the constraints.
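As one example of a domain constraint, the Luhn check mentioned under Financial can be built into a generator so every fake card number passes checksum validation. This is a sketch using only the standard library; the function names are illustrative, and the default prefix "4" is simply a Visa-style leading digit, not a real issuer range.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes `partial + digit` pass the Luhn check."""
    total = 0
    # Walk right to left. The rightmost digit of `partial` sits next to
    # the check digit, so it is the first one to be doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Validate a full number, check digit included."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def fake_card_number(rng: random.Random, prefix: str = "4", length: int = 16) -> str:
    """Generate a random number of the given length that passes the Luhn check."""
    body_len = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randint(0, 9)) for _ in range(body_len))
    return body + luhn_check_digit(body)
```

Numbers produced this way pass format validation in application code but are not tied to any real account, which is the property a payment-form test needs.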
Technique: State Machine-Based Generation
Some test scenarios require data that tells a coherent story over time. An order can't be shipped before it's confirmed. A user can't log in before they're verified.
State machine generators produce sequences of records that respect these temporal constraints:
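```python
from datetime import datetime, timedelta, timezone
from uuid import uuid4

from faker import Faker

faker = Faker()


def generate_order_lifecycle():
    order_id = str(uuid4())
    user_id = str(uuid4())
    created_at = datetime.now(timezone.utc)  # capture once so every offset shares one base time
    events = [
        {"type": "order_created", "order_id": order_id, "user_id": user_id, "at": created_at},
        {"type": "payment_confirmed", "order_id": order_id, "at": created_at + timedelta(minutes=5)},
        {"type": "order_shipped", "order_id": order_id, "tracking": faker.numerify("1Z###########"), "at": created_at + timedelta(hours=24)},
        {"type": "order_delivered", "order_id": order_id, "at": created_at + timedelta(days=3)},
    ]
    return events
```

This produces event sequences that are internally consistent, which is useful for testing event-sourced systems or analytics pipelines.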
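The lifecycle above follows a single fixed path. To cover branches such as cancellations and refunds, the allowed transitions can be written as a table and walked at random. The sketch below assumes a hypothetical order workflow; the state names and the `TRANSITIONS` table are illustrative and should be replaced with your own.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical order workflow: each state lists the states that may follow it.
TRANSITIONS = {
    "created": ["paid", "cancelled"],
    "paid": ["shipped", "refunded"],
    "shipped": ["delivered"],
    "delivered": [],
    "cancelled": [],
    "refunded": [],
}


def generate_lifecycle(order_id, rng, start=None):
    """Walk the transition table from 'created' until a terminal state is reached."""
    at = start or datetime.now(timezone.utc)
    state = "created"
    events = [{"order_id": order_id, "state": state, "at": at}]
    while TRANSITIONS[state]:
        state = rng.choice(TRANSITIONS[state])
        # Each step adds a positive delay, so timestamps only move forward.
        at += timedelta(minutes=rng.randint(1, 72 * 60))
        events.append({"order_id": order_id, "state": state, "at": at})
    return events
```

Because every event is derived from the table, an impossible sequence such as "shipped" before "paid" cannot be generated, and passing a seeded `random.Random` makes a given sequence reproducible.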
Choosing the Right Tool
| Use Case | Recommended Tool |
|---|---|
| Unit/integration tests | Faker + Factories |
| Database seeding | Faker + seed scripts |
| Large dataset generation | Mockaroo or Mimesis |
| Compliance-sensitive data | Gretel.ai or Tonic.ai |
| Complex relational data | Tonic.ai |
| Domain-specific data | Custom generators on top of Faker |
Performance Considerations
Generating thousands of records in-process during test runs adds latency. Strategies to mitigate:
- Pre-generate seed data — generate once, commit to the repo, load from file in CI
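- Use in-memory databases (SQLite) for unit tests: typically much faster than PostgreSQL, though SQL dialect differences mean some queries still need testing against the real engine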
- Lazy generation — only generate data for the specific test scenario, not a full database
- Caching — cache factory outputs that are expensive to generate (e.g., bcrypt-hashed passwords)
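The first strategy, pre-generating seed data, might look like the sketch below: generate once with a fixed seed, write the result to a fixture file that is committed to the repo, and load that file on later runs. The file layout, record shape, and function names are assumptions made for illustration.

```python
import json
import random
from pathlib import Path


def generate_users(count: int, seed: int) -> list[dict]:
    """Deterministically generate simple user records."""
    rng = random.Random(seed)
    return [
        {"id": i, "email": f"user{i}@example.com", "age": rng.randint(18, 90)}
        for i in range(1, count + 1)
    ]


def load_or_create_seed(path: Path, count: int = 1000, seed: int = 1234) -> list[dict]:
    """Load the committed fixture if present; otherwise generate it once and write it out."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    users = generate_users(count, seed)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(users, indent=2), encoding="utf-8")
    return users
```

In CI the file already exists, so the run pays only for JSON parsing. Regenerating is a matter of deleting the fixture and running once locally.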
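```typescript
import bcrypt from 'bcrypt'; // assumed package; bcryptjs exposes the same hashSync call
import { Factory } from 'fishery';

// Cache expensive operations: hash once at module load, not once per built user
const hashedPassword = bcrypt.hashSync('testpassword', 10);

const userFactory = Factory.define<User>(() => ({
  passwordHash: hashedPassword, // reuse the cached hash
  // ...
}));
```

Summary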
Synthetic test data generation is now a mature discipline with excellent tooling. The right approach depends on your scale and compliance requirements:
- Start with Faker + factories for most teams
- Add Gretel or Tonic when compliance requirements demand it
- Build domain-specific generators for specialized data formats
- Use state machine generation for complex temporal scenarios
Good synthetic data lets your tests run faster, stay compliant, and remain independent of production — which is exactly where you want them.