Python Faker Library: Generate Realistic Test Data in Seconds

Hardcoded test data is one of the most persistent sources of flaky tests and maintenance burden in software projects. When you write name = "John Doe" in a test fixture, you've coupled your test to a specific string that means nothing. When you write email = "test@example.com" across 40 test files, you've created a fragile lattice that breaks the moment uniqueness constraints matter.

The Python Faker library solves this by generating realistic, varied test data on demand. It's not a toy — it's a production-grade tool used in test suites at scale, and understanding it deeply will change how you write tests.

Installing Faker

pip install faker

That's it. No C extensions, no system dependencies. The library is pure Python and installs in seconds. For projects using pyproject.toml:

[project.optional-dependencies]
test = [
    "faker>=24.0.0",
    "pytest>=8.0.0",
]

Basic Usage: Your First Faker Instance

from faker import Faker

fake = Faker()

print(fake.name())          # "Jennifer Martinez"
print(fake.email())         # "michael.chen@example.org"
print(fake.address())       # "742 Evergreen Terrace\nSpringfield, OR 97477"
print(fake.phone_number())  # "(503) 555-0142"
print(fake.company())       # "Initech Solutions LLC"

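Each call generates a new random value. The data is plausible: names look like names and phone numbers follow real formats. Do not expect consistency across fields, though. In the default en_US locale the city, state, and ZIP code of an address are drawn independently, so the ZIP code will not necessarily belong to the state shown.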

Common Providers

Faker organizes its generators into providers. Each provider covers a domain. Here's what you'll reach for most:

Personal Data

fake.name()           # "Dr. Sarah O'Brien"
fake.first_name()     # "Marcus"
fake.last_name()      # "Johansson"
fake.prefix()         # "Ms."
fake.suffix()         # "Jr."
fake.user_name()      # "sarah.obrien92"
fake.password()       # "K#9mPxQr2!"
fake.ssn()            # "743-22-1801"
fake.date_of_birth(minimum_age=18, maximum_age=75)  # datetime.date(1978, 3, 15)

Contact and Location

fake.email()                    # "user@domain.com"
fake.safe_email()               # "user@example.com" (safe for tests)
fake.free_email()               # "user@gmail.com"
fake.phone_number()             # "(212) 555-8432"
fake.address()                  # Full multi-line address
fake.street_address()           # "1847 Oak Street"
fake.city()                     # "Portland"
fake.state()                    # "Oregon"
fake.state_abbr()               # "OR"
fake.zipcode()                  # "97201"
fake.country()                  # "United States"
fake.latitude()                 # Decimal('45.523064')
fake.longitude()                # Decimal('-122.676483')

Internet and Tech

fake.url()              # "https://www.example-site.org/path"
fake.domain_name()      # "example-corp.net"
fake.ipv4()             # "192.168.1.142"
fake.ipv6()             # "2001:db8::1428:57ab"
fake.mac_address()      # "a1:b2:c3:d4:e5:f6"
fake.user_agent()       # "Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."
fake.slug()             # "realistic-test-data-generation"
fake.uuid4()            # "550e8400-e29b-41d4-a716-446655440000"

Financial

fake.credit_card_number()          # "4532015112830366"
fake.credit_card_provider()        # "Visa"
fake.credit_card_expire()          # "03/28"
fake.credit_card_security_code()   # "412"
fake.iban()                        # "GB29NWBK60161331926819"
fake.currency_code()               # "USD"
fake.pricetag()                    # "$42.99"

Text and Lorem

fake.word()                     # "synthesize"
fake.words(nb=5)                # ["cloud", "agile", "leverage", "synergy", "pivot"]
fake.sentence()                 # "The quick brown fox jumps over..."
fake.sentences(nb=3)            # List of 3 sentences
fake.paragraph()                # A full paragraph
fake.text(max_nb_chars=200)     # Text truncated to 200 chars

Dates and Times

fake.date()                              # "2023-07-14"
fake.date_time()                         # datetime(2023, 7, 14, 10, 23, 47)
fake.date_time_between(                  # Within a range
    start_date="-1y",
    end_date="now"
)
fake.future_date()                       # A date in the future
fake.past_date()                         # A date in the past
fake.time_zone()                         # "America/Los_Angeles"
fake.unix_time()                         # 1689332627

Seeding for Reproducible Data

Random data is great until a test fails and you can't reproduce it. Faker supports seeding, which makes the same sequence of calls produce the same results every time.

from faker import Faker

fake = Faker()
fake.seed_instance(12345)

print(fake.name())   # Always the same name with this seed (for a given Faker version)
print(fake.email())  # Always the same email with this seed

Important: seed_instance() seeds this specific instance. Use it when you need reproducibility scoped to one test. For global seeding (affects all instances), use Faker.seed(12345) — but be careful, this affects the shared random state and can cause surprising cross-test interactions.

A practical pattern for reproducible tests:

import pytest
from faker import Faker

@pytest.fixture
def fake():
    f = Faker()
    f.seed_instance(0)  # Deterministic within this test
    return f

def test_user_creation(fake):
    user_data = {
        "name": fake.name(),
        "email": fake.email(),
    }
    # Same data every time this test runs
    user = create_user(**user_data)
    assert user.id is not None

Creating Custom Providers

The built-in providers cover a lot, but your domain has specific vocabulary. Maybe you need fake product SKUs, internal department codes, or realistic-looking API keys. Custom providers let you extend Faker with your domain's language.

from faker import Faker
from faker.providers import BaseProvider

class ProductProvider(BaseProvider):
    CATEGORIES = ["Electronics", "Clothing", "Books", "Home", "Sports"]
    BRANDS = ["Acme", "Globex", "Initech", "Umbrella", "Weyland"]

    def product_sku(self):
        category_code = self.random_element(["EL", "CL", "BK", "HM", "SP"])
        number = self.numerify("######")
        return f"{category_code}-{number}"

    def product_name(self):
        brand = self.random_element(self.BRANDS)
        adjective = self.random_element(["Premium", "Pro", "Elite", "Basic", "Ultra"])
        noun = self.random_element(["Widget", "Gadget", "Device", "Tool", "Kit"])
        return f"{brand} {adjective} {noun}"

    def product_category(self):
        return self.random_element(self.CATEGORIES)


fake = Faker()
fake.add_provider(ProductProvider)

print(fake.product_sku())       # "EL-483920"
print(fake.product_name())      # "Acme Pro Widget"
print(fake.product_category())  # "Electronics"

Providers can also group related generators around a single domain concept, such as orders:

class OrderProvider(BaseProvider):
    def order_status(self):
        return self.random_element([
            "pending", "processing", "shipped", "delivered", "cancelled"
        ])

    def tracking_number(self):
        carrier = self.random_element(["UPS", "FEDEX", "USPS"])
        number = self.numerify("1Z###############")
        return f"{carrier}{number}"

Integrating Faker with pytest Fixtures

The real power of Faker emerges when you wire it into pytest's fixture system. Instead of one-off calls, you build composable data factories that your entire test suite can consume.

# conftest.py
import pytest
from faker import Faker

fake = Faker()


@pytest.fixture
def fake_user():
    return {
        "id": fake.uuid4(),
        "name": fake.name(),
        "email": fake.safe_email(),
        "username": fake.user_name(),
        "phone": fake.phone_number(),
        "created_at": fake.date_time_between(start_date="-2y", end_date="now").isoformat(),
        "is_active": True,
    }


@pytest.fixture
def fake_address():
    return {
        "street": fake.street_address(),
        "city": fake.city(),
        "state": fake.state_abbr(),
        "zip": fake.zipcode(),
        "country": "US",
    }


@pytest.fixture
def fake_user_with_address(fake_user, fake_address):
    return {**fake_user, "address": fake_address}

Now your tests become readable:

def test_user_profile_update(client, fake_user_with_address, auth_headers):
    response = client.put(
        f"/api/users/{fake_user_with_address['id']}",
        json=fake_user_with_address,
        headers=auth_headers,
    )
    assert response.status_code == 200
    assert response.json()["email"] == fake_user_with_address["email"]

For fixtures that need to create database records, combine with a cleanup mechanism:

@pytest.fixture
def db_user(db_session, fake_user):
    user = User(**fake_user)
    db_session.add(user)
    db_session.commit()
    yield user
    db_session.delete(user)
    db_session.commit()

Faker with factory_boy

factory_boy is a fixtures library that pairs naturally with Faker. Where pytest fixtures are function-level, factory_boy factories are class-level — you define a factory once and instantiate it anywhere.

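pip install factory-boy

import factory
from factory import Faker as FactoryFaker
from faker import Faker
from myapp.models import User, Order, Product

# Standalone Faker instance used by the LazyFunction lambdas below
fake = Faker()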


class UserFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = User

    name = FactoryFaker("name")
    email = FactoryFaker("safe_email")
    username = FactoryFaker("user_name")
    phone = FactoryFaker("phone_number")
    is_active = True
    date_joined = FactoryFaker("date_time_this_year")


class ProductFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = Product

    name = FactoryFaker("catch_phrase")
    price = factory.LazyFunction(lambda: round(fake.pyfloat(min_value=1, max_value=500), 2))
    sku = factory.Sequence(lambda n: f"SKU-{n:06d}")
    category = FactoryFaker("word")


class OrderFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = Order

    user = factory.SubFactory(UserFactory)
    product = factory.SubFactory(ProductFactory)
    quantity = factory.LazyFunction(lambda: fake.random_int(min=1, max=10))
    status = factory.LazyFunction(lambda: fake.random_element(["pending", "shipped", "delivered"]))
    created_at = FactoryFaker("date_time_this_month")

Using factories in tests:

import pytest


def test_order_total_calculation():
    product = ProductFactory(price=29.99)
    order = OrderFactory(product=product, quantity=3)
    # Compare floats with a tolerance rather than exact equality
    assert order.total == pytest.approx(89.97)


def test_bulk_order_history():
    user = UserFactory()
    orders = OrderFactory.create_batch(10, user=user)
    assert len(user.orders.all()) == 10

Bulk Data Generation for Database Seeding

Sometimes you need thousands of realistic records — for performance testing, staging environment setup, or development databases. Faker handles bulk generation efficiently.

import json
from faker import Faker

fake = Faker()


def generate_users(count: int) -> list[dict]:
    users = []
    seen_emails = set()

    for _ in range(count):
        email = fake.safe_email()
        # Ensure uniqueness without database round-trips
        while email in seen_emails:
            email = fake.safe_email()
        seen_emails.add(email)

        users.append({
            "name": fake.name(),
            "email": email,
            "username": fake.user_name(),
            "bio": fake.text(max_nb_chars=160),
            "location": f"{fake.city()}, {fake.state_abbr()}",
            "created_at": fake.date_time_between(
                start_date="-3y", end_date="now"
            ).isoformat(),
        })

    return users


def generate_products(count: int) -> list[dict]:
    return [
        {
            "id": i + 1,
            "name": fake.catch_phrase(),
            "description": fake.paragraph(nb_sentences=3),
            "price": round(fake.pyfloat(min_value=5, max_value=1000), 2),
            "stock": fake.random_int(min=0, max=500),
            "category": fake.random_element([
                "Electronics", "Books", "Clothing", "Home", "Sports"
            ]),
        }
        for i in range(count)
    ]


if __name__ == "__main__":
    users = generate_users(10_000)
    products = generate_products(1_000)

    with open("seed_users.json", "w") as f:
        json.dump(users, f, indent=2)

    with open("seed_products.json", "w") as f:
        json.dump(products, f, indent=2)

    print(f"Generated {len(users)} users and {len(products)} products")

For SQL databases, combine with psycopg2 or sqlalchemy and use batch inserts:

from sqlalchemy.orm import Session

def seed_database(session: Session, user_count: int = 1000):
    users = [User(**u) for u in generate_users(user_count)]
    session.bulk_save_objects(users)
    session.commit()
    print(f"Seeded {user_count} users")

Locale-Specific Data

One of Faker's most underused features is locale support. If your app serves international users, your tests should reflect that.

from faker import Faker

# German locale
de_fake = Faker("de_DE")
print(de_fake.name())           # "Hans-Werner Müller"
print(de_fake.address())        # "Hauptstraße 42\n10115 Berlin"
print(de_fake.phone_number())   # "+49 30 12345678"
print(de_fake.iban())           # "DE89370400440532013000"

# Japanese locale
ja_fake = Faker("ja_JP")
print(ja_fake.name())           # "田中 太郎"
print(ja_fake.address())        # "東京都渋谷区..."
print(ja_fake.phone_number())   # "090-1234-5678"

# Brazilian Portuguese
pt_fake = Faker("pt_BR")
print(pt_fake.name())           # "João da Silva"
print(pt_fake.cpf())            # "123.456.789-09" (Brazilian tax ID)

# Mixed locales in one instance
multi_fake = Faker(["en_US", "fr_FR", "de_DE", "ja_JP"])
# Randomly switches between locales
for _ in range(5):
    print(multi_fake.name())

Locale-aware testing catches real bugs — date format assumptions, character encoding issues, right-to-left text handling, and address field length constraints.

Performance Considerations

Faker is fast for unit and integration tests, but at scale you need to be deliberate.

Profile before optimizing. For most test suites, Faker's overhead is negligible. The bottleneck is almost always I/O (database writes, HTTP calls), not data generation.

Avoid per-call instantiation. Create one Faker() instance and reuse it:

# Bad — creates a new instance (and loads all providers) every call
def get_fake_user():
    fake = Faker()  # Expensive!
    return {"name": fake.name()}

# Good — module-level instance
fake = Faker()

def get_fake_user():
    return {"name": fake.name()}

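Use list comprehensions for bulk generation. They skip the repeated users.append lookup and call, so they are slightly faster than explicit loops, although the Faker calls themselves still dominate the runtime: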

# Slower
users = []
for i in range(10000):
    users.append({"name": fake.name(), "email": fake.email()})

# Faster
users = [{"name": fake.name(), "email": fake.email()} for _ in range(10000)]

Cache expensive operations. If you're generating the same kind of data repeatedly, consider pre-generating a pool and sampling from it:

# Pre-generate a pool for high-volume tests
USER_POOL = [
    {"name": fake.name(), "email": fake.safe_email()}
    for _ in range(10_000)
]

def get_fake_user():
    return fake.random_element(USER_POOL)

Load only the providers you need. By default, Faker() loads every standard provider. Passing a providers list restricts the instance to the modules you name and speeds up instantiation, though in practice this matters only if you're creating many instances:

from faker import Faker

# Only the person and internet providers are loaded; methods from other
# providers (fake.address(), fake.company(), ...) are not available.
fake = Faker(providers=["faker.providers.person", "faker.providers.internet"])

Putting It All Together: A Realistic Test Dataset

Here's a complete example that generates a coherent dataset — users, products, and orders that reference each other correctly:

from faker import Faker

fake = Faker()
fake.seed_instance(42)  # Reproducible dataset


def generate_test_dataset(
    user_count: int = 100,
    product_count: int = 50,
    order_count: int = 500,
) -> dict:
    users = [
        {
            "id": i + 1,
            "name": fake.name(),
            "email": fake.safe_email(),
            "created_at": fake.date_time_between("-2y", "now").isoformat(),
        }
        for i in range(user_count)
    ]

    products = [
        {
            "id": i + 1,
            "name": fake.catch_phrase(),
            "price": round(fake.pyfloat(min_value=1, max_value=500), 2),
            "category": fake.random_element(["A", "B", "C", "D"]),
        }
        for i in range(product_count)
    ]

    user_ids = [u["id"] for u in users]
    product_ids = [p["id"] for p in products]

    orders = [
        {
            "id": i + 1,
            "user_id": fake.random_element(user_ids),
            "product_id": fake.random_element(product_ids),
            "quantity": fake.random_int(min=1, max=5),
            "status": fake.random_element(["pending", "shipped", "delivered"]),
            "created_at": fake.date_time_between("-1y", "now").isoformat(),
        }
        for i in range(order_count)
    ]

    return {"users": users, "products": products, "orders": orders}

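This dataset is seeded, internally consistent (orders reference valid user and product IDs), and realistic enough to catch real bugs. The names, emails, and IDs repeat exactly from run to run. The created_at values are computed relative to "now", so they shift with the clock unless you pass fixed start and end dates.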

The Python Faker library removes the excuse for lazy test data. There's no reason to write "test@test.com" when fake.safe_email() takes the same effort and produces data that actually exercises your system. Start with the basics, build custom providers for your domain, and wire it into pytest fixtures — your test suite will be more thorough, more maintainable, and more honest about what it's testing.

By HelpMeTest