Property-Based and Fuzz Testing in Python with Hypothesis
Hypothesis is the most accessible entry point to property-based and fuzz testing for Python developers. Unlike AFL++ or libFuzzer, which require C/C++ targets and complex toolchains, Hypothesis works with any Python function, integrates with pytest, and produces minimal counterexamples automatically. It has been finding real bugs in production Python code since 2013.
What Is Property-Based Testing?
Traditional unit tests check specific examples:
```python
def test_sort():
    assert sorted([3, 1, 2]) == [1, 2, 3]
    assert sorted([]) == []
    assert sorted([1]) == [1]
```

Property-based tests check properties that should hold for any input:

```python
from collections import Counter

from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sort_properties(items):
    result = sorted(items)
    # Property 1: same length
    assert len(result) == len(items)
    # Property 2: sorted order
    assert all(result[i] <= result[i + 1] for i in range(len(result) - 1))
    # Property 3: same elements (multiset equality, so duplicates count too)
    assert Counter(result) == Counter(items)
    # Bonus: sorting is idempotent
    assert sorted(result) == result
```

Hypothesis generates hundreds of inputs, trying to falsify your properties. When it finds a failure, it shrinks the input to the minimal counterexample: not the randomly generated one, but the simplest possible input that triggers the bug.
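To see this in action, here is a deliberately false property (an illustration, not from any real codebase). Hypothesis will falsify it and shrink the failing input to something minimal, typically a two-element list like [0, 0]:

```python
from hypothesis import given
from hypothesis import strategies as st

# False property: lists may contain duplicates, so this fails,
# and Hypothesis shrinks the counterexample before reporting it.
@given(st.lists(st.integers()))
def test_all_elements_unique(items):
    assert len(set(items)) == len(items)
```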
Installation
```bash
pip install hypothesis pytest
```

For extras:

```bash
pip install hypothesis[numpy]   # NumPy array strategies
pip install hypothesis[pandas]  # Pandas DataFrame strategies
pip install hypothesis[django]  # Django model strategies
```

Core Strategies
Strategies are generators that produce test data. They're composable — build complex structures from simple ones.
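You can explore any strategy interactively before writing tests; .example() draws a single value (it is meant for REPL or notebook use only, and Hypothesis will complain if you call it inside a test):

```python
from hypothesis import strategies as st

# Interactive exploration only; do not call .example() inside @given tests.
print(st.lists(st.integers(), min_size=3).example())
# Output varies from run to run, e.g. [-207, 5, 5]
```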
Basic Strategies
```python
from hypothesis import strategies as st

# Primitive types
st.integers()                            # All integers
st.integers(min_value=0, max_value=100)  # Bounded integers
st.floats()                              # Including NaN and infinity
st.floats(allow_nan=False, allow_infinity=False)
st.text()                                # Unicode strings
st.text(alphabet=st.characters(whitelist_categories=['L', 'N']))  # Letters and digits only
st.booleans()
st.none()

# Collections
st.lists(st.integers())
st.lists(st.integers(), min_size=1, max_size=10)
st.tuples(st.integers(), st.text())
st.dictionaries(keys=st.text(), values=st.integers())
st.frozensets(st.integers())

# Sampling
st.sampled_from(['red', 'green', 'blue'])
st.one_of(st.integers(), st.text(), st.none())  # Any of these types
```

Composition

```python
# Nested structures
st.lists(st.dictionaries(
    keys=st.text(min_size=1, max_size=20),
    values=st.one_of(st.integers(), st.text(), st.none())
))

# Chaining with map and filter
st.integers().map(abs)                      # Always non-negative
st.integers().filter(lambda x: x % 2 == 0)  # Only even numbers

# Building objects
st.builds(MyClass,
    id=st.integers(min_value=1),
    name=st.text(min_size=1)
)
```
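MyClass above is a stand-in. A self-contained sketch using a dataclass (User is illustrative, not from the article) shows how st.builds fits into a test:

```python
from dataclasses import dataclass

from hypothesis import given
from hypothesis import strategies as st

@dataclass
class User:  # illustrative class, not from the article
    id: int
    name: str

@given(st.builds(User, id=st.integers(min_value=1), name=st.text(min_size=1)))
def test_user_fields(user):
    assert user.id >= 1
    assert len(user.name) >= 1
```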
Custom Strategies with @composite

```python
from datetime import date, timedelta

from hypothesis import given
from hypothesis import strategies as st
from hypothesis.strategies import composite

@composite
def valid_email(draw):
    """Generate syntactically valid email addresses."""
    user = draw(st.text(
        alphabet=st.characters(whitelist_categories=['L', 'N']),
        min_size=1, max_size=20
    ))
    domain = draw(st.text(
        alphabet=st.characters(whitelist_categories=['L', 'N']),
        min_size=2, max_size=10
    ))
    tld = draw(st.sampled_from(['com', 'org', 'net', 'io', 'co']))
    return f"{user}@{domain}.{tld}"

@composite
def valid_date_range(draw):
    """Generate a start date that is always before the end date."""
    start = draw(st.dates(min_value=date(2000, 1, 1), max_value=date(2030, 12, 31)))
    days = draw(st.integers(min_value=1, max_value=365))
    end = start + timedelta(days=days)
    return start, end

@given(valid_date_range())
def test_date_range_logic(date_range):
    start, end = date_range
    assert start < end  # Always true by construction
    schedule = create_schedule(start, end)  # create_schedule is the code under test
    assert len(schedule) >= 1
```
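Composite strategies are used exactly like the built-ins. A sketch, assuming a hypothetical is_valid_email validator of your own:

```python
from hypothesis import given

@given(valid_email())
def test_validator_accepts_generated_emails(email):
    # is_valid_email is a hypothetical function under test
    assert is_valid_email(email)
```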
Writing Effective Properties

The hardest part of property-based testing is identifying what to test. Here are some common patterns:
Round-Trip Properties
If encode/decode are inverse operations, round-trips should be identity:
```python
import json

from hypothesis import given
from hypothesis import strategies as st

@given(st.recursive(
    st.one_of(
        st.none(),
        st.booleans(),
        st.integers(-10**6, 10**6),
        st.floats(allow_nan=False, allow_infinity=False),
        st.text()
    ),
    lambda children: st.one_of(
        st.lists(children, max_size=5),
        st.dictionaries(st.text(), children, max_size=5)
    ),
    max_leaves=20
))
def test_json_round_trip(value):
    """JSON encode/decode must be identity for serializable values."""
    assert json.loads(json.dumps(value)) == value
```
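Round trips show up everywhere: serializers, encoders, ORMs. Another example, using the standard library's base64 pair (an addition for illustration, not from the article):

```python
import base64

from hypothesis import given
from hypothesis import strategies as st

@given(st.binary())
def test_base64_round_trip(data):
    # Encoding then decoding must reproduce the original bytes exactly
    assert base64.b64decode(base64.b64encode(data)) == data
```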
Idempotency

Applying an operation twice should give the same result as applying it once:
```python
@given(st.text())
def test_normalize_idempotent(text):
    # normalize is the function under test
    once = normalize(text)
    twice = normalize(once)
    assert once == twice, f"normalize not idempotent: {text!r} → {once!r} → {twice!r}"

@given(st.lists(st.integers()))
def test_sort_idempotent(items):
    assert sorted(sorted(items)) == sorted(items)
```
Commutativity / Associativity

Operations that should be order-independent:
```python
@given(st.integers(), st.integers())
def test_addition_commutative(a, b):
    assert a + b == b + a

@given(st.sets(st.integers()), st.sets(st.integers()))
def test_union_commutative(a, b):
    assert a | b == b | a
```
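The heading promises associativity too; a sketch for integer addition (note that float addition is famously not associative, which makes floats a good target for this kind of test):

```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.integers(), st.integers(), st.integers())
def test_addition_associative(a, b, c):
    # Grouping must not matter for integer addition
    assert (a + b) + c == a + (b + c)
```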
Invariant Preservation

Operations should preserve certain invariants:
```python
@given(st.lists(st.integers()), st.integers())
def test_insert_increases_length(items, new_item):
    collection = MyCollection(items)  # MyCollection is the class under test
    original_size = len(collection)
    collection.insert(new_item)
    assert len(collection) == original_size + 1, "insert must increase size by 1"
    assert new_item in collection, "inserted item must be findable"

@given(st.text(), st.text())
def test_string_concat_length(a, b):
    assert len(a + b) == len(a) + len(b)
```
Comparative / Differential Testing

Two implementations must agree:
```python
import re

import regex  # Third-party, mostly re-compatible engine with more features

from hypothesis import given
from hypothesis import strategies as st

PATTERN = r"(\w+)\s*=\s*(\d+)"  # Any fixed, valid pattern works here

@given(st.text())
def test_regex_engines_agree(text):
    expected = re.match(PATTERN, text)
    actual = regex.match(PATTERN, text)
    # Both engines should match, or both should fail
    assert bool(expected) == bool(actual)
    if expected and actual:
        assert expected.group(0) == actual.group(0)
```
Settings and Configuration

```python
from hypothesis import HealthCheck, Phase, given, settings
from hypothesis import strategies as st

@given(st.text())
@settings(
    max_examples=1000,    # More examples = more thorough
    deadline=5000,        # Max 5 seconds per example (value is in ms)
    suppress_health_check=[HealthCheck.too_slow],
    phases=tuple(Phase),  # Run all phases, including shrinking (the default)
)
def test_with_custom_settings(text):
    process(text)  # process is the function under test

# Profile-based settings for different environments
settings.register_profile("ci", max_examples=200)
settings.register_profile("dev", max_examples=50)
settings.register_profile("thorough", max_examples=10000)
settings.load_profile("ci")  # Or set HYPOTHESIS_PROFILE=ci
```
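A conventional home for profile registration is conftest.py, keyed off an environment variable (a common project pattern rather than built-in Hypothesis behavior):

```python
# conftest.py
import os

from hypothesis import settings

settings.register_profile("ci", max_examples=200)
settings.register_profile("dev", max_examples=50)
settings.register_profile("thorough", max_examples=10000)

# Pick the profile from the environment, defaulting to "dev"
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```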
The Hypothesis Database

Hypothesis remembers failing examples in a local database (by default, .hypothesis/examples/ in your project directory). The next time you run the test, it replays known failures first. This means:
- Regressions are caught immediately
- CI doesn't need to re-discover known bugs
- Shrunk minimal examples are stored
```python
# Configure the database location explicitly
from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

settings.register_profile("ci",
    database=DirectoryBasedExampleDatabase(".hypothesis")
)
```

Commit .hypothesis/ to git to share known examples with your team.
Stateful Testing
For testing stateful systems (queues, databases, UIs), Hypothesis can generate sequences of operations:
```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, initialize, invariant, rule

class BankAccountMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.balance = 0
        self.account = BankAccount()  # BankAccount is the class under test

    @initialize(amount=st.integers(min_value=1, max_value=10000))
    def create_account(self, amount):
        """Initialize with a starting balance."""
        self.balance = amount
        self.account.deposit(amount)

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def deposit(self, amount):
        self.balance += amount
        self.account.deposit(amount)

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def withdraw(self, amount):
        if amount <= self.balance:
            self.balance -= amount
            self.account.withdraw(amount)

    @invariant()
    def balance_matches(self):
        """At all times, the model must match the actual account."""
        assert self.account.get_balance() == self.balance, \
            f"Balance mismatch: model={self.balance}, actual={self.account.get_balance()}"

    @invariant()
    def balance_non_negative(self):
        assert self.balance >= 0, "Balance went negative"

# pytest automatically discovers this as a test
TestBankAccount = BankAccountMachine.TestCase
```

Stateful testing generates random sequences of operations and checks that invariants hold throughout. This is extraordinarily effective for finding bugs in state machines.
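The machine above drives a BankAccount class the article never defines; a minimal sketch that satisfies the interface, so the example runs end to end:

```python
class BankAccount:
    """Toy implementation for illustration only."""

    def __init__(self):
        self._balance = 0

    def deposit(self, amount):
        self._balance += amount

    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

    def get_balance(self):
        return self._balance
```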
Django and Database Testing
```python
from decimal import Decimal

from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.django import TestCase, from_model

from myapp.models import Product

class ProductPropertyTests(TestCase):
    @given(from_model(
        Product,
        name=st.text(min_size=1, max_size=200),
        price=st.decimals(min_value=Decimal('0.01'), max_value=Decimal('9999.99'), places=2),
    ))
    def test_saved_product_has_slug(self, product):
        # from_model creates real database rows and cleans up
        assert product.pk is not None
        assert len(product.slug) > 0
```

from_model creates real Django model instances, respecting constraints and foreign keys. It cleans up after each test.
Shrinking: Why It Matters
When Hypothesis finds a failure, it shrinks the input to the minimal counterexample. This is one of its most valuable features:
```python
@given(st.lists(st.integers()))
def test_sum_of_list(items):
    result = sum_naive(items)  # sum_naive is the function under test
    assert result == sum(items)
```

If sum_naive has a bug on lists with more than 1000 elements, Hypothesis won't just show you the 10,000-element list it randomly generated. It will show you [0, 0, ..., 0, 1] with exactly 1001 elements: the smallest input that triggers the bug.
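For concreteness, a sketch of what such a bug might look like (hypothetical code, not from any real library):

```python
def sum_naive(items):
    # Hypothetical bug: an internal "buffer" limit silently
    # drops everything after the first 1000 elements.
    total = 0
    for x in items[:1000]:
        total += x
    return total
```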
This makes debugging dramatically faster. Most bugs have minimal triggers that a human would never generate manually.
Fuzzing vs Property Testing
Hypothesis blurs the line between fuzzing and property testing:
- Fuzzing: "Generate random inputs, look for crashes"
- Property testing: "Generate random inputs, check these properties"
Hypothesis does both. When you write @given(st.text()) with a function that just calls your_function(text) without assertions, you're fuzzing for crashes. When you add assertions, you're property testing.
```python
# Fuzzing mode: just looking for uncaught exceptions
@given(st.binary())
def fuzz_deserializer(data):
    try:
        deserialize(data)  # deserialize is the function under test
    except (ValueError, TypeError):
        pass  # Expected errors are OK; anything else is a bug

# Property mode: checking invariants
@given(st.text())
def test_parser_properties(text):
    result = parse(text)  # parse is the function under test
    if result.success:
        assert result.value is not None
        assert serialize(result.value) != ""
```
CI Integration

```yaml
# GitHub Actions
- name: Run Hypothesis tests
  run: pytest tests/ -x --hypothesis-seed=12345
  env:
    HYPOTHESIS_PROFILE: ci

# Cache the database for persistence across runs
- name: Cache Hypothesis database
  uses: actions/cache@v4
  with:
    path: .hypothesis/
    key: hypothesis-${{ hashFiles('tests/**/*.py') }}
```

Using --hypothesis-seed with a fixed seed makes CI runs reproducible; the inputs are still pseudo-random, just deterministic. Different seeds find different bugs, so vary the seed periodically.
Common Gotchas
1. Tests too slow? Use @settings(max_examples=50) locally, 200 in CI.
2. Flaky tests? Usually means your code has implicit assumptions Hypothesis is violating. Add explicit preconditions with assume():
```python
from hypothesis import assume, given
from hypothesis import strategies as st

@given(st.integers(), st.integers())
def test_division(a, b):
    assume(b != 0)  # Skip examples where b is 0
    assert isinstance(a / b, float)
```

3. Need test data that depends on other generated values? Use st.data() to draw interactively inside the test:
```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.data())
def test_related_data(data):
    n = data.draw(st.integers(min_value=1, max_value=100))
    items = data.draw(st.lists(st.integers(), min_size=n, max_size=n))
    assert len(items) == n  # items is always exactly length n
```

From Testing to Production Monitoring
Hypothesis tests properties that always hold, but production has data and conditions your tests can't generate: database state built up over years, third-party API behavior, concurrent user interactions.
HelpMeTest runs continuous end-to-end tests against your live application 24/7. While Hypothesis catches algorithmic bugs, HelpMeTest catches integration failures, UI regressions, and runtime degradation that only show up in production. Write tests in plain English — no code required. $100/month flat.
Summary
Hypothesis is the most powerful Python testing library most developers aren't using. It:
- Finds bugs that example-based tests miss
- Shrinks failures to minimal, readable counterexamples
- Works with any Python function, with or without assertions
- Integrates seamlessly with pytest
- Provides stateful testing for complex, multi-step scenarios
Start with your parsing functions, data validators, and any function that transforms data. Add @given(st.text()) and run pytest. Hypothesis will find something surprising — it always does.
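A minimal starting point, assuming a parse_config function of your own (the module path and failure mode here are placeholders):

```python
from hypothesis import given
from hypothesis import strategies as st

from myproject.config import parse_config  # hypothetical function under test

@given(st.text())
def test_parse_config_never_crashes(text):
    try:
        parse_config(text)
    except ValueError:
        pass  # Documented failure mode; any other exception is a bug
```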