Property-Based and Fuzz Testing in Python with Hypothesis
Hypothesis is the most accessible entry point to property-based and fuzz testing for Python developers. Unlike AFL++ or libFuzzer, which require C/C++ targets and complex toolchains, Hypothesis works with any Python function, integrates with pytest, and produces minimal counterexamples automatically. It has been finding real bugs in production Python code since 2013.
What Is Property-Based Testing?
Traditional unit tests check specific examples:
```python
def test_sort():
    assert sorted([3, 1, 2]) == [1, 2, 3]
    assert sorted([]) == []
    assert sorted([1]) == [1]
```

Property-based tests check properties that should hold for any input:

```python
from collections import Counter

from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_sort_properties(items):
    result = sorted(items)
    # Property 1: same length
    assert len(result) == len(items)
    # Property 2: sorted order
    assert all(result[i] <= result[i + 1] for i in range(len(result) - 1))
    # Property 3: same elements (multiset equality, so duplicates count too)
    assert Counter(result) == Counter(items)
    # Bonus: sorting is idempotent
    assert sorted(result) == result
```

Hypothesis generates hundreds of inputs, trying to falsify your properties. When it finds a failure, it shrinks the input to the minimal counterexample: not the randomly generated one, but the simplest possible input that triggers the bug.
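To see this in action, here is a deliberately false property (an illustration, not from any real codebase). Hypothesis will falsify it and shrink the failing input to something minimal, typically a two-element list like [0, 0]:

```python
from hypothesis import given
from hypothesis import strategies as st

# False property: lists may contain duplicates, so this fails,
# and Hypothesis shrinks the counterexample before reporting it.
@given(st.lists(st.integers()))
def test_all_elements_unique(items):
    assert len(set(items)) == len(items)
```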
Installation
```bash
pip install hypothesis pytest
```

For extras:

```bash
pip install hypothesis[numpy]   # NumPy array strategies
pip install hypothesis[pandas]  # Pandas DataFrame strategies
pip install hypothesis[django]  # Django model strategies
```

Core Strategies
Strategies are generators that produce test data. They're composable — build complex structures from simple ones.
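You can explore any strategy interactively before writing tests; .example() draws a single value (it is meant for REPL or notebook use only, and Hypothesis will complain if you call it inside a test):

```python
from hypothesis import strategies as st

# Interactive exploration only; do not call .example() inside @given tests.
print(st.lists(st.integers(), min_size=3).example())
# Output varies from run to run, e.g. [-207, 5, 5]
```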
Basic Strategies
```python
from hypothesis import strategies as st

# Primitive types
st.integers()                            # All integers
st.integers(min_value=0, max_value=100)  # Bounded integers
st.floats()                              # Including NaN and infinity
st.floats(allow_nan=False, allow_infinity=False)
st.text()                                # Unicode strings
st.text(alphabet=st.characters(whitelist_categories=['L', 'N']))  # Letters and digits only
st.booleans()
st.none()

# Collections
st.lists(st.integers())
st.lists(st.integers(), min_size=1, max_size=10)
st.tuples(st.integers(), st.text())
st.dictionaries(keys=st.text(), values=st.integers())
st.frozensets(st.integers())

# Sampling
st.sampled_from(['red', 'green', 'blue'])
st.one_of(st.integers(), st.text(), st.none())  # Any of these types
```

Composition

```python
# Nested structures
st.lists(st.dictionaries(
    keys=st.text(min_size=1, max_size=20),
    values=st.one_of(st.integers(), st.text(), st.none())
))

# Chaining with map and filter
st.integers().map(abs)                      # Always non-negative
st.integers().filter(lambda x: x % 2 == 0)  # Only even numbers

# Building objects
st.builds(MyClass,
    id=st.integers(min_value=1),
    name=st.text(min_size=1)
)
```
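MyClass above is a stand-in. A self-contained sketch using a dataclass (User is illustrative, not from the article) shows how st.builds fits into a test:

```python
from dataclasses import dataclass

from hypothesis import given
from hypothesis import strategies as st

@dataclass
class User:  # illustrative class, not from the article
    id: int
    name: str

@given(st.builds(User, id=st.integers(min_value=1), name=st.text(min_size=1)))
def test_user_fields(user):
    assert user.id >= 1
    assert len(user.name) >= 1
```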
Custom Strategies with @composite

```python
from datetime import date, timedelta

from hypothesis import given
from hypothesis import strategies as st
from hypothesis.strategies import composite

@composite
def valid_email(draw):
    """Generate syntactically valid email addresses."""
    user = draw(st.text(
        alphabet=st.characters(whitelist_categories=['L', 'N']),
        min_size=1, max_size=20
    ))
    domain = draw(st.text(
        alphabet=st.characters(whitelist_categories=['L', 'N']),
        min_size=2, max_size=10
    ))
    tld = draw(st.sampled_from(['com', 'org', 'net', 'io', 'co']))
    return f"{user}@{domain}.{tld}"

@composite
def valid_date_range(draw):
    """Generate a start date that is always before the end date."""
    start = draw(st.dates(min_value=date(2000, 1, 1), max_value=date(2030, 12, 31)))
    days = draw(st.integers(min_value=1, max_value=365))
    end = start + timedelta(days=days)
    return start, end

@given(valid_date_range())
def test_date_range_logic(date_range):
    start, end = date_range
    assert start < end  # Always true by construction
    schedule = create_schedule(start, end)  # create_schedule is the code under test
    assert len(schedule) >= 1
```
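Composite strategies are used exactly like the built-ins. A sketch, assuming a hypothetical is_valid_email validator of your own:

```python
from hypothesis import given

@given(valid_email())
def test_validator_accepts_generated_emails(email):
    # is_valid_email is a hypothetical function under test
    assert is_valid_email(email)
```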
Writing Effective Properties

The hardest part of property-based testing is identifying what to test. Here are some common patterns:
Round-Trip Properties
If encode/decode are inverse operations, round-trips should be identity:
```python
import json

from hypothesis import given
from hypothesis import strategies as st

@given(st.recursive(
    st.one_of(
        st.none(),
        st.booleans(),
        st.integers(-10**6, 10**6),
        st.floats(allow_nan=False, allow_infinity=False),
        st.text()
    ),
    lambda children: st.one_of(
        st.lists(children, max_size=5),
        st.dictionaries(st.text(), children, max_size=5)
    ),
    max_leaves=20
))
def test_json_round_trip(value):
    """JSON encode/decode must be identity for serializable values."""
    assert json.loads(json.dumps(value)) == value
```
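Round trips show up everywhere: serializers, encoders, ORMs. Another example, using the standard library's base64 pair (an addition for illustration, not from the article):

```python
import base64

from hypothesis import given
from hypothesis import strategies as st

@given(st.binary())
def test_base64_round_trip(data):
    # Encoding then decoding must reproduce the original bytes exactly
    assert base64.b64decode(base64.b64encode(data)) == data
```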
Idempotency

Applying an operation twice should give the same result as applying it once:
```python
@given(st.text())
def test_normalize_idempotent(text):
    # normalize is the function under test
    once = normalize(text)
    twice = normalize(once)
    assert once == twice, f"normalize not idempotent: {text!r} → {once!r} → {twice!r}"

@given(st.lists(st.integers()))
def test_sort_idempotent(items):
    assert sorted(sorted(items)) == sorted(items)
```
Commutativity / Associativity

Operations that should be order-independent:
```python
@given(st.integers(), st.integers())
def test_addition_commutative(a, b):
    assert a + b == b + a

@given(st.sets(st.integers()), st.sets(st.integers()))
def test_union_commutative(a, b):
    assert a | b == b | a
```
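The heading promises associativity too; a sketch for integer addition (note that float addition is famously not associative, which makes floats a good target for this kind of test):

```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.integers(), st.integers(), st.integers())
def test_addition_associative(a, b, c):
    # Grouping must not matter for integer addition
    assert (a + b) + c == a + (b + c)
```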
Invariant Preservation

Operations should preserve certain invariants:
```python
@given(st.lists(st.integers()), st.integers())
def test_insert_increases_length(items, new_item):
    collection = MyCollection(items)  # MyCollection is the class under test
    original_size = len(collection)
    collection.insert(new_item)
    assert len(collection) == original_size + 1, "insert must increase size by 1"
    assert new_item in collection, "inserted item must be findable"

@given(st.text(), st.text())
def test_string_concat_length(a, b):
    assert len(a + b) == len(a) + len(b)
```
Comparative / Differential Testing

Two implementations must agree:
```python
import re

import regex  # Third-party, mostly re-compatible engine with more features

from hypothesis import given
from hypothesis import strategies as st

PATTERN = r"(\w+)\s*=\s*(\d+)"  # Any fixed, valid pattern works here

@given(st.text())
def test_regex_engines_agree(text):
    expected = re.match(PATTERN, text)
    actual = regex.match(PATTERN, text)
    # Both engines should match, or both should fail
    assert bool(expected) == bool(actual)
    if expected and actual:
        assert expected.group(0) == actual.group(0)
```
Settings and Configuration

```python
from hypothesis import HealthCheck, Phase, given, settings
from hypothesis import strategies as st

@given(st.text())
@settings(
    max_examples=1000,    # More examples = more thorough
    deadline=5000,        # Max 5 seconds per example (value is in ms)
    suppress_health_check=[HealthCheck.too_slow],
    phases=tuple(Phase),  # Run all phases, including shrinking (the default)
)
def test_with_custom_settings(text):
    process(text)  # process is the function under test

# Profile-based settings for different environments
settings.register_profile("ci", max_examples=200)
settings.register_profile("dev", max_examples=50)
settings.register_profile("thorough", max_examples=10000)
settings.load_profile("ci")  # Or set HYPOTHESIS_PROFILE=ci
```
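A conventional home for profile registration is conftest.py, keyed off an environment variable (a common project pattern rather than built-in Hypothesis behavior):

```python
# conftest.py
import os

from hypothesis import settings

settings.register_profile("ci", max_examples=200)
settings.register_profile("dev", max_examples=50)
settings.register_profile("thorough", max_examples=10000)

# Pick the profile from the environment, defaulting to "dev"
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```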
The Hypothesis Database

Hypothesis remembers failing examples in a local database (by default, .hypothesis/examples/ in your project directory). The next time you run the test, it replays known failures first. This means:
- Regressions are caught immediately
- CI doesn't need to re-discover known bugs
- Shrunk minimal examples are stored
```python
# Configure the database location explicitly
from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

settings.register_profile("ci",
    database=DirectoryBasedExampleDatabase(".hypothesis")
)
```

Commit .hypothesis/ to git to share known examples with your team.
Stateful Testing
For testing stateful systems (queues, databases, UIs), Hypothesis can generate sequences of operations:
```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, initialize, invariant, rule

class BankAccountMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.balance = 0
        self.account = BankAccount()  # BankAccount is the class under test

    @initialize(amount=st.integers(min_value=1, max_value=10000))
    def create_account(self, amount):
        """Initialize with a starting balance."""
        self.balance = amount
        self.account.deposit(amount)

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def deposit(self, amount):
        self.balance += amount
        self.account.deposit(amount)

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def withdraw(self, amount):
        if amount <= self.balance:
            self.balance -= amount
            self.account.withdraw(amount)

    @invariant()
    def balance_matches(self):
        """At all times, the model must match the actual account."""
        assert self.account.get_balance() == self.balance, \
            f"Balance mismatch: model={self.balance}, actual={self.account.get_balance()}"

    @invariant()
    def balance_non_negative(self):
        assert self.balance >= 0, "Balance went negative"

# pytest automatically discovers this as a test
TestBankAccount = BankAccountMachine.TestCase
```

Stateful testing generates random sequences of operations and checks that invariants hold throughout. This is extraordinarily effective for finding bugs in state machines.
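The machine above drives a BankAccount class the article never defines; a minimal sketch that satisfies the interface, so the example runs end to end:

```python
class BankAccount:
    """Toy implementation for illustration only."""

    def __init__(self):
        self._balance = 0

    def deposit(self, amount):
        self._balance += amount

    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

    def get_balance(self):
        return self._balance
```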
Django and Database Testing
```python
from decimal import Decimal

from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.django import TestCase, from_model

from myapp.models import Product

class ProductPropertyTests(TestCase):
    @given(from_model(
        Product,
        name=st.text(min_size=1, max_size=200),
        price=st.decimals(min_value=Decimal('0.01'), max_value=Decimal('9999.99'), places=2),
    ))
    def test_saved_product_has_slug(self, product):
        # from_model creates real database rows and cleans up
        assert product.pk is not None
        assert len(product.slug) > 0
```

from_model creates real Django model instances, respecting constraints and foreign keys. It cleans up after each test.
Shrinking: Why It Matters
When Hypothesis finds a failure, it shrinks the input to the minimal counterexample. This is one of its most valuable features:
```python
@given(st.lists(st.integers()))
def test_sum_of_list(items):
    result = sum_naive(items)  # sum_naive is the function under test
    assert result == sum(items)
```

If sum_naive has a bug on lists with more than 1000 elements, Hypothesis won't just show you the 10,000-element list it randomly generated. It will show you [0, 0, ..., 0, 1] with exactly 1001 elements: the smallest input that triggers the bug.
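For concreteness, a sketch of what such a bug might look like (hypothetical code, not from any real library):

```python
def sum_naive(items):
    # Hypothetical bug: an internal "buffer" limit silently
    # drops everything after the first 1000 elements.
    total = 0
    for x in items[:1000]:
        total += x
    return total
```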
This makes debugging dramatically faster. Most bugs have minimal triggers that a human would never generate manually.
Fuzzing vs Property Testing
Hypothesis blurs the line between fuzzing and property testing:
- Fuzzing: "Generate random inputs, look for crashes"
- Property testing: "Generate random inputs, check these properties"
Hypothesis does both. When you write @given(st.text()) with a function that just calls your_function(text) without assertions, you're fuzzing for crashes. When you add assertions, you're property testing.
```python
# Fuzzing mode: just looking for uncaught exceptions
@given(st.binary())
def fuzz_deserializer(data):
    try:
        deserialize(data)  # deserialize is the function under test
    except (ValueError, TypeError):
        pass  # Expected errors are OK; anything else is a bug

# Property mode: checking invariants
@given(st.text())
def test_parser_properties(text):
    result = parse(text)  # parse is the function under test
    if result.success:
        assert result.value is not None
        assert serialize(result.value) != ""
```
CI Integration

```yaml
# GitHub Actions
- name: Run Hypothesis tests
  run: pytest tests/ -x --hypothesis-seed=12345
  env:
    HYPOTHESIS_PROFILE: ci

# Cache the database for persistence across runs
- name: Cache Hypothesis database
  uses: actions/cache@v4
  with:
    path: .hypothesis/
    key: hypothesis-${{ hashFiles('tests/**/*.py') }}
```

Using --hypothesis-seed with a fixed seed makes CI runs reproducible; the inputs are still pseudo-random, just deterministic. Different seeds find different bugs, so vary the seed periodically.
Common Gotchas
1. Tests too slow? Use @settings(max_examples=50) locally, 200 in CI.
2. Flaky tests? Usually means your code has implicit assumptions Hypothesis is violating. Add explicit preconditions with assume():
```python
from hypothesis import assume, given
from hypothesis import strategies as st

@given(st.integers(), st.integers())
def test_division(a, b):
    assume(b != 0)  # Skip examples where b is 0
    assert isinstance(a / b, float)
```

3. Need test data that depends on other generated values? Use st.data() to draw interactively inside the test:
```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.data())
def test_related_data(data):
    n = data.draw(st.integers(min_value=1, max_value=100))
    items = data.draw(st.lists(st.integers(), min_size=n, max_size=n))
    assert len(items) == n  # items is always exactly length n
```

From Testing to Production Monitoring
Hypothesis tests properties that always hold, but production has data and conditions your tests can't generate: database state built up over years, third-party API behavior, concurrent user interactions.
HelpMeTest runs continuous end-to-end tests against your live application 24/7. While Hypothesis catches algorithmic bugs, HelpMeTest catches integration failures, UI regressions, and runtime degradation that only show up in production. Write tests in plain English — no code required. $100/month flat.
Summary
Hypothesis is the most powerful Python testing library most developers aren't using. It:
- Finds bugs that example-based tests miss
- Shrinks failures to minimal, readable counterexamples
- Works with any Python function, with or without assertions
- Integrates seamlessly with pytest
- Provides stateful testing for complex, multi-step scenarios
Start with your parsing functions, data validators, and any function that transforms data. Add @given(st.text()) and run pytest. Hypothesis will find something surprising — it always does.
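A minimal starting point, assuming a parse_config function of your own (the module path and failure mode here are placeholders):

```python
from hypothesis import given
from hypothesis import strategies as st

from myproject.config import parse_config  # hypothetical function under test

@given(st.text())
def test_parse_config_never_crashes(text):
    try:
        parse_config(text)
    except ValueError:
        pass  # Documented failure mode; any other exception is a bug
```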