Property-Based and Model Checking Hybrid Testing Strategies
Property-based testing and model checking are two of the most powerful tools in a test engineer's toolkit, and most teams use neither. The teams that do use them tend to use only one, treating them as alternatives. But they solve different problems and work best together. This post explains what each tool does, when to use each, and how to combine them into a hybrid strategy that catches both design flaws and implementation bugs.
What Property-Based Testing Does
Property-based testing (PBT) — popularized by QuickCheck for Haskell, and now available in every major language — inverts the way you write tests. Instead of specifying f(3) == 7, you specify a property: for all x, f(x) >= 0. The framework generates hundreds or thousands of random inputs and checks whether the property holds for each one. When it finds a failure, it shrinks the input to the minimal counterexample.
Python's Hypothesis is the most sophisticated PBT library available today:
from hypothesis import given, strategies as st
@given(st.lists(st.integers()))
def test_sorted_is_sorted(lst):
result = my_sort(lst)
assert result == sorted(lst)
@given(st.lists(st.integers()))
def test_sorted_same_elements(lst):
result = my_sort(lst)
assert sorted(result) == sorted(lst)
assert len(result) == len(lst)The first property checks output correctness against a reference implementation. The second checks that sorting does not lose or duplicate elements. Hypothesis generates thousands of inputs including edge cases: empty lists, lists with one element, lists with all identical elements, lists with large negative numbers, lists already sorted in reverse order.
PBT is excellent at finding implementation bugs in isolated functions and components. It works at the unit test level and integrates naturally with your existing test infrastructure.
What Model Checking Does
Model checking explores all possible states of a system up to a given bound. You write a formal model of your system (not the code — the design), specify properties, and the model checker verifies those properties for every possible execution path.
The key difference from PBT: model checking verifies the design, while PBT tests the implementation. Model checking is exhaustive within its scope; PBT is statistical. Model checking can find bugs that require a specific sequence of 20 operations across 4 processes; PBT cannot easily express multi-step concurrent scenarios.
As covered in earlier posts, the main model checking tools are TLA+ (distributed systems), SPIN (concurrent systems), and Alloy (structural/relational designs).
The Gap Between the Two
Both tools have blind spots:
PBT blind spots:
- Concurrent execution: PBT typically runs in a single thread and cannot find race conditions
- System-level state machines: PBT operates on functions, not on multi-step protocol behaviors
- Design bugs: PBT tests code, not the algorithm the code implements
Model checking blind spots:
- Implementation correctness: the model is an abstraction; code bugs in the abstracted-away parts are invisible
- Integration behavior: model checking does not test real I/O, network, or third-party APIs
- Scope limits: model checking verifies within a finite scope, not universally
The hybrid strategy uses each tool where it excels.
Generating Test Cases from Models
One of the most powerful hybrid techniques is generating PBT inputs or test cases directly from a model checker's output.
When TLA+ checks a property and finds a counterexample, it produces a trace — a specific sequence of actions and states. This trace is exactly the edge case that breaks your invariant. You can translate that trace into a test:
Suppose your TLA+ model of a distributed lock finds a counterexample: two nodes acquire the lock simultaneously when the network partitions and then reconnects. The TLA+ trace shows:
- Node A acquires lock, state =
{holder: A} - Network partition between A and B
- Lock times out from B's perspective
- Node B acquires lock, state =
{holder: B}(B thinks A is gone) - Network heals
- Both A and B believe they hold the lock
This trace becomes a test in your integration test suite:
def test_split_brain_lock_on_network_heal():
cluster = TestCluster(nodes=2)
lock = DistributedLock(cluster)
# Step 1: A acquires lock
assert lock.acquire(node='A')
# Step 2-3: Partition and timeout
cluster.partition(node_a='A', node_b='B')
time.sleep(lock.timeout + 1)
# Step 4: B acquires lock
assert lock.acquire(node='B')
# Step 5: Network heals
cluster.heal()
# Step 6: Verify only one holder (property from TLA+)
holders = lock.get_holders()
assert len(holders) == 1, f"Split brain: both {holders} hold the lock"This test is directly derived from the formal counterexample. It ensures the specific scenario found by model checking is covered in the implementation test suite.
Modeling Your System State Machine for PBT
Another hybrid technique is using your model as a reference implementation for PBT stateful testing.
Hypothesis has a stateful module that tests state machines:
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
import hypothesis.strategies as st
class LockModel(RuleBasedStateMachine):
"""Model of distributed lock behavior"""
def __init__(self):
super().__init__()
self.holder = None
self.lock = DistributedLock() # real implementation
@rule(node=st.sampled_from(['A', 'B', 'C']))
def acquire(self, node):
result = self.lock.acquire(node)
if self.holder is None:
assert result == True
self.holder = node
else:
assert result == False # lock already held
@rule()
def release(self):
if self.holder is not None:
self.lock.release(self.holder)
self.holder = None
@invariant()
def at_most_one_holder(self):
actual_holders = self.lock.get_holders()
if self.holder is None:
assert len(actual_holders) == 0
else:
assert actual_holders == [self.holder]
LockTest = LockModel.TestCaseHypothesis generates random sequences of acquire and release calls, checking the invariant after each step. This catches implementation bugs in the real DistributedLock that the model checking would not find (because the model is an abstraction).
The model itself — self.holder — is a simplified reference implementation derived from your TLA+ or Alloy model. The invariant comes directly from the model's safety property.
Bridging Formal Methods and Unit Testing in Practice
Here is a practical workflow that integrates both tools:
Phase 1: Design (before code)
Write a TLA+ or Alloy spec of the system. Check it with the model checker. Fix any design bugs found. This phase typically takes a day or two for a feature-sized component.
Phase 2: Derive properties (before code)
From the model's invariants and safety properties, derive a list of testable properties. Each invariant in the model becomes a @invariant in a Hypothesis stateful test. Each safety property becomes a @given test.
Phase 3: Implement (TDD)
Implement the code. Run the PBT tests. They will initially fail because the code is new.
Phase 4: Lock in counterexample scenarios (after model checking)
For any counterexamples found during Phase 1, write deterministic regression tests (not PBT — these scenarios should always run). These are the specific paths the model checker found.
Phase 5: CI
Run both model checking (for small spec sizes) and PBT in CI. The model checker verifies the design has not changed. The PBT tests verify the implementation matches the design.
When to Use Each
A practical decision tree:
Question: Are you dealing with concurrency or distributed systems? → Yes → Model checking (TLA+/SPIN) before writing code. PBT for the implementation. → No → Skip to next question.
Question: Is the correctness property about structure or relationships (access control, data model)? → Yes → Alloy for design verification. PBT for implementation. → No → Skip to next question.
Question: Can you express the property as "for all inputs, this function satisfies X"? → Yes → PBT is the right tool. Write Hypothesis tests. → No → You may need a model checker or a formal specification.
Question: Is this safety-critical and small enough for full verification? → Yes → Consider Isabelle/HOL for the critical core. → No → PBT with coverage-guided fuzzing (covered in the fuzzing cluster).
QuickCheck for Haskell and Scala
The original QuickCheck (Haskell) has the most mature ecosystem for deriving generators and integrating with proof assistants. Haskell's type system makes writing generators natural:
import Test.QuickCheck
prop_sortIdempotent :: [Int] -> Bool
prop_sortIdempotent xs = sort (sort xs) == sort xs
prop_sortLength :: [Int] -> Bool
prop_sortLength xs = length (sort xs) == length xs
prop_sortSorted :: [Int] -> Bool
prop_sortSorted xs = isSorted (sort xs)
where
isSorted [] = True
isSorted [_] = True
isSorted (x:y:ys) = x <= y && isSorted (y:ys)
main :: IO ()
main = do
quickCheck prop_sortIdempotent
quickCheck prop_sortLength
quickCheck prop_sortSortedScalaCheck is the Scala equivalent, integrated with ScalaTest:
import org.scalacheck.Properties
import org.scalacheck.Prop.forAll
object SortSpec extends Properties("Sort") {
property("idempotent") = forAll { (xs: List[Int]) =>
sort(sort(xs)) == sort(xs)
}
property("length preserved") = forAll { (xs: List[Int]) =>
sort(xs).length == xs.length
}
}The Payoff
The hybrid approach catches a class of bugs that neither tool finds alone:
- Model checking found a theoretical counterexample to your protocol → you have a regression test that prevents that exact scenario from regressing in code
- PBT found a failing input to your implementation → you add that input to your model to check whether the model predicts the failure (if not, the model is incomplete)
- An invariant in your model → a PBT invariant in your stateful test → automatic verification that code matches design
Teams that adopt this approach report finding bugs earlier, having more confidence in complex code, and spending less time debugging production incidents. The total time investment is front-loaded (writing the model and the PBT tests takes time), but the ongoing debugging cost drops significantly.
The two tools are not alternatives — they are complementary layers of a verification strategy. Use the model checker to verify your thinking. Use PBT to verify your code matches your thinking.