TLA+ Model Checking for Distributed Systems Testing

TLA+ Model Checking for Distributed Systems Testing

Distributed systems fail in ways that are nearly impossible to reproduce with conventional tests. Race conditions, partial failures, network partitions, and subtle ordering bugs hide in the gaps between nodes, surfacing only under specific interleavings that happen to occur in production at 3 AM. TLA+ is a formal specification language that lets you express the behavior of your system mathematically and then exhaustively check every possible state your system can reach. It has found real bugs in systems like Amazon DynamoDB, Azure Cosmos DB, MongoDB, and the Raft consensus protocol. This post explains how to use it.

What TLA+ Actually Is

TLA+ (Temporal Logic of Actions) was created by Leslie Lamport, the same person who invented the Lamport timestamp, Paxos, and LaTeX. It is not a programming language. It is a mathematical notation for describing state machines — systems that have a set of variables (the state) and actions that transition from one state to another.

The key insight is that any concurrent or distributed system can be modeled as a non-deterministic state machine. Two nodes sending messages to each other, a client retrying a failed request, a lock being acquired and released — all of these are state transitions. TLA+ lets you write down exactly what those transitions are, and then the TLC model checker explores every possible sequence of transitions to verify that certain properties always hold.

There are two kinds of properties you check:

Safety properties: Something bad never happens. "Two nodes never believe they are the leader at the same time." "A value is never overwritten once committed." These are the absence of bad states.

Liveness properties: Something good eventually happens. "A request eventually gets a response." "A leader is eventually elected." These require temporal logic to express.

Writing Your First TLA+ Spec

TLA+ specs are plain text files with a .tla extension. The TLC model checker is a Java application. The TLA+ Toolbox is an IDE, but VS Code with the TLA+ extension works well too.

Here is a minimal spec for a simple counter that must never go negative:

---- MODULE Counter ----
EXTENDS Naturals

VARIABLES count

Init == count = 0

Increment == count' = count + 1

Decrement == count > 0 /\ count' = count - 1

Next == Increment \/ Decrement

Spec == Init /\ [][Next]_count

NeverNegative == count >= 0

====

The Init predicate describes the initial state. Next is a disjunction of all possible actions. Spec is the complete behavior: start in Init, then always take a Next step or stutter (stay in place). NeverNegative is the invariant we want to verify.

For realistic distributed systems, the specs get much larger, but the structure is the same.

Modeling the Raft Consensus Protocol

Raft is a consensus algorithm designed to be understandable. Diego Ongaro, one of Raft's authors, wrote a TLA+ spec as part of his dissertation. Let me walk through the key ideas.

A Raft cluster has N servers, each in one of three states: follower, candidate, or leader. The model tracks several variables:

VARIABLES
  currentTerm,   \* current term for each server
  state,         \* follower, candidate, or leader
  votedFor,      \* who each server voted for in each term
  log,           \* the replicated log
  commitIndex,   \* highest committed log index
  \* ... network message queues, etc.

The critical safety property is ElectionSafety: at most one leader per term.

ElectionSafety ==
  \A e1, e2 \in elections :
    e1.eterm = e2.eterm => e1.eleader = e2.eleader

And LogMatching: if two logs agree at index i, they agree on all entries before i.

LogMatching ==
  \A i \in DOMAIN log[s1] :
    \A j \in DOMAIN log[s2] :
      i = j /\ log[s1][i].term = log[s2][i].term =>
        SubSeq(log[s1], 1, i) = SubSeq(log[s2], 1, j)

When TLC checks this spec with 3 servers and terms up to 3, it explores millions of states and verifies these properties hold. The original spec actually found several subtle issues in early drafts of the Raft paper — cases where the description was ambiguous and one interpretation violated safety.

Leader Election: A Worked Example

Let us build a simpler leader election spec from scratch to understand the methodology.

We have N nodes. Each has a unique ID. They exchange RequestVote and GrantVote messages. A node wins election if it gets a quorum.

---- MODULE LeaderElection ----
EXTENDS Naturals, FiniteSets, TLC

CONSTANTS Nodes
ASSUME IsFiniteSet(Nodes)
Quorum == {Q \in SUBSET Nodes : Cardinality(Q) * 2 > Cardinality(Nodes)}

VARIABLES
  state,      \* "follower" | "candidate" | "leader"
  votes,      \* set of nodes that voted for each candidate
  msgs        \* in-flight messages

Init ==
  /\ state = [n \in Nodes |-> "follower"]
  /\ votes = [n \in Nodes |-> {}]
  /\ msgs = {}

BecomeCandidate(n) ==
  /\ state[n] = "follower"
  /\ state' = [state EXCEPT ![n] = "candidate"]
  /\ msgs' = msgs \cup {[type |-> "RequestVote", from |-> n, to |-> m] : m \in Nodes \ {n}}
  /\ UNCHANGED votes

GrantVote(n, m) ==
  /\ [type |-> "RequestVote", from |-> m, to |-> n] \in msgs
  /\ state[n] = "follower"
  /\ votes' = [votes EXCEPT ![m] = votes[m] \cup {n}]
  /\ UNCHANGED <<state, msgs>>

BecomeLeader(n) ==
  /\ state[n] = "candidate"
  /\ \E Q \in Quorum : Q \subseteq votes[n]
  /\ state' = [state EXCEPT ![n] = "leader"]
  /\ UNCHANGED <<votes, msgs>>

Next ==
  \/ \E n \in Nodes : BecomeCandidate(n)
  \/ \E n, m \in Nodes : GrantVote(n, m)
  \/ \E n \in Nodes : BecomeLeader(n)

\* Safety: at most one leader at any time
AtMostOneLeader ==
  Cardinality({n \in Nodes : state[n] = "leader"}) <= 1

====

This spec, even simplified, lets TLC find the bug if you introduce it. If you remove the quorum check and allow any node with any votes to become leader, TLC immediately finds a trace where two nodes both become leaders simultaneously.

Running TLC from the Command Line

The TLA+ Toolbox is a GUI, but for CI you want the command line. Download tla2tools.jar from the TLA+ GitHub releases.

Create a model configuration file MC.cfg:

SPECIFICATION Spec
INVARIANT AtMostOneLeader
INVARIANT ElectionSafety
PROPERTY Liveness
CONSTANTS
  Nodes = {n1, n2, n3}

Run TLC:

java -jar tla2tools.jar -config MC.cfg LeaderElection.tla

For larger models, run with more workers and heap:

java -Xmx8g -jar tla2tools.jar -workers 8 -config MC.cfg LeaderElection.tla

TLC will print the number of states explored, any violated properties, and — critically — a counterexample trace showing exactly the sequence of actions that led to the violation.

Integrating TLA+ into CI

You can run TLC in GitHub Actions. The workflow looks like this:

name: TLA+ Model Checking
on: [push, pull_request]
jobs:
  model-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-java@v3
        with:
          java-version: '17'
          distribution: 'temurin'
      - name: Download TLA+ tools
        run: |
          wget -q https://github.com/tlaplus/tlaplus/releases/latest/download/tla2tools.jar
      - name: Run TLC on Raft spec
        run: |
          java -Xmx4g -jar tla2tools.jar \
            -workers 4 \
            -config specs/raft/MC.cfg \
            specs/raft/Raft.tla
      - name: Run TLC on leader election spec
        run: |
          java -Xmx4g -jar tla2tools.jar \
            -workers 4 \
            -config specs/election/MC.cfg \
            specs/election/LeaderElection.tla

The key limitation: model checking is exhaustive up to a bound. You set the size of your model (number of nodes, maximum term, etc.) and TLC checks all states within that bound. For Raft with 3 nodes and terms up to 5, this can take minutes to hours. The strategy is to start small and gradually increase the bounds.

Finding Real Bugs

Amazon's use of TLA+ is the most-cited industry case study. Their engineers found 10 bugs in new algorithms before any code was written — including one in a distributed transaction protocol that would have caused data loss. The bugs only manifested in sequences of 35+ steps, far beyond what any test or code review would catch.

The pattern is always the same: a developer describes the algorithm informally, believes it is correct, writes a TLA+ spec, runs TLC, and finds a violation in minutes. The counterexample trace shows exactly why the algorithm is wrong and what the fix should be.

For your own systems, the best candidates for TLA+ are:

  • Consensus protocols you have built or modified
  • Distributed lock implementations
  • Leader election logic
  • Saga/compensation workflows
  • Cache invalidation protocols
  • Replication lag handling

The workflow is: write the spec, run TLC with small parameters, verify properties hold, then increase parameters until you hit memory limits. If TLC finds a violation, fix the spec (fix your design), then verify again.

TLA+ vs Writing More Tests

This is the most common question. TLA+ and tests serve different purposes and are not substitutes.

Tests verify that specific inputs produce specific outputs. They test code. TLA+ verifies that your algorithm is correct across all possible interleavings, regardless of implementation. TLA+ catches design bugs that no test can catch, because the bug is in the logic of the protocol, not the code.

The right approach is both. Use TLA+ to verify protocol correctness before writing code. Use tests to verify the implementation matches the spec. Some teams generate test cases from TLA+ traces — the counterexample paths that TLC finds are exactly the edge cases worth testing.

Practical Tips for Getting Started

Start with existing specs. The TLA+ examples repository on GitHub has specs for Paxos, Raft, two-phase commit, and dozens of other protocols. Read them, run TLC on them, break them intentionally, and see what TLC reports.

Keep your first specs simple. Model the algorithm, not the implementation. Leave out networking details, serialization, and error handling until you have the core logic right.

Use PlusCal for complex algorithms. PlusCal is a pseudocode language that compiles to TLA+. It is easier to write for imperative algorithms and is used in many AWS specs.

The TLA+ Video Course by Lamport himself is free and excellent. The Practical TLA+ book by Hillel Wayne is the best practical guide.

Formal methods have a reputation for being academic and impractical. The engineers at Amazon, Microsoft, and MongoDB who use TLA+ daily would disagree. For the kinds of bugs that distributed systems produce, it is the most effective tool available.

Read more