Ernie

Most bugs aren't logic errors. They're branches you forgot to prune.

aitypescriptsoftware-engineeringstate-managementfeature-flagstypesformal-methodspostgresformal-verificationrails

A small blob pruning the dead branches off a living tree, a neat pile of cut deadwood beside it — the whole article in one picture.

Contents

ArticleWidget case) or a plain ; keep it a no-op until the file lands so the .md stays portable. -->

Twenty flags is a million timelines

Open your feature flag dashboard. Count the flags older than six months you "meant to clean up." I'll wait.

Now multiply them. Each flag forks the timeline — on or off. One flag, two timelines. Two, four. Three, eight. It doubles every single time.

Your feature flags

20 flags

20 flags

1,048,576

possible configurations — doubling with every flag

Testable? nonobody writes 1,048,576 test cases

Test one configuration per second, around the clock: about 12 days to get through them.

What actually gets tested: the dozen you flip on purpose. The rest just exist — in production.

Drag it to your team's number. I'll wait for that too.

Nobody tests a million timelines. You test the dozen you flip on purpose for cohorts. The rest just exist. In production. Never seen by QA, waiting for the right user to load the right page in the right order.

I keep coming back to Knight Capital. 2012: $440 million gone in 45 minutes. One feature flag, eight servers. Seven got the new deploy, one didn't — and when the flag flipped on, server eight ran eight-year-old dead code nobody had pruned. The whole state space was sixteen combinations. They'd tested one. Forty-five minutes ended the company. The full postmortem is worth your time.

And the bug never looks like a bug. It looks like "customer X is seeing the wrong price." You chase it as a race condition until someone pulls the flag history and sees this exact three-flag combo has never happened in the system's life, never will again after Friday's release, and is breaking precisely one invoice right now. Nobody designed that state. Every flag did what it said on the toggle. The combination was a timeline no human ever imagined.

Nobody told me this in school. Most bugs aren't logic errors. They're states you forgot existed — branches of the timeline nobody meant to leave open.

And it isn't just flags. A form with 15 optional fields has 32,768 presence-and-absence combinations. A user model with isAdmin, isVerified, isBanned, isDeleted has 16 states, about four of which make sense. Subscription billing with its out-of-order Stripe webhooks, a background job that retries through a half-finished write, a distributed system where every message can arrive twice or never — same shape every time. The set of states your code can be in dwarfs the set it actually handles.

The gap between the two is where every bug lives.

An enormous tree of bare dead branches with a single small living leafy cluster in the middle; a tiny blob points up at it — almost everything you can build is dead wood, only a little is valid.

And you can't test your way out — the math is against you, always has been. Knight was the cheap version; about a quarter of cloud outages trace back to misconfiguration nobody tested in combination. The state space expands by default; your ability to verify it grows linearly at best. The gap widens every time someone ships a feature.


Now the AI is branching it too

Here's what changed in 2026: you're not the only one forking the timeline anymore.

An LLM writes half your code now, and nobody reads every line — that's the point of generating it. The branches multiply faster than any human can review. Reviewing-by-reading, the thing we've leaned on since forever, is quietly over.

So stop trying to read every branch. Prune the ones that should never exist — make them impossible to even write down. Done right, the state space stops being a liability and becomes the one contract you can still enforce while the model holds the keyboard. A branch the AI can't spell is a bug it can't ship.

(Yes — like the TVA pruning timelines in Loki. That's literally why the tool at the end of this piece is called prune-states.)

That's the whole article: find the branches that shouldn't exist, and cut them. Cheapest cut first.


Pruning means the bad branch can't be spelled

The two ideas under all of this are old, both from the functional programming world. Make illegal states unrepresentable — Scott Wlaschin's phrasing of an idea Yaron Minsky was teaching a decade earlier: design the shape itself so the bad states have no spelling. If a state shouldn't exist, make it so you can't even construct it. Parse, don't validate — Alexis King: check incoming data once, at the boundary, and return a type that carries the proof, so the whole interior of your program runs on values correct by construction and never re-checked.

Same move, two angles: shrink what's representable until it matches what's valid, and do the cut once, at the door. Both work in any modern language — not just Haskell.

Kill the booleans

The classic example. You have an order with three boolean fields:

interface Order {
  isDraft: boolean
  isPublished: boolean
  isArchived: boolean
}

Three booleans. Eight possible states. But only three are valid: draft, published, archived. The other five — isDraft && isPublished, isPublished && isArchived && isDraft, etc. — are nonsense. But your code has to deal with them because the type system says they exist.

Order · 3 status booleans

none?draftpublisheddra+pubarchiveddra+arcpub+arcdra+pub+arc
8 representable3 valid5 nonsense

Three booleans, eight timelines — but only three are real states. The other five are branches nobody designed. Prune them.

The fix is a discriminated union — a type that says "this value is exactly one of these shapes, never a mix." Some languages call it a "tagged union," "sum type," or "variant." Same idea everywhere:

// TypeScript
type Order =
  | { status: "draft"; content: string }
  | { status: "published"; content: string; publishedAt: Date }
  | { status: "archived"; content: string; archivedAt: Date }

Eight states → three. publishedAt only exists when the order is published. archivedAt only exists when it's archived. You can't construct a published-and-draft order because there's no type for that. Not "it's not allowed at runtime" — it does not exist as a type. The bad states aren't rejected; they were never representable to begin with.

And when you switch on this, the compiler makes you handle every case:

// TypeScript — miss a case and it won't compile
function render(order: Order) {
  switch (order.status) {
    case "draft":
      return renderDraft(order.content)
    case "published":
      return renderPublished(order.content, order.publishedAt)
    case "archived":
      return renderArchived(order.content, order.archivedAt)
  }
}

And the same idea works at the database level. SQL is weaker than a type system, but it's enforced on every write, by every client, forever:

-- PostgreSQL — the database refuses to store illegal states
CREATE TABLE orders (
  id           UUID PRIMARY KEY,
  status       TEXT NOT NULL CHECK (status IN ('draft', 'published', 'archived')),
  content      TEXT NOT NULL,
  published_at TIMESTAMPTZ,
  archived_at  TIMESTAMPTZ,
 
  -- Same invariant as the discriminated union, at the persistence layer.
  -- Note the archived branch drops `published_at` — matching the TS
  -- variant where archived doesn't carry publishedAt. If you want to
  -- preserve publish history through archive, model it explicitly.
  CHECK (
    (status = 'draft'     AND published_at IS NULL     AND archived_at IS NULL) OR
    (status = 'published' AND published_at IS NOT NULL AND archived_at IS NULL) OR
    (status = 'archived'  AND published_at IS NULL     AND archived_at IS NOT NULL)
  )
);

Less elegant. Less expressive. But your TypeScript types stop at the network boundary, and your database doesn't. Even if some service in another language forgets the invariant, the CHECK constraint catches it. We'll come back to this.

Add a fourth status? The compiler tells you every place that needs updating. Not "please remember to check everywhere." The code won't build until you do.

Same trick, applied to the running example

The same idea works on whatever multiplicative thing you're modeling. To close the loop on the running example: instead of N independent feature flag booleans, define the valid configurations as a union.

// Before: 4 booleans = 16 possible states
const newUi = client.boolVariation("new-ui", false)
const newPayments = client.boolVariation("new-payments", false)
const fullRedesign = client.boolVariation("full-redesign", false)
const layout = client.stringVariation("layout", "single-page")
 
// What does newPayments=true + newUi=false even mean?
// What about fullRedesign=true + newUi=true?
// Nobody knows. Nobody tested it.
// After: 4 valid configurations
type CheckoutConfig =
  | { variant: "control" }
  | { variant: "new-ui"; layout: "single-page" | "multi-step" }
  | {
      variant: "new-ui-with-payments"
      layout: "single-page" | "multi-step"
      provider: "stripe" | "square"
    }
  | { variant: "full-redesign" }

Sixteen states → four. The provider field only exists on the variant that uses it. layout doesn't exist on control. You can't accidentally check for config.provider when the user is in the control group — TypeScript won't let you.

Do the conversion in one place. A single resolve function maps the raw flag values into a typed config at the boundary; after that, the type system does the rest.

One resolve at the boundary, one exhaustive switch
function resolveCheckout(raw: RawFlags): CheckoutConfig {
  if (flag(raw, "full-redesign")) return { variant: "full-redesign" }
  if (flag(raw, "new-ui") && flag(raw, "new-payments")) {
    return {
      variant: "new-ui-with-payments",
      layout: flagStr(raw, "checkout-layout", "single-page"),
      provider: flagStr(raw, "payment-provider", "stripe"),
    }
  }
  if (flag(raw, "new-ui")) {
    return {
      variant: "new-ui",
      layout: flagStr(raw, "checkout-layout", "single-page"),
    }
  }
  return { variant: "control" }
}
 
// Exhaustive — won't compile if you miss a variant
function render(config: CheckoutConfig) {
  switch (config.variant) {
    case "control":
      return renderLegacy()
    case "new-ui":
      return renderNewUi(config.layout)
    case "new-ui-with-payments":
      return renderFull(config.layout, config.provider)
    case "full-redesign":
      return renderRedesign()
  }
}

Parse, don't validate. At the boundary. Once.

And the invoice bug from the opening? With the flags collapsed into a union, that three-flag configuration no longer has a spelling.


When a type can't make the cut

The type is the cheapest cut, but not everything fits in one. Sometimes the state space is genuinely large and you can't shrink it further; sometimes the business won't let you. So you keep cutting — a small ladder of tools, each one costing more and buying a stronger guarantee than the last:

weaker: legal moves only ──────────────────▶ stronger: proved across all states
state machines  ·  property tests  ·  combinatorial  ·  model checking

A blob reaching for bigger and bigger tools — scissors, then loppers, then a saw — to cut progressively thicker dead branches: the cheapest cut first.

Step 1 — State machines

Same cut, one level up: instead of pruning bad states, prune bad moves. A flat list of booleans lets you set isPublished true and forget to clear isArchived. A state machine declares the legal transitions up front and refuses the rest. Draft can become published. Published can be archived. Archived can't go back to draft. Ever.

Step 1 in Ruby (AASM) and TypeScript (XState)
# Ruby with AASM
class Order < ApplicationRecord
  include AASM
  aasm do
    state :draft, initial: true
    state :published, :archived
 
    event(:publish) { transitions from: :draft, to: :published }
    event(:archive) { transitions from: [:draft, :published], to: :archived }
    # Note: no transition back from :archived. Once archived, terminal.
  end
end
 
order.publish!   # OK
order.archive!   # OK from draft or published
archived_order.publish!  # AASM::InvalidTransition — refused at runtime

The machine is the documentation and the enforcement. Reachable states are finite, declared in one place, and any transition not in the table gets refused. The graph itself is the type. (When a flat list isn't enough, Harel's statecharts add nesting and parallelism — that's where XState's hierarchical states come from.)

Step 2 — Property-based testing

When the space is too big to enumerate and you can't shrink it further, the laziest probe that still finds real bugs is to sample it at random. Instead of "given input X, expect output Y," you declare a property that should hold for any input and let the framework throw hundreds of random cases at it trying to break it. fast-check (TypeScript), Hypothesis (Python), and PropCheck (Ruby) all do this. You're not covering the whole state space — you're sampling it intelligently, which still beats the three cases you'd have picked by hand. And when the thing under test is a state machine rather than a pure function, the stateful variants — Hypothesis's RuleBasedStateMachine, fast-check's model-based commands — random-walk the transitions, not just the inputs: a model checker that samples paths instead of enumerating them.

But random sampling has a blind spot worth seeing. Here's the same 32-configuration space with a two-flag bug planted in it, sampled three ways: exhaustively (all 32), with a covering array (every pair by construction — the next rung), and at random. Flip between them, move the bug, roll the dice:

One space · three strategies

5 flags · 32 configurations

·····
A····
·B···
AB···
··C··
A·C··
·BC··
ABC··
···D·
A··D·
·B·D·
AB·D·
··CD·
A·CD·
·BCD·
ABCD·
····E
A···E
·B··E
AB··E
··C·E
A·C·E
·BC·E
ABC·E
···DE
A··DE
·B·DE
AB·DE
··CDE
A·CDE
·BCDE
ABCDE
tested bug lives here bug caught

The planted bug fires when A = off and D = off — a two-flag interaction, where 70–97% of real-world failures live.

Caught — and it always will be. These 6 rows contain every pair of flag values at least once, so any two-flag bug is in them by construction. Move the bug and watch. (Two-flag bugs are the guarantee — that's what t = 2 buys. A three-flag bug needs strength-3 rows; see the caveat in the prose.)

A property test rolls the dice a few hundred times; the one combination that sinks you might be the roll it never made. The next rung closes exactly that gap — sample systematically instead of randomly, and the buggy pair has nowhere left to hide. That's the guarantee you just watched the covering array keep, every time you moved the bug.

Step 3 — Combinatorial testing

Here's the one piece of good news in the whole article. Go back to the opening. Twenty flags, 1,048,576 configurations, and the despair was "nobody can test a million configs." Turns out you don't have to. Decades of empirical work at NIST found that real bugs cluster at low order: 70–97% of failures are triggered by a single parameter or one pair of them, ~98% by three or fewer, and across every dataset they studied, no real-world fault has ever needed more than six interacting parameters. The million-config space is real, but the bugs hiding in it almost all live in its pairs and triples.

So you don't cover the million — you cover the pairs. A covering array is a small set of test configurations that includes every pair of values at least once, and it stays small: all pairs among those twenty flags fit in as few as seven configurations. 1,048,576 → seven. That's the number the opening was missing. (The three-flag invoice bug from the opening? Bump the strength to t = 3 and its exact combination lands in the rows by construction.) Almost nobody does this: GitLab's open-source repo ships roughly five hundred feature flags, documents and tests them one flag at a time, and has no combinatorial strategy at all — and they're more disciplined about flags than most of us. The generators are off-the-shelf — NIST's own ACTS and Microsoft's PICT on the command line, or a one-line library right inside your test runner (concrete example below).

Don't take the collapse on faith — build it. Drag the flag count and watch 2^N drop to a handful of rows, then click any two columns and confirm all four value pairs are still there:

Build the covering array

5 flags

32

exhaustive — every configuration

6

covering array — every pair

Click any two columns to inspect a pair:

#
1
·
·
·
·
·
2
1
1
1
1
·
3
1
1
·
·
1
4
·
·
1
1
1
5
1
·
1
·
·
6
·
1
·
1
·
Columns A × B contain:00011011

All four present. Pick any other two columns — same story. Every pair of flag values lives in these 6 rows, not the 32 the exhaustive grid would need. At twenty flags that ratio is 1,048,576 → seven.

How covering arrays work, and the tools that build them

Three booleans — eight combinations — need just four rows to hit all pairs:

A B C
0 0 0
0 1 1
1 0 1
1 1 0

Check it: every pair of columns shows all four of 00, 01, 10, 11. The magic is how that scales. Covering-array size grows with the logarithm of the parameter count, and exponentially only in t — the interaction strength you're covering — not in the number of parameters. That's why twenty flags collapse to a handful of rows instead of a million.

You write one text file — PICT, Microsoft's generator, is the usual default:

new-ui:        on, off
new-payments:  on, off
full-redesign: on, off
layout:        single-page, multi-step
 
IF [full-redesign] = "on" THEN [new-ui] = "off";

pict model.txt prints the rows; each row is one test config. Constraints (the IF/THEN) keep it from generating combinations you've already made illegal. NIST's own ACTS tool does the same up to 6-way, with a GUI or CLI.

The easiest path skips the shell-out entirely: a covering-array library generates the rows in-process and hands them straight to your test runner's parametrize hook — no binary, no parsing. In Python with allpairspy:

from allpairspy import AllPairs
import pytest
 
flags = {
    "new_ui":        [True, False],
    "new_payments":  [True, False],
    "full_redesign": [True, False],
}
 
@pytest.mark.parametrize("cfg", AllPairs(list(flags.values())))
def test_checkout(cfg):
    new_ui, new_payments, full_redesign = cfg
    assert checkout_works(new_ui, new_payments, full_redesign)

Three flags, all pairs, in the handful of cases the table above predicts — just a decorator, no new infrastructure. Pick whichever fits your stack:

Where you run itTool
CLI, any languagePICT · NIST ACTS (≤ 6-way)
Python test runnerallpairspy
TypeScript / browsercovertable · pict-pairwise-testing · prune-states (cover)

One mechanical caveat: pairwise is still sampling. It provably covers every pair, but a fault that only fires on a specific three- or four-way combination can slip through (a medical-device study found a real four-way one). Bump to --o:3 for anything safety-critical; the cost is more rows, not a different tool.

That last point — pairwise is sampling, not proof — is also the one most in the spirit of this article: combinatorial testing is the consolation prize. If you can collapse those flags into a discriminated union so the nonsense combinations can't exist, do that — a bug that can't be represented needs no test at all. Covering arrays are for the irreducible remainder: the flags you genuinely can't merge, the legacy config you haven't refactored yet, the third-party surface you don't control.

But even for that remainder, a covering array still only samples. It guarantees every pair — bump the strength and every triple — yet that's a fixed interaction order across a fixed set of parameters, not a proof across every state your system can reach. The four-way bug slips a pairwise array; the bug that needs a specific sequence of transitions was never an "interaction" at all. Which begs the question: what if you could stop sampling — not a few hundred random configs, not just the pairs, but prove the property holds across every reachable state your system has?

Well. There is.

Step 4 — Model checking (formal verification)

The heaviest step. A model checker takes a precise description of your system — every state, every transition — and exhaustively explores all of them. Not random samples. Every reachable state, by construction. When an invariant fails, it hands you the exact sequence of steps that broke it.

Sounds academic until you notice who runs it. AWS has used TLA+ on S3 and DynamoDB since 2011. Azure verifies Cosmos DB the same way. seL4 — a microkernel proved correct down to the assembly — flies in military drones. Not research prototypes. Production systems you depend on right now.

And the lesson isn't "everyone should write TLA+." AWS's 2025 follow-up is blunter than that: their correctness work now spans a whole toolkit — the P language, property-based testing, deterministic simulation — because TLA+ alone didn't scale to thousands of engineers. The takeaway is the one this section is built on: pick the cheapest tool that exhausts your state space.

The entry barrier is one Java JAR. Let me show you the smallest version that makes the point.

Take our Todo machine: four states (active, done, archived, deleted) and the transitions between them. State a property it must never break — once deleted, always deleted — and TLC, the checker, walks every path from the start state and confirms it holds at each one. Four reachable states, all four checked. Not "I tried a handful of cases" — checked by construction, across the entire reachable state space. (One honest caveat: that's a proof of your model, not your system — which is exactly what you asked for. TLAPS proves; TLC checks.)

The payoff lands when someone changes the machine. Imagine a teammate adds an "undo delete" transition, so deleted can go back to active. Looks innocent. Compiles fine in any language. A property test might miss it unless the random generator happens to delete-then-restore in sequence. TLC catches it instantly — and it doesn't just say "failed." It hands you the exact minimal sequence that breaks the property: active → deleted → active, down to the line and column of the offending transition. A model checker doesn't return a red X; it returns a movie of how the system breaks.

Don't take the prose's word for it — here is that exact machine, live. Click Todo — undo-delete bug, or edit the JSON and add the restore transition yourself, and watch the invariant break with the counterexample attached. (Under the hood it hands the same one-step check to Z3, Microsoft Research's solver — think of it as a spell-checker for logical rules. Expand any verdict to see the exact question it's asked.)

prune-states · prove — live

Open full page

A state mapping to [] is terminal. Transitions are state names or { to, event?, when?, not? } when/not name a guard, checked as free booleans (a sound overapproximation, same as the package).

Todo · 4 states

activeinitialdonearchiveddeletedterminal

Assert terminal

“This state should be terminal — prove it.” The escape hatch for invariants the structure no longer shows.

Invariants · 1

  • DeletedIsTerminalproved

    Once kind = deleted, the machine stays there.

    SMT-LIB — reproduce with z3 check.smt2
    ; DeletedIsTerminal: Once kind = deleted, the machine stays there.
    ; Verification by inductive negation — unsat means proved.
    
    (set-logic ALL)
    
    (declare-datatypes ((State 0)) (((s_active) (s_done) (s_archived) (s_deleted))))
    
    (declare-const kind State)
    (declare-const kind_next State)
    
    ; Transition relation: kind' is determined by kind + guards.
    (assert (or
      (and (= kind s_active) (= kind_next s_done))
      (and (= kind s_active) (= kind_next s_archived))
      (and (= kind s_active) (= kind_next s_deleted))
      (and (= kind s_done) (= kind_next s_archived))
      (and (= kind s_done) (= kind_next s_deleted))
      (and (= kind s_archived) (= kind_next s_deleted))
      (and (= kind s_deleted) (= kind_next s_deleted))))
    
    ; Negation of the invariant — sat means a counterexample exists.
    (assert (= kind s_deleted))
    (assert (not (= kind_next s_deleted)))
    
    (check-sat)
    (get-model)
    

That's the whole pitch, and you don't need to read a line of TLA+ to get it — TLA+'s syntax isn't easy, which is half the reason the tool stays niche. It's only about fifteen lines, though, so if you want to see the spec and the actual output, here it is:

The full TLA+ spec, line by line

Four pieces of vocabulary if you haven't seen TLA+ before:

  • /\ means and.
  • A prime mark like status' means the value of status in the next state.
  • \in means element of.
  • The shape [][P]_v means "always ([]) — in every step of every behavior — the inner condition P holds, considering only steps that may change v." Think of it as a temporal forall: "for every step, this is true." The _status part is the variable the property is about.

The rest reads like predicate logic.

------------------------------ MODULE Todo ------------------------------
VARIABLES status
 
States == {"active", "done", "archived", "deleted"}
 
Init == status = "active"
 
Complete == status = "active"             /\ status' = "done"
Archive  == status \in {"active", "done"} /\ status' = "archived"
Delete   == status' = "deleted"           \* delete works from any state
 
Next == Complete \/ Archive \/ Delete
 
\* The property we want: once deleted, always deleted. This is a
\* temporal property (about behaviors over time), not a state
\* invariant — so it goes under PROPERTY in the .cfg, not INVARIANT.
DeletedIsTerminal == [][status = "deleted" => status' = "deleted"]_status
 
Spec == Init /\ [][Next]_status
=========================================================================

with a one-line Todo.cfg:

SPECIFICATION Spec
PROPERTY DeletedIsTerminal

Run TLC:

Model checking completed. No error has been found.
8 states generated, 4 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 2.

Four reachable states. The exact four. TLC walked every path from active and confirmed DeletedIsTerminal holds across all of them. (For bigger specs you bound the constants and it's "checked up to that bound," but the guarantee is the same in kind.)

Add an undo delete transition — Restore == status' = "active" — and TLC fails instantly, handing back the exact path active → deleted → active, down to the line and column of the offending action. That's the same counterexample the playground above produces live when you add the restore edge yourself.

For a Todo app this is wild overkill — discriminated unions already catch 99% of what a model checker would find in your CRUD code. The mistake is reading that as "model checking is academic." Flip it: for a payment retry protocol, a state machine spanning microservices, a leader election, a permission model — anything where concurrency or partial failure can interleave — a spec is the cheapest place to find the bug. Those interleavings are exactly what no test enumerates and no code review catches by eye. AWS and Azure don't run TLA+ to show off; at that scale the alternative is finding the bug in production, once, expensively.

So treat it as part of the job, not a research detour. If you build or change distributed or safety-critical state machines, writing and checking a small spec belongs in your toolbox right next to writing tests — reached for whenever the state is concurrent, replicated, or irreversible. It's a normal engineering skill, not a PhD.

The expensive half just got cheap

Here's the part that changed recently, and it changes the whole calculus. The cost of formal methods was never the checker — TLC and Z3 are free and finish in seconds at this scale. The cost was the spec: somebody with rare knowledge had to write it. When Hillel Wayne asked "why don't people use formal methods?", his answer was essentially pedagogy — the barrier was learning to write specs, not running them.

That barrier just collapsed, from two directions at once.

From above: LLMs write passable TLA+ now, and there are off-the-shelf Claude Code skills for exactly this loop — write the spec, run TLC, debug the violation. One discipline makes it safe: the checker, not the model, is the ground truth. An LLM will happily emit a plausible spec that's secretly the textbook's Paxos rather than your system — a 2026 study caught a "spec for Etcd's Raft" that was really the Raft paper's appendix. A wrong spec that goes red costs minutes; the quiet risk is the one that goes green, because a checker only vouches for the spec it was given. So generation doesn't remove the human — it moves your job from writing the spec to reading it, which for these state machines takes minutes.

From below: if your state machine already exists in code, the spec doesn't need to be written or generated — it can be extracted, mechanically, with no model in the loop at all. Your aasm block, your XState machine, your reducer already are the spec; a few hundred lines of extractor lift them into a checker's input format. That's the chore I automated — more on prune-states below, or try it right now in the browser playground.

Extraction is also how you ask rude questions of machines you didn't write. I pointed my extractor at GitLab's Ci::Pipeline — 13 statuses, 12 events, the state machine behind every CI run on gitlab.com — and asked the question everyone would bet on: surely canceled is final? Z3 returned the edge that says no: enqueue re-runs even canceled pipelines, because that's what the Retry button is. The machine has zero terminal states, by design. Not a bug — but it falsifies an assumption plenty of cleanup jobs and billing queries would happily ship on top of. That's the other thing a prover is for: not just catching bugs, but breaking the assumptions you'd otherwise build on. On a 37-transition machine, nobody audits that by eye.

Either way, the bottleneck moved. Formal verification used to be gated on a skill. Now it's gated on a habit.

And there's a bigger reason this matters right now. LLMs write a growing share of production code, and nobody reviews every generated line — that's the point of generating them. But you don't have to review every line. You can constrain every state. A discriminated union the model can't construct illegal values for, a CHECK constraint that rejects the write, an asserted invariant that fails CI the day a generated diff adds the wrong transition — these hold no matter who, or what, typed the code. The state space is the contract you keep while the model holds the keyboard. In the era of generated code, this discipline stops being one more best practice and becomes the way a human stays in charge of a codebase they didn't write.


The cut no app can bypass

Every cut so far lives in your code — which means it stops the moment data crosses a boundary you don't own: the JSON the API just received, the row another service wrote, the message off Kafka. Your Order type guarantees nothing about any of them.

Step back, and every technique in this article — unions, state machines, property tests, model checks — is enforcing the same thing: an invariant, a condition your data cannot violate. Bugs are violated invariants. Crashes, security holes, corrupt rows — all the same shape: "this was supposed to be true and wasn't." You're enforcing them either way, implicitly or explicitly. The only question is where the cut lives, from weakest to strongest:

weakest: does nothing ─────────────────────────▶ strongest: no app can bypass
comments · runtime assert · tests · linters · types · schema · DB constraints
MechanismWhen checkedWhat happens on violation
Comments / docsNeverNothing
Runtime assertAt call site, sometimesCrash, hopefully in dev
TestsAt CI timeBuild fails (for cases you wrote)
LintersAt lint timePR fails (for patterns you encoded)
TypeScript typesAt compile time, erased at runtimeBuild fails, but can be bypassed with as any and stops at the network boundary
Strong types (Rust, Haskell)At compile time, harder to bypassBuild fails earlier and unsafe is opt-in rather than the escape hatch
Schema validation (Zod, Pydantic) — a library that checks incoming data against a declared shapeAt system boundary, at runtimeReject input; on success, the type carries the invariant forward
Database constraintsAt every write, by every client, foreverINSERT/UPDATE rejected — the only layer no application can bypass

The thing nobody emphasizes enough: the database is one of your strongest invariant-enforcement layers, and most teams underuse it.

Each constraint deletes a whole class of bad state — and unlike your types, they hold on every write, from every client, forever, no matter which service forgot:

  • NOT NULL eliminates a state.
  • UNIQUE eliminates a class of duplicate states.
  • FOREIGN KEY eliminates orphan-reference states.
  • CHECK (status IN ('active', 'done', 'archived')) collapses an unbounded text field to three legal values.
Why CHECK and not a native ENUM?

Evolving the set stays an ordinary migration. A Postgres ENUM only lets you ADD VALUE — dropping or reordering one means recreating the whole type and every column built on it. text + CHECK is the trade most teams at scale make: GitLab's schema alone carries 35 enum-whitelist checks doing exactly an ENUM's job.

And the teams at serious scale already live this way. GitLab's schema declares 2,419 CHECK constraints — 82 of them num_nonnulls(...) = 1 exclusive-arc checks ("exactly one of these columns is set," the database half of a discriminated union). It's institutionalized, not incidental: a first-class migration helper, add_multi_column_not_null_constraint, makes pushing "exactly one of these is set" into the schema a one-liner any engineer reaches for. Not aspirational — it's what a codebase looks like once it takes the database seriously as an enforcement layer.

Here's the uncomfortable part. Your model already has a spec for which combinations are legal — it's just split across two layers that don't talk to each other. The application side is the pile of conditional validations every non-trivial model accumulates (validates … if:, validate, conditional callbacks) — GitLab's app/models carries 173 of them, Mastodon 72 — and that's literally a decision table: the predicates are the parameters, the rules fire on combinations nobody enumerates. But those validations drift, because they're code that has to be remembered and update_column skips them entirely. The database side — the CHECKs and partial indexes above — can't drift, because the constraint is the enforcement. So the legal state space is half-declared in a layer that rots and half-declared in a layer that can't. The bugs live in the gap between them.

Postgres footnote: UNIQUE treats NULL as distinct

By default, UNIQUE treats NULL as distinct, so UNIQUE (email) still lets you store any number of rows with NULL email — easy to miss when you're using UNIQUE to enforce "at most one of these." Postgres 15+ added UNIQUE NULLS NOT DISTINCT to make NULL collide with itself. Worth knowing exists.

Two Postgres features in particular deserve more than a passing mention, because they're the ones most application teams skip.

A partial unique index is the idiomatic way to enforce state-machine cardinality at the persistence layer:

CREATE UNIQUE INDEX one_active_subscription_per_customer
  ON subscriptions (customer_id)
  WHERE status = 'active';

After this, the database itself refuses to ever store two simultaneously active subscriptions for the same customer. No race condition, no application bug, no service in another language can produce that state. This isn't a clever trick either — GitLab's schema carries 163 of these state-gated partial unique indexes (WHERE status = ...), enforcing "at most one row in this state" across the codebase.

An EXCLUDE constraint generalizes the same idea to ranges and overlaps. Imagine room bookings:

ALTER TABLE bookings ADD CONSTRAINT no_overlap
  EXCLUDE USING gist (
    room_id WITH =,
    during  WITH &&
  );

Every booking system reimplements "no double-booked rooms" in application code, badly, with race conditions the database just refused to permit.

And if you want to know what a missing constraint costs, the cleanest case isn't Knight Capital — it's Robinhood, 2020–21. The app showed customers false negative balances and fired 84,100 erroneous margin calls; FINRA's record $70 million penalty cited, among the harms, a twenty-year-old who took his own life after the app displayed a false negative balance of $730,000. Knight was a deployment failure wearing a state-space costume. Robinhood is the pure version: "never show a user a balance the ledger doesn't support" is an invariant, and no layer of the stack enforced it.

That's the cut a type can't make: it held even when every layer of application code — the validation, the guard, the recompute — was wrong. Schema validation at the boundary helps; a database constraint helps more, because it doesn't trust the application to be right.

So don't pick one mechanism — stack them. Each layer in that table cuts the state space a little; together they pin the system down to something close to "only valid states are reachable."

Parse, don't validate: the bridge between runtime and compile time

We met this idea up top; here's where it earns its keep in code.

A blob standing in a doorway like a gatekeeper: a chaotic pile of tangled dead branches outside, a tidy living leafy tree inside — parse at the door, trust the type within.

Parsing checks once at the boundary and returns a type with the invariant built in — not a boolean everything downstream has to re-check:

// Validation: check, hope, repeat everywhere
function isValidOrder(input: unknown): boolean {
  /* ... */
}
function processOrder(input: unknown) {
  if (!isValidOrder(input)) throw new Error("bad")
  // input is still `unknown`. You'll check again. And again.
}
 
// Parsing: check once, type carries it forever
const result = OrderSchema.safeParse(rawJson)
if (result.success) {
  const order: Order = result.data // ← invariant now lives in the type
  processOrder(order) // No checks downstream. Compiler enforces.
}

Zod, Pydantic, io-ts, Valibot — same trick: a runtime invariant goes in, a type-level invariant comes out. Pay the cost once, at the door; the compiler enforces it for the rest of the program's life.

Every team I've worked with has a version of this story: a database migration written years after the table was created, adding the constraints that "should have been there on day one." A CHECK here. A foreign key there. Conditions that turn out to be "obviously true" except for the few thousand rows where they aren't, which you clean up before the constraint goes on. A few weeks of work. Months of bugs that suddenly stop happening.

Pick your favorite invariant. Encode it somewhere stronger than a code comment. Repeat until your weekends stop being interrupted.

When not to make states unrepresentable

One honest caveat before the tools, because this whole approach has a failure mode: you can over-tighten.

Sean Goedecke's counterpoint — "Make invalid states unrepresentable" considered harmful — deserves engagement, not dismissal. Every hard constraint is a bet that the rule will never change. In stable domains the bet pays: balances don't go negative, bookings don't overlap, deleted stays deleted. But in young, user-facing domains the rules are still being discovered — and a rule wired into the type system, the schema, and three services is the most expensive possible place to discover it was wrong. Migrations feel this worst: the foreign key that blocks the backfill, the required field that breaks schema evolution, the CHECK you drop at 2 a.m. so the fix can ship.

The resolution is to sort your invariants by how true they are. Hard-constrain the ones that hold by definition of the business — an order has exactly one customer, a refund never exceeds its charge. Keep policy soft — orders over $500 need review, trials last 14 days — because policy is the thing product changes next quarter. The skill isn't maximal rigidity. It's knowing which states are illegal forever and which are merely illegal today.


The branches that survived review

So I built the cutter. prune-states (gem prune_states, npm prune-states) is an MVP — loudly — in two layers, straight out of the patterns above: Ruby and TypeScript, same verdicts. It hunts the branches that slipped past every human reviewer.

  • prove reads a state machine you already have — aasm or state_machines in Ruby (≈99% of verifiable Ruby state machines by downloads), or an XState machine in TypeScript — and proves or breaks two kinds of invariant. Single-field ("once X, always X") becomes a one-step check handed to Z3, Microsoft Research's solver: given any transition out of X, is the next state still X? Cross-field ("these two columns can't contradict") composes two machines and hunts for a reachable forbidden joint state. One question, nothing fancier — can the machine reach a state it shouldn't? No timing, no "does it eventually recover," no infinite state; the slice most production machines actually need. Here it is catching the classic single-field bug:

    $ prune-states OrderBuggy --require app/models/order_buggy.rb \
                                --assert-terminal refunded
    Definition: OrderBuggy
      states: pending, paid, shipped, delivered, refunded
      invariants: 1 (1 asserted)
     
      ✗ RefundedIsTerminal — Asserted: once refunded, always refunded.
          counterexample:
            kind = "refunded"
            kind_next = "pending"
     
    1 violation(s), 0 unknown(s), 0 error(s).

    Exit code 2 — CI catches it the day someone adds a reactivate event that re-opens refunded orders. You can run this exact check in your browser: the playground evaluates the same encoding live and shows the generated SMT-LIB.

    But the single-field "once X, always X" check isn't the dangerous one. The bugs that actually cost money are cross-field: two status columns on one record that quietly contradict each other. An invoice marked voided while its payment_status still reads succeeded — money captured against a cancelled bill, invisible to the ledger. A return item marked accepted (and refunded) while its reception_status is still awaiting — cash out the door for goods that never came back. prove checks these by composing the two machines and asking whether the forbidden joint state is reachable — and if it is, handing back the shortest sequence of events that gets there:

    ✗ VoidedIsNeverPaid — voided and payment=succeeded must not co-occur.
        counterexample (reachable from the initial state):
          status = draft,     payment_status = pending
          status = finalized, payment_status = pending
          status = voided,    payment_status = pending
          status = voided,    payment_status = succeeded   ← the bug, as a reachable path

    I didn't invent those two examples. I pointed the tool at two widely-deployed open-source money systems — a billing engine and a commerce platform — and it surfaced exactly these contradictions. Neither is a unit-test toy: both reproduce at the deployed layer — a payment webhook handler and an admin controller route — red-to-green in the projects' own test suites, each closed by a one-line guard. (A candidate-miner does the finding: point it at a model's status fields and it ranks the pairs worth checking, so a human confirms a short list instead of reading every model.)

    It runs on the real machines, not sanitized snippets — including booting Solidus's metaprogrammed define_state_machine! to verify the Order graph that actually ships — and it's honest about its own limit: one invariant comes back a false positive because a recalculate callback re-transitions behind the graph's back, invisible to declaration-level extraction. (Full dogfood runs — including GitLab's Ci::Pipeline — are in the repo.)

  • cover — the Step 3 tool: turns a 2^N flag space into the dozen-odd covering-array tests the NIST interaction rule says actually matter. Pure TypeScript and Ruby — no PICT, no Java, no native binary — with an independently-verified coverage report.

It's all intentionally small. The point was never the perfect tool — it's to make these cuts land in your codebase before you write the next boolean. Because every boolean you write today is a Friday night you won't get back.


The political problem

Everything so far is technique: discriminated unions, covering arrays, database constraints, model checkers. All useful. None of them solve the actual problem.

The actual problem is that someone in a meeting says "what if we let users decide?" and nobody in the room does the math on what that sentence costs in state space. The actual problem is that the PM wants "just one more option" and doesn't realize that one option doubles the configurations QA needs to verify. The actual problem is every system whose flexibility was its selling point and ended up being its downfall.

Your job as a programmer — the part they don't teach you, the part that isn't writing code — is to push back. Not because you're lazy. Not because you don't want to build things. Because you're the only person in the room who understands that complexity isn't linear. Two features don't make a system twice as complex. They make it four times as complex. Three features? Eight times. And nobody notices until it's too late, because each feature looks simple on its own.

This isn't unique to software. Every profession that's been around long enough develops an immune system against requested complexity. Editors cut the chapter the writer loves most, because every darling makes the rest of the book harder to read. Doctors refuse the antibiotic the patient Googled, because every prescription carries interactions the patient isn't pricing. The professional knows what the client doesn't: additions have invisible costs, and the costs compound. Programmers do the same job in a different vocabulary — everyone around you has a structural incentive to add; yours is to cut, and to back it up with the math.

And let's be honest about what "push back" costs, because the advice is cheap and the act isn't: it's political capital, spent against an incentive structure that rewards the person shipping the feature, not the person counting its states. I've watched correct math lose to a roadmap commitment. That's not a reason to skip the math — it's the reason to bring a number instead of an opinion. "I don't like this design" is dismissible. "This change costs 96 new states and here's the bug class they breed" at least forces someone to own the trade.

Better yet, make the math automatic. Give the state space a budget, the way you budget bundle size: a CI bot that comments on the PR — "this change adds two booleans to CheckoutProps: 32 representable states → 128." One caveat from everyone who has watched bundle-size bots die: a gate gets disabled the first time it blocks a roadmap feature. Ship it as information, not enforcement — a number on the diff, a champion senior enough to keep it normal — and it survives long enough to change what "normal" means.

The data migration is cheaper than what comes after — cheaper than the bugs you can't reproduce because you can't reconstruct the state, cheaper than the escalations from configurations nobody anticipated, cheaper than the team you eventually staff to answer "what is this customer's setup even supposed to look like?"

Simple systems are not a luxury. They're a survival strategy.

Start cutting

You don't need the whole ladder on day one. Pick one cut and make it.

Today. Find one model with three-plus status booleans, or a status string with no CHECK behind it. Replace the booleans with a discriminated union, or add the constraint. One bad state that can no longer be spelled — the cheapest cut there is, and you'll feel it.

This week. Take the cross-field invariant your app assumes but never enforces — "a voided invoice is never paid," "an active subscription has a customer" — and push it into the database: a CHECK, a partial unique index, an EXCLUDE. The cut no service can route around.

This month. Point prune-states at a state machine you own and let it tell you which branch you forgot. Or give your state space a budget — the CI comment from the last section — so the next doubling lands on the diff instead of in production.

Each cut is small. Each deletes a class of bug for good. Stack a few and your weekends stop getting interrupted.

Steal these prompts. Drop one into your coding agent — Claude Code, Codex, Cursor. It reads your repo, so there's nothing to paste:

Find a model with several boolean or status fields and refactor it into a
discriminated union so the invalid combinations can't be constructed. Add the
Postgres CHECK / partial index that enforces the same invariant at the database
layer. Tell me which states you just deleted.
Pick a model with two status fields and list every pair of values that's
representable but must never co-occur (e.g. voided + paid). For each, add the
assertion or test that fails the moment that combination becomes reachable.
Find a cluster of feature flags read together. Which combinations are nonsense
or never tested? Collapse them into one typed config, and generate a
covering-array test set that hits every remaining pair.

In Claude Code those three are one command: /plugin marketplace add zernie/prune-states, then /prune-states:make-states-unrepresentable — it finds the worst offender in your repo and refactors it. The model proposes; the compiler and the CHECK constraint judge.

We opened with a timeline forking out of control — yours, and now the model's. You can't read every branch it writes; that ship has sailed. But you were never going to win by reading faster. You win by making the bad branches impossible: a state that can't be spelled is a bug that can't ship, no matter who — or what — is holding the keyboard.

Prune the timeline. One cut at a time.


References