
Contents
- Twenty flags is a million timelines
- Now the AI is branching it too
- Pruning means the bad branch can't be spelled
- When a type can't make the cut
- The cut no app can bypass
- The branches that survived review
- The political problem
- Start cutting
- References
ArticleWidget case) or a plain ; keep it a no-op until the file lands so the .md stays portable. -->
Twenty flags is a million timelines
Open your feature flag dashboard. Count the flags older than six months you "meant to clean up." I'll wait.
Now multiply them. Each flag forks the timeline — on or off. One flag, two timelines. Two, four. Three, eight. It doubles every single time.
Your feature flags
20 flags
1,048,576
possible configurations — doubling with every flag
Testable? no — nobody writes 1,048,576 test cases
Test one configuration per second, around the clock: about 12 days to get through them.
What actually gets tested: the dozen you flip on purpose. The rest just exist — in production.
Drag it to your team's number. I'll wait for that too.
Nobody tests a million timelines. You test the dozen you flip on purpose for cohorts. The rest just exist. In production. Never seen by QA, waiting for the right user to load the right page in the right order.
I keep coming back to Knight Capital. 2012: $440 million gone in 45 minutes. One feature flag, eight servers. Seven got the new deploy, one didn't — and when the flag flipped on, server eight ran eight-year-old dead code nobody had pruned. The whole state space was sixteen combinations. They'd tested one. Forty-five minutes ended the company. The full postmortem is worth your time.
And the bug never looks like a bug. It looks like "customer X is seeing the wrong price." You chase it as a race condition until someone pulls the flag history and sees this exact three-flag combo has never happened in the system's life, never will again after Friday's release, and is breaking precisely one invoice right now. Nobody designed that state. Every flag did what it said on the toggle. The combination was a timeline no human ever imagined.
Nobody told me this in school. Most bugs aren't logic errors. They're states you forgot existed — branches of the timeline nobody meant to leave open.
And it isn't just flags. A form with 15 optional fields has 32,768
presence-and-absence combinations. A user model with isAdmin, isVerified,
isBanned, isDeleted has 16 states, about four of which make sense.
Subscription billing with its out-of-order Stripe webhooks, a background job
that retries through a half-finished write, a distributed system where every
message can arrive twice or never — same shape every time. The set of states
your code can be in dwarfs the set it actually handles.
The gap between the two is where every bug lives.

And you can't test your way out — the math is against you, always has been. Knight was the cheap version; about a quarter of cloud outages trace back to misconfiguration nobody tested in combination. The state space expands by default; your ability to verify it grows linearly at best. The gap widens every time someone ships a feature.
Now the AI is branching it too
Here's what changed in 2026: you're not the only one forking the timeline anymore.
An LLM writes half your code now, and nobody reads every line — that's the point of generating it. The branches multiply faster than any human can review. Reviewing-by-reading, the thing we've leaned on since forever, is quietly over.
So stop trying to read every branch. Prune the ones that should never exist — make them impossible to even write down. Done right, the state space stops being a liability and becomes the one contract you can still enforce while the model holds the keyboard. A branch the AI can't spell is a bug it can't ship.
(Yes — like the TVA pruning timelines in Loki. That's literally why the tool
at the end of this piece is called prune-states.)
That's the whole article: find the branches that shouldn't exist, and cut them. Cheapest cut first.
Pruning means the bad branch can't be spelled
The two ideas under all of this are old, both from the functional programming world. Make illegal states unrepresentable — Scott Wlaschin's phrasing of an idea Yaron Minsky was teaching a decade earlier: design the shape itself so the bad states have no spelling. If a state shouldn't exist, make it so you can't even construct it. Parse, don't validate — Alexis King: check incoming data once, at the boundary, and return a type that carries the proof, so the whole interior of your program runs on values correct by construction and never re-checked.
Same move, two angles: shrink what's representable until it matches what's valid, and do the cut once, at the door. Both work in any modern language — not just Haskell.
Kill the booleans
The classic example. You have an order with three boolean fields:
interface Order {
isDraft: boolean
isPublished: boolean
isArchived: boolean
}Three booleans. Eight possible states. But only three are valid: draft,
published, archived. The other five — isDraft && isPublished,
isPublished && isArchived && isDraft, etc. — are nonsense. But your
code has to deal with them because the type system says they exist.
Order · 3 status booleans
Three booleans, eight timelines — but only three are real states. The other five are branches nobody designed. Prune them.
The fix is a discriminated union — a type that says "this value is exactly one of these shapes, never a mix." Some languages call it a "tagged union," "sum type," or "variant." Same idea everywhere:
// TypeScript
type Order =
| { status: "draft"; content: string }
| { status: "published"; content: string; publishedAt: Date }
| { status: "archived"; content: string; archivedAt: Date }Eight states → three. publishedAt only exists when the order is published.
archivedAt only exists when it's archived. You can't construct a
published-and-draft order because there's no type for that. Not
"it's not allowed at runtime" — it does not exist as a type. The bad
states aren't rejected; they were never representable to begin with.
And when you switch on this, the compiler makes you handle every case:
// TypeScript — miss a case and it won't compile
function render(order: Order) {
switch (order.status) {
case "draft":
return renderDraft(order.content)
case "published":
return renderPublished(order.content, order.publishedAt)
case "archived":
return renderArchived(order.content, order.archivedAt)
}
}And the same idea works at the database level. SQL is weaker than a type system, but it's enforced on every write, by every client, forever:
-- PostgreSQL — the database refuses to store illegal states
CREATE TABLE orders (
id UUID PRIMARY KEY,
status TEXT NOT NULL CHECK (status IN ('draft', 'published', 'archived')),
content TEXT NOT NULL,
published_at TIMESTAMPTZ,
archived_at TIMESTAMPTZ,
-- Same invariant as the discriminated union, at the persistence layer.
-- Note the archived branch drops `published_at` — matching the TS
-- variant where archived doesn't carry publishedAt. If you want to
-- preserve publish history through archive, model it explicitly.
CHECK (
(status = 'draft' AND published_at IS NULL AND archived_at IS NULL) OR
(status = 'published' AND published_at IS NOT NULL AND archived_at IS NULL) OR
(status = 'archived' AND published_at IS NULL AND archived_at IS NOT NULL)
)
);Less elegant. Less expressive. But your TypeScript types stop at the network
boundary, and your database doesn't. Even if some service in another language
forgets the invariant, the CHECK constraint catches it. We'll come back to
this.
Add a fourth status? The compiler tells you every place that needs updating. Not "please remember to check everywhere." The code won't build until you do.
Same trick, applied to the running example
The same idea works on whatever multiplicative thing you're modeling. To close the loop on the running example: instead of N independent feature flag booleans, define the valid configurations as a union.
// Before: 4 booleans = 16 possible states
const newUi = client.boolVariation("new-ui", false)
const newPayments = client.boolVariation("new-payments", false)
const fullRedesign = client.boolVariation("full-redesign", false)
const layout = client.stringVariation("layout", "single-page")
// What does newPayments=true + newUi=false even mean?
// What about fullRedesign=true + newUi=true?
// Nobody knows. Nobody tested it.// After: 4 valid configurations
type CheckoutConfig =
| { variant: "control" }
| { variant: "new-ui"; layout: "single-page" | "multi-step" }
| {
variant: "new-ui-with-payments"
layout: "single-page" | "multi-step"
provider: "stripe" | "square"
}
| { variant: "full-redesign" }Sixteen states → four. The provider field only exists on the variant that
uses it. layout doesn't exist on control. You can't accidentally check
for config.provider when the user is in the control group — TypeScript
won't let you.
Do the conversion in one place. A single resolve function maps the raw flag
values into a typed config at the boundary; after that, the type system does
the rest.
▶One resolve at the boundary, one exhaustive switch
function resolveCheckout(raw: RawFlags): CheckoutConfig {
if (flag(raw, "full-redesign")) return { variant: "full-redesign" }
if (flag(raw, "new-ui") && flag(raw, "new-payments")) {
return {
variant: "new-ui-with-payments",
layout: flagStr(raw, "checkout-layout", "single-page"),
provider: flagStr(raw, "payment-provider", "stripe"),
}
}
if (flag(raw, "new-ui")) {
return {
variant: "new-ui",
layout: flagStr(raw, "checkout-layout", "single-page"),
}
}
return { variant: "control" }
}
// Exhaustive — won't compile if you miss a variant
function render(config: CheckoutConfig) {
switch (config.variant) {
case "control":
return renderLegacy()
case "new-ui":
return renderNewUi(config.layout)
case "new-ui-with-payments":
return renderFull(config.layout, config.provider)
case "full-redesign":
return renderRedesign()
}
}Parse, don't validate. At the boundary. Once.
And the invoice bug from the opening? With the flags collapsed into a union, that three-flag configuration no longer has a spelling.
When a type can't make the cut
The type is the cheapest cut, but not everything fits in one. Sometimes the state space is genuinely large and you can't shrink it further; sometimes the business won't let you. So you keep cutting — a small ladder of tools, each one costing more and buying a stronger guarantee than the last:
weaker: legal moves only ──────────────────▶ stronger: proved across all states
state machines · property tests · combinatorial · model checking
Step 1 — State machines
Same cut, one level up: instead of pruning bad states, prune bad moves. A
flat list of booleans lets you set isPublished true and forget to clear
isArchived. A state machine declares the legal transitions up front and
refuses the rest. Draft can become published. Published can be archived.
Archived can't go back to draft. Ever.
▶Step 1 in Ruby (AASM) and TypeScript (XState)
# Ruby with AASM
class Order < ApplicationRecord
include AASM
aasm do
state :draft, initial: true
state :published, :archived
event(:publish) { transitions from: :draft, to: :published }
event(:archive) { transitions from: [:draft, :published], to: :archived }
# Note: no transition back from :archived. Once archived, terminal.
end
end
order.publish! # OK
order.archive! # OK from draft or published
archived_order.publish! # AASM::InvalidTransition — refused at runtimeThe machine is the documentation and the enforcement. Reachable states are finite, declared in one place, and any transition not in the table gets refused. The graph itself is the type. (When a flat list isn't enough, Harel's statecharts add nesting and parallelism — that's where XState's hierarchical states come from.)
Step 2 — Property-based testing
When the space is too big to enumerate and you can't shrink it further, the
laziest probe that still finds real bugs is to sample it at random. Instead
of "given input X, expect output Y," you declare a property that should
hold for any input and let the framework throw hundreds of random cases at
it trying to break it. fast-check
(TypeScript), Hypothesis (Python), and
PropCheck (Ruby) all do this. You're
not covering the whole state space — you're sampling it intelligently, which
still beats the three cases you'd have picked by hand. And when the thing
under test is a state machine rather than a pure function, the stateful
variants — Hypothesis's
RuleBasedStateMachine,
fast-check's model-based commands —
random-walk the transitions, not just the inputs: a model checker that
samples paths instead of enumerating them.
But random sampling has a blind spot worth seeing. Here's the same 32-configuration space with a two-flag bug planted in it, sampled three ways: exhaustively (all 32), with a covering array (every pair by construction — the next rung), and at random. Flip between them, move the bug, roll the dice:
One space · three strategies
5 flags · 32 configurations
The planted bug fires when A = off and D = off — a two-flag interaction, where 70–97% of real-world failures live.
Caught — and it always will be. These 6 rows contain every pair of flag values at least once, so any two-flag bug is in them by construction. Move the bug and watch. (Two-flag bugs are the guarantee — that's what t = 2 buys. A three-flag bug needs strength-3 rows; see the caveat in the prose.)
A property test rolls the dice a few hundred times; the one combination that sinks you might be the roll it never made. The next rung closes exactly that gap — sample systematically instead of randomly, and the buggy pair has nowhere left to hide. That's the guarantee you just watched the covering array keep, every time you moved the bug.
Step 3 — Combinatorial testing
Here's the one piece of good news in the whole article. Go back to the opening. Twenty flags, 1,048,576 configurations, and the despair was "nobody can test a million configs." Turns out you don't have to. Decades of empirical work at NIST found that real bugs cluster at low order: 70–97% of failures are triggered by a single parameter or one pair of them, ~98% by three or fewer, and across every dataset they studied, no real-world fault has ever needed more than six interacting parameters. The million-config space is real, but the bugs hiding in it almost all live in its pairs and triples.
So you don't cover the million — you cover the pairs. A covering array is
a small set of test configurations that includes every pair of values at least
once, and it stays small: all pairs among those twenty flags fit in as few
as seven configurations. 1,048,576 → seven. That's the number the opening
was missing. (The three-flag invoice bug from the opening? Bump the strength
to t = 3 and its exact combination lands in the rows by construction.)
Almost nobody does this: GitLab's open-source repo ships roughly five hundred
feature flags, documents and tests them one flag at a time, and has no
combinatorial strategy at all — and they're more disciplined about flags than
most of us. The generators are off-the-shelf —
NIST's own ACTS
and Microsoft's PICT on the command line,
or a one-line library right inside your test runner (concrete example below).
Don't take the collapse on faith — build it. Drag the flag count and watch
2^N drop to a handful of rows, then click any two columns and confirm all
four value pairs are still there:
Build the covering array
5 flags
32
exhaustive — every configuration
6
covering array — every pair
Click any two columns to inspect a pair:
All four present. Pick any other two columns — same story. Every pair of flag values lives in these 6 rows, not the 32 the exhaustive grid would need. At twenty flags that ratio is 1,048,576 → seven.
▶How covering arrays work, and the tools that build them
Three booleans — eight combinations — need just four rows to hit all pairs:
A B C
0 0 0
0 1 1
1 0 1
1 1 0Check it: every pair of columns shows all four of 00, 01, 10, 11. The
magic is how that scales. Covering-array size grows with the logarithm of
the parameter count, and exponentially only in t — the interaction strength
you're covering — not in the number of parameters. That's why twenty flags
collapse to a handful of rows instead of a million.
You write one text file — PICT, Microsoft's generator, is the usual default:
new-ui: on, off
new-payments: on, off
full-redesign: on, off
layout: single-page, multi-step
IF [full-redesign] = "on" THEN [new-ui] = "off";pict model.txt prints the rows; each row is one test config. Constraints
(the IF/THEN) keep it from generating combinations you've already made
illegal. NIST's own ACTS tool
does the same up to 6-way, with a GUI or CLI.
The easiest path skips the shell-out entirely: a covering-array library generates the rows in-process and hands them straight to your test runner's parametrize hook — no binary, no parsing. In Python with allpairspy:
from allpairspy import AllPairs
import pytest
flags = {
"new_ui": [True, False],
"new_payments": [True, False],
"full_redesign": [True, False],
}
@pytest.mark.parametrize("cfg", AllPairs(list(flags.values())))
def test_checkout(cfg):
new_ui, new_payments, full_redesign = cfg
assert checkout_works(new_ui, new_payments, full_redesign)Three flags, all pairs, in the handful of cases the table above predicts — just a decorator, no new infrastructure. Pick whichever fits your stack:
| Where you run it | Tool |
|---|---|
| CLI, any language | PICT · NIST ACTS (≤ 6-way) |
| Python test runner | allpairspy |
| TypeScript / browser | covertable · pict-pairwise-testing · prune-states (cover) |
One mechanical caveat: pairwise is still sampling. It provably covers every
pair, but a fault that only fires on a specific three- or four-way combination
can slip through (a medical-device study found a real four-way one). Bump to
--o:3 for anything safety-critical; the cost is more rows, not a different
tool.
That last point — pairwise is sampling, not proof — is also the one most in the spirit of this article: combinatorial testing is the consolation prize. If you can collapse those flags into a discriminated union so the nonsense combinations can't exist, do that — a bug that can't be represented needs no test at all. Covering arrays are for the irreducible remainder: the flags you genuinely can't merge, the legacy config you haven't refactored yet, the third-party surface you don't control.
But even for that remainder, a covering array still only samples. It guarantees every pair — bump the strength and every triple — yet that's a fixed interaction order across a fixed set of parameters, not a proof across every state your system can reach. The four-way bug slips a pairwise array; the bug that needs a specific sequence of transitions was never an "interaction" at all. Which begs the question: what if you could stop sampling — not a few hundred random configs, not just the pairs, but prove the property holds across every reachable state your system has?
Well. There is.
Step 4 — Model checking (formal verification)
The heaviest step. A model checker takes a precise description of your system — every state, every transition — and exhaustively explores all of them. Not random samples. Every reachable state, by construction. When an invariant fails, it hands you the exact sequence of steps that broke it.
Sounds academic until you notice who runs it. AWS has used TLA+ on S3 and DynamoDB since 2011. Azure verifies Cosmos DB the same way. seL4 — a microkernel proved correct down to the assembly — flies in military drones. Not research prototypes. Production systems you depend on right now.
And the lesson isn't "everyone should write TLA+." AWS's 2025 follow-up is blunter than that: their correctness work now spans a whole toolkit — the P language, property-based testing, deterministic simulation — because TLA+ alone didn't scale to thousands of engineers. The takeaway is the one this section is built on: pick the cheapest tool that exhausts your state space.
The entry barrier is one Java JAR. Let me show you the smallest version that makes the point.
Take our Todo machine: four states (active, done, archived, deleted)
and the transitions between them. State a property it must never break —
once deleted, always deleted — and TLC, the checker, walks every path from
the start state and confirms it holds at each one. Four reachable states, all
four checked. Not "I tried a handful of cases" — checked by construction,
across the entire reachable state space. (One honest caveat: that's a proof
of your model, not your system — which is exactly what you asked for.
TLAPS proves; TLC checks.)
The payoff lands when someone changes the machine. Imagine a teammate adds
an "undo delete" transition, so deleted can go back to active. Looks
innocent. Compiles fine in any language. A property test might miss it
unless the random generator happens to delete-then-restore in sequence. TLC
catches it instantly — and it doesn't just say "failed." It hands you the
exact minimal sequence that breaks the property: active → deleted → active, down to the line and column of the offending transition. A model
checker doesn't return a red X; it returns a movie of how the system
breaks.
Don't take the prose's word for it — here is that exact machine, live. Click Todo — undo-delete bug, or edit the JSON and add the restore transition yourself, and watch the invariant break with the counterexample attached. (Under the hood it hands the same one-step check to Z3, Microsoft Research's solver — think of it as a spell-checker for logical rules. Expand any verdict to see the exact question it's asked.)
prune-states · prove — live
Open full pageA state mapping to [] is terminal. Transitions are state names or { to, event?, when?, not? } — when/not name a guard, checked as free booleans (a sound overapproximation, same as the package).
Todo · 4 states
Assert terminal
“This state should be terminal — prove it.” The escape hatch for invariants the structure no longer shows.
Invariants · 1
DeletedIsTerminalproved
Once kind = deleted, the machine stays there.
SMT-LIB — reproduce with
z3 check.smt2; DeletedIsTerminal: Once kind = deleted, the machine stays there. ; Verification by inductive negation — unsat means proved. (set-logic ALL) (declare-datatypes ((State 0)) (((s_active) (s_done) (s_archived) (s_deleted)))) (declare-const kind State) (declare-const kind_next State) ; Transition relation: kind' is determined by kind + guards. (assert (or (and (= kind s_active) (= kind_next s_done)) (and (= kind s_active) (= kind_next s_archived)) (and (= kind s_active) (= kind_next s_deleted)) (and (= kind s_done) (= kind_next s_archived)) (and (= kind s_done) (= kind_next s_deleted)) (and (= kind s_archived) (= kind_next s_deleted)) (and (= kind s_deleted) (= kind_next s_deleted)))) ; Negation of the invariant — sat means a counterexample exists. (assert (= kind s_deleted)) (assert (not (= kind_next s_deleted))) (check-sat) (get-model)
That's the whole pitch, and you don't need to read a line of TLA+ to get it — TLA+'s syntax isn't easy, which is half the reason the tool stays niche. It's only about fifteen lines, though, so if you want to see the spec and the actual output, here it is:
▶The full TLA+ spec, line by line
Four pieces of vocabulary if you haven't seen TLA+ before:
/\means and.- A prime mark like
status'means the value ofstatusin the next state. \inmeans element of.- The shape
[][P]_vmeans "always ([]) — in every step of every behavior — the inner conditionPholds, considering only steps that may changev." Think of it as a temporalforall: "for every step, this is true." The_statuspart is the variable the property is about.
The rest reads like predicate logic.
------------------------------ MODULE Todo ------------------------------
VARIABLES status
States == {"active", "done", "archived", "deleted"}
Init == status = "active"
Complete == status = "active" /\ status' = "done"
Archive == status \in {"active", "done"} /\ status' = "archived"
Delete == status' = "deleted" \* delete works from any state
Next == Complete \/ Archive \/ Delete
\* The property we want: once deleted, always deleted. This is a
\* temporal property (about behaviors over time), not a state
\* invariant — so it goes under PROPERTY in the .cfg, not INVARIANT.
DeletedIsTerminal == [][status = "deleted" => status' = "deleted"]_status
Spec == Init /\ [][Next]_status
=========================================================================with a one-line Todo.cfg:
SPECIFICATION Spec
PROPERTY DeletedIsTerminalRun TLC:
Model checking completed. No error has been found.
8 states generated, 4 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 2.Four reachable states. The exact four. TLC walked every path from active
and confirmed DeletedIsTerminal holds across all of them. (For bigger
specs you bound the constants and it's "checked up to that bound," but the
guarantee is the same in kind.)
Add an undo delete transition — Restore == status' = "active" — and TLC
fails instantly, handing back the exact path active → deleted → active, down
to the line and column of the offending action. That's the same counterexample
the playground above produces live when you add the restore edge yourself.
For a Todo app this is wild overkill — discriminated unions already catch 99% of what a model checker would find in your CRUD code. The mistake is reading that as "model checking is academic." Flip it: for a payment retry protocol, a state machine spanning microservices, a leader election, a permission model — anything where concurrency or partial failure can interleave — a spec is the cheapest place to find the bug. Those interleavings are exactly what no test enumerates and no code review catches by eye. AWS and Azure don't run TLA+ to show off; at that scale the alternative is finding the bug in production, once, expensively.
So treat it as part of the job, not a research detour. If you build or change distributed or safety-critical state machines, writing and checking a small spec belongs in your toolbox right next to writing tests — reached for whenever the state is concurrent, replicated, or irreversible. It's a normal engineering skill, not a PhD.
The expensive half just got cheap
Here's the part that changed recently, and it changes the whole calculus. The cost of formal methods was never the checker — TLC and Z3 are free and finish in seconds at this scale. The cost was the spec: somebody with rare knowledge had to write it. When Hillel Wayne asked "why don't people use formal methods?", his answer was essentially pedagogy — the barrier was learning to write specs, not running them.
That barrier just collapsed, from two directions at once.
From above: LLMs write passable TLA+ now, and there are off-the-shelf Claude Code skills for exactly this loop — write the spec, run TLC, debug the violation. One discipline makes it safe: the checker, not the model, is the ground truth. An LLM will happily emit a plausible spec that's secretly the textbook's Paxos rather than your system — a 2026 study caught a "spec for Etcd's Raft" that was really the Raft paper's appendix. A wrong spec that goes red costs minutes; the quiet risk is the one that goes green, because a checker only vouches for the spec it was given. So generation doesn't remove the human — it moves your job from writing the spec to reading it, which for these state machines takes minutes.
From below: if your state machine already exists in code, the spec doesn't
need to be written or generated — it can be extracted, mechanically,
with no model in the loop at all. Your aasm block, your XState machine,
your reducer already are the spec; a few hundred lines of extractor lift
them into a checker's input format. That's the chore I automated — more on
prune-states below, or try it right now
in the browser playground.
Extraction is also how you ask rude questions of machines you didn't
write. I pointed my extractor at GitLab's Ci::Pipeline — 13 statuses, 12
events, the state machine behind every CI run on gitlab.com — and asked
the question everyone would bet on: surely canceled is final? Z3 returned
the edge that says no: enqueue re-runs even canceled pipelines, because
that's what the Retry button is. The machine has zero terminal states,
by design. Not a bug — but it falsifies an assumption plenty of cleanup
jobs and billing queries would happily ship on top of. That's the other
thing a prover is for: not just catching bugs, but breaking the
assumptions you'd otherwise build on. On a 37-transition machine, nobody
audits that by eye.
Either way, the bottleneck moved. Formal verification used to be gated on a skill. Now it's gated on a habit.
And there's a bigger reason this matters right now. LLMs write a growing
share of production code, and nobody reviews every generated line — that's
the point of generating them. But you don't have to review every line.
You can constrain every state. A discriminated union the model can't
construct illegal values for, a CHECK constraint that rejects the write,
an asserted invariant that fails CI the day a generated diff adds the wrong
transition — these hold no matter who, or what, typed the code. The state
space is the contract you keep while the model holds the keyboard. In the
era of generated code, this discipline stops being one more best practice
and becomes the way a human stays in charge of a codebase they didn't
write.
The cut no app can bypass
Every cut so far lives in your code — which means it stops the moment data
crosses a boundary you don't own: the JSON the API just received, the row
another service wrote, the message off Kafka. Your Order type guarantees
nothing about any of them.
Step back, and every technique in this article — unions, state machines, property tests, model checks — is enforcing the same thing: an invariant, a condition your data cannot violate. Bugs are violated invariants. Crashes, security holes, corrupt rows — all the same shape: "this was supposed to be true and wasn't." You're enforcing them either way, implicitly or explicitly. The only question is where the cut lives, from weakest to strongest:
weakest: does nothing ─────────────────────────▶ strongest: no app can bypass
comments · runtime assert · tests · linters · types · schema · DB constraints| Mechanism | When checked | What happens on violation |
|---|---|---|
| Comments / docs | Never | Nothing |
| Runtime assert | At call site, sometimes | Crash, hopefully in dev |
| Tests | At CI time | Build fails (for cases you wrote) |
| Linters | At lint time | PR fails (for patterns you encoded) |
| TypeScript types | At compile time, erased at runtime | Build fails, but can be bypassed with as any and stops at the network boundary |
| Strong types (Rust, Haskell) | At compile time, harder to bypass | Build fails earlier and unsafe is opt-in rather than the escape hatch |
| Schema validation (Zod, Pydantic) — a library that checks incoming data against a declared shape | At system boundary, at runtime | Reject input; on success, the type carries the invariant forward |
| Database constraints | At every write, by every client, forever | INSERT/UPDATE rejected — the only layer no application can bypass |
The thing nobody emphasizes enough: the database is one of your strongest invariant-enforcement layers, and most teams underuse it.
Each constraint deletes a whole class of bad state — and unlike your types, they hold on every write, from every client, forever, no matter which service forgot:
NOT NULLeliminates a state.UNIQUEeliminates a class of duplicate states.FOREIGN KEYeliminates orphan-reference states.CHECK (status IN ('active', 'done', 'archived'))collapses an unbounded text field to three legal values.
▶Why CHECK and not a native ENUM?
Evolving the set stays an ordinary migration. A Postgres ENUM only lets you
ADD VALUE — dropping or reordering one means recreating the whole type and
every column built on it. text + CHECK is the trade most teams at scale
make: GitLab's schema alone carries 35 enum-whitelist checks doing exactly an
ENUM's job.
And the teams at serious scale already live this way. GitLab's schema declares
2,419 CHECK constraints — 82 of them num_nonnulls(...) = 1 exclusive-arc
checks ("exactly one of these columns is set," the database half of a
discriminated union). It's institutionalized, not incidental: a first-class
migration helper, add_multi_column_not_null_constraint, makes pushing
"exactly one of these is set" into the schema a one-liner any engineer reaches
for. Not aspirational — it's what a codebase looks like once it takes the
database seriously as an enforcement layer.
Here's the uncomfortable part. Your model already has a spec for which
combinations are legal — it's just split across two layers that don't talk to
each other. The application side is the pile of conditional validations every
non-trivial model accumulates (validates … if:, validate, conditional
callbacks) — GitLab's app/models carries 173 of them, Mastodon 72 — and
that's literally a decision table: the predicates are the parameters, the
rules fire on combinations nobody enumerates. But those validations drift,
because they're code that has to be remembered and update_column skips them
entirely. The database side — the CHECKs and partial indexes above —
can't drift, because the constraint is the enforcement. So the legal
state space is half-declared in a layer that rots and half-declared in a layer
that can't. The bugs live in the gap between them.
▶Postgres footnote: UNIQUE treats NULL as distinct
By default, UNIQUE treats NULL as distinct, so UNIQUE (email) still
lets you store any number of rows with NULL email — easy to miss when
you're using UNIQUE to enforce "at most one of these." Postgres 15+ added
UNIQUE NULLS NOT DISTINCT to make NULL collide with itself. Worth knowing
exists.
Two Postgres features in particular deserve more than a passing mention, because they're the ones most application teams skip.
A partial unique index is the idiomatic way to enforce state-machine cardinality at the persistence layer:
CREATE UNIQUE INDEX one_active_subscription_per_customer
ON subscriptions (customer_id)
WHERE status = 'active';After this, the database itself refuses to ever store two simultaneously
active subscriptions for the same customer. No race condition, no
application bug, no service in another language can produce that state. This
isn't a clever trick either — GitLab's schema carries 163 of these
state-gated partial unique indexes (WHERE status = ...), enforcing
"at most one row in this state" across the codebase.
An EXCLUDE constraint generalizes the same idea to ranges and
overlaps. Imagine room bookings:
ALTER TABLE bookings ADD CONSTRAINT no_overlap
EXCLUDE USING gist (
room_id WITH =,
during WITH &&
);Every booking system reimplements "no double-booked rooms" in application code, badly, with race conditions the database just refused to permit.
And if you want to know what a missing constraint costs, the cleanest case isn't Knight Capital — it's Robinhood, 2020–21. The app showed customers false negative balances and fired 84,100 erroneous margin calls; FINRA's record $70 million penalty cited, among the harms, a twenty-year-old who took his own life after the app displayed a false negative balance of $730,000. Knight was a deployment failure wearing a state-space costume. Robinhood is the pure version: "never show a user a balance the ledger doesn't support" is an invariant, and no layer of the stack enforced it.
That's the cut a type can't make: it held even when every layer of application code — the validation, the guard, the recompute — was wrong. Schema validation at the boundary helps; a database constraint helps more, because it doesn't trust the application to be right.
So don't pick one mechanism — stack them. Each layer in that table cuts the state space a little; together they pin the system down to something close to "only valid states are reachable."
Parse, don't validate: the bridge between runtime and compile time
We met this idea up top; here's where it earns its keep in code.

Parsing checks once at the boundary and returns a type with the invariant built in — not a boolean everything downstream has to re-check:
// Validation: check, hope, repeat everywhere
function isValidOrder(input: unknown): boolean {
/* ... */
}
function processOrder(input: unknown) {
if (!isValidOrder(input)) throw new Error("bad")
// input is still `unknown`. You'll check again. And again.
}
// Parsing: check once, type carries it forever
const result = OrderSchema.safeParse(rawJson)
if (result.success) {
const order: Order = result.data // ← invariant now lives in the type
processOrder(order) // No checks downstream. Compiler enforces.
}Zod, Pydantic, io-ts, Valibot — same trick: a runtime invariant goes in, a type-level invariant comes out. Pay the cost once, at the door; the compiler enforces it for the rest of the program's life.
Every team I've worked with has a version of this story: a database
migration written years after the table was created, adding the
constraints that "should have been there on day one." A CHECK here.
A foreign key there. Conditions that turn out to be "obviously true"
except for the few thousand rows where they aren't, which you clean
up before the constraint goes on. A few weeks of work. Months of bugs
that suddenly stop happening.
Pick your favorite invariant. Encode it somewhere stronger than a code comment. Repeat until your weekends stop being interrupted.
When not to make states unrepresentable
One honest caveat before the tools, because this whole approach has a failure mode: you can over-tighten.
Sean Goedecke's counterpoint —
"Make invalid states unrepresentable" considered harmful —
deserves engagement, not dismissal. Every hard constraint is a bet that the
rule will never change. In stable domains the bet pays: balances don't go
negative, bookings don't overlap, deleted stays deleted. But in young,
user-facing domains the rules are still being discovered — and a rule
wired into the type system, the schema, and three services is the most
expensive possible place to discover it was wrong. Migrations feel this
worst: the foreign key that blocks the backfill, the required field that
breaks schema evolution, the CHECK you drop at 2 a.m. so the fix can ship.
The resolution is to sort your invariants by how true they are. Hard-constrain the ones that hold by definition of the business — an order has exactly one customer, a refund never exceeds its charge. Keep policy soft — orders over $500 need review, trials last 14 days — because policy is the thing product changes next quarter. The skill isn't maximal rigidity. It's knowing which states are illegal forever and which are merely illegal today.
The branches that survived review
So I built the cutter.
prune-states (gem
prune_states, npm prune-states) is an MVP — loudly — in two layers,
straight out of the patterns above: Ruby and TypeScript, same verdicts. It
hunts the branches that slipped past every human reviewer.
-
provereads a state machine you already have —aasmorstate_machinesin Ruby (≈99% of verifiable Ruby state machines by downloads), or an XState machine in TypeScript — and proves or breaks two kinds of invariant. Single-field ("once X, always X") becomes a one-step check handed to Z3, Microsoft Research's solver: given any transition out of X, is the next state still X? Cross-field ("these two columns can't contradict") composes two machines and hunts for a reachable forbidden joint state. One question, nothing fancier — can the machine reach a state it shouldn't? No timing, no "does it eventually recover," no infinite state; the slice most production machines actually need. Here it is catching the classic single-field bug:$ prune-states OrderBuggy --require app/models/order_buggy.rb \ --assert-terminal refunded Definition: OrderBuggy states: pending, paid, shipped, delivered, refunded invariants: 1 (1 asserted) ✗ RefundedIsTerminal — Asserted: once refunded, always refunded. counterexample: kind = "refunded" kind_next = "pending" 1 violation(s), 0 unknown(s), 0 error(s).Exit code 2 — CI catches it the day someone adds a
reactivateevent that re-opens refunded orders. You can run this exact check in your browser: the playground evaluates the same encoding live and shows the generated SMT-LIB.But the single-field "once X, always X" check isn't the dangerous one. The bugs that actually cost money are cross-field: two status columns on one record that quietly contradict each other. An invoice marked
voidedwhile itspayment_statusstill readssucceeded— money captured against a cancelled bill, invisible to the ledger. A return item markedaccepted(and refunded) while itsreception_statusis stillawaiting— cash out the door for goods that never came back.provechecks these by composing the two machines and asking whether the forbidden joint state is reachable — and if it is, handing back the shortest sequence of events that gets there:✗ VoidedIsNeverPaid — voided and payment=succeeded must not co-occur. counterexample (reachable from the initial state): status = draft, payment_status = pending status = finalized, payment_status = pending status = voided, payment_status = pending status = voided, payment_status = succeeded ← the bug, as a reachable pathI didn't invent those two examples. I pointed the tool at two widely-deployed open-source money systems — a billing engine and a commerce platform — and it surfaced exactly these contradictions. Neither is a unit-test toy: both reproduce at the deployed layer — a payment webhook handler and an admin controller route — red-to-green in the projects' own test suites, each closed by a one-line guard. (A candidate-miner does the finding: point it at a model's status fields and it ranks the pairs worth checking, so a human confirms a short list instead of reading every model.)
It runs on the real machines, not sanitized snippets — including booting Solidus's metaprogrammed
define_state_machine!to verify theOrdergraph that actually ships — and it's honest about its own limit: one invariant comes back a false positive because arecalculatecallback re-transitions behind the graph's back, invisible to declaration-level extraction. (Full dogfood runs — including GitLab'sCi::Pipeline— are in the repo.) -
cover— the Step 3 tool: turns a2^Nflag space into the dozen-odd covering-array tests the NIST interaction rule says actually matter. Pure TypeScript and Ruby — no PICT, no Java, no native binary — with an independently-verified coverage report.
It's all intentionally small. The point was never the perfect tool — it's to make these cuts land in your codebase before you write the next boolean. Because every boolean you write today is a Friday night you won't get back.
The political problem
Everything so far is technique: discriminated unions, covering arrays, database constraints, model checkers. All useful. None of them solve the actual problem.
The actual problem is that someone in a meeting says "what if we let users decide?" and nobody in the room does the math on what that sentence costs in state space. The actual problem is that the PM wants "just one more option" and doesn't realize that one option doubles the configurations QA needs to verify. The actual problem is every system whose flexibility was its selling point and ended up being its downfall.
Your job as a programmer — the part they don't teach you, the part that isn't writing code — is to push back. Not because you're lazy. Not because you don't want to build things. Because you're the only person in the room who understands that complexity isn't linear. Two features don't make a system twice as complex. They make it four times as complex. Three features? Eight times. And nobody notices until it's too late, because each feature looks simple on its own.
This isn't unique to software. Every profession that's been around long enough develops an immune system against requested complexity. Editors cut the chapter the writer loves most, because every darling makes the rest of the book harder to read. Doctors refuse the antibiotic the patient Googled, because every prescription carries interactions the patient isn't pricing. The professional knows what the client doesn't: additions have invisible costs, and the costs compound. Programmers do the same job in a different vocabulary — everyone around you has a structural incentive to add; yours is to cut, and to back it up with the math.
And let's be honest about what "push back" costs, because the advice is cheap and the act isn't: it's political capital, spent against an incentive structure that rewards the person shipping the feature, not the person counting its states. I've watched correct math lose to a roadmap commitment. That's not a reason to skip the math — it's the reason to bring a number instead of an opinion. "I don't like this design" is dismissible. "This change costs 96 new states and here's the bug class they breed" at least forces someone to own the trade.
Better yet, make the math automatic. Give the state space a budget, the
way you budget bundle size: a CI bot that comments on the PR — "this change
adds two booleans to CheckoutProps: 32 representable states → 128."
One caveat from everyone who has watched bundle-size bots die: a gate
gets disabled the first time it blocks a roadmap feature. Ship it as
information, not enforcement — a number on the diff, a champion senior
enough to keep it normal — and it survives long enough to change what
"normal" means.
The data migration is cheaper than what comes after — cheaper than the bugs you can't reproduce because you can't reconstruct the state, cheaper than the escalations from configurations nobody anticipated, cheaper than the team you eventually staff to answer "what is this customer's setup even supposed to look like?"
Simple systems are not a luxury. They're a survival strategy.
Start cutting
You don't need the whole ladder on day one. Pick one cut and make it.
Today. Find one model with three-plus status booleans, or a status
string with no CHECK behind it. Replace the booleans with a discriminated
union, or add the constraint. One bad state that can no longer be spelled —
the cheapest cut there is, and you'll feel it.
This week. Take the cross-field invariant your app assumes but never
enforces — "a voided invoice is never paid," "an active subscription has a
customer" — and push it into the database: a CHECK, a partial unique index,
an EXCLUDE. The cut no service can route around.
This month. Point prune-states at a state machine you own and let it tell you which branch you forgot. Or give your state space a budget — the CI comment from the last section — so the next doubling lands on the diff instead of in production.
Each cut is small. Each deletes a class of bug for good. Stack a few and your weekends stop getting interrupted.
Steal these prompts. Drop one into your coding agent — Claude Code, Codex, Cursor. It reads your repo, so there's nothing to paste:
Find a model with several boolean or status fields and refactor it into a
discriminated union so the invalid combinations can't be constructed. Add the
Postgres CHECK / partial index that enforces the same invariant at the database
layer. Tell me which states you just deleted.Pick a model with two status fields and list every pair of values that's
representable but must never co-occur (e.g. voided + paid). For each, add the
assertion or test that fails the moment that combination becomes reachable.Find a cluster of feature flags read together. Which combinations are nonsense
or never tested? Collapse them into one typed config, and generate a
covering-array test set that hits every remaining pair.In Claude Code those three are one command:
/plugin marketplace add zernie/prune-states, then
/prune-states:make-states-unrepresentable — it finds the worst offender in
your repo and refactors it. The model proposes; the compiler and the CHECK
constraint judge.
We opened with a timeline forking out of control — yours, and now the model's. You can't read every branch it writes; that ship has sailed. But you were never going to win by reading faster. You win by making the bad branches impossible: a state that can't be spelled is a bug that can't ship, no matter who — or what — is holding the keyboard.
Prune the timeline. One cut at a time.
References
- Alexis King, Parse, Don't Validate (2019)
- Yaron Minsky, Effective ML (Jane Street, Effective ML talks at CUFP mid-2000s and the "Effective ML Revisited" blog post) — the OCaml-community version of "make illegal states unrepresentable," predating the F# write-up by a decade
- Scott Wlaschin, Making Illegal States Unrepresentable (F# for Fun and Profit) — the popular F#/TypeScript-era restatement
- Chris Krycho, Making Illegal States Unrepresentable in TypeScript
- Sean Goedecke, "Make invalid states unrepresentable" considered harmful — the counterargument the "When not to" section engages
- David Harel, Statecharts: A Visual Formalism for Complex Systems (Science of Computer Programming, 1987) — the source XState's hierarchical and parallel states descend from
- Doug Seven, Knight Capital — A DevOps Cautionary Tale
- SEC, In the Matter of Knight Capital Americas LLC (Order 34-70694, 2013) — primary source for the $440M trading loss and the $12M penalty
- Michael Barr, Toyota's Killer Firmware
- CNBC, Robinhood to pay $70 million for outages and misleading customers (2021) — the FINRA order behind the missing-invariant example
- Tianyin Xu et al., Hey, You Have Given Me Too Many Knobs! (FSE 2015) — misconfiguration as a leading cause of production outages
- Murali Krishna Ramanathan et al., Piranha: Reducing Feature Flag Debt at Uber (ICSE 2020)
- D. R. Kuhn, D. R. Wallace, A. M. Gallo, Software Fault Interactions and Implications for Software Testing (IEEE TSE 2004) — the NIST interaction rule behind Step 3
- Unleash, When Feature Flags Interact
- Hillel Wayne, Learn TLA+ — the modern on-ramp
- Hillel Wayne, Why Don't People Use Formal Methods? (2019) — the spec-writing barrier the "expensive half" section argues has now collapsed
- Chris Newcombe et al., How Amazon Web Services Uses Formal Methods (CACM 2015)
- Marc Brooker et al., Systems Correctness Practices at Amazon Web Services (CACM 2025) — the ten-years-later follow-up: P, property-based testing, and deterministic simulation alongside TLA+
- Leslie Lamport, The TLA+ Home Page
- Can LLMs Model Real-World Systems in TLA+? (ACM SIGOPS, 2026) — where the "Etcd spec that was really the Raft paper's appendix" comes from; why the TLC check, not the model, is ground truth