June 5, 2026

The invoice was dead. Then it took your money.

Tags: ai, typescript, software-engineering, state-management, feature-flags, formal-methods, postgres, formal-verification, rails

When a type can't make the cut
The spec got cheap to write — not cheap to trust
The branches that survived review
Where the cheap cut stops
The political problem
Start cutting
References

When a type can't make the cut

The invoice was voided. The money was already captured.

The type was right. Every test was green. There was no database constraint forbidding it — not because anyone decided to skip one, but because nobody knew this state could happen. It lived in the timing: void the invoice, then a payment webhook arrives late, confirming a capture already in flight. The webhook handler checked payment_succeeded? but not voided?. The write went through. The ledger showed nothing owed; a real charge sat unreconciled.

I found it by asking a checker whether that joint state was reachable. The path it handed back — draft → finalized → voided → voided + paid — was a sequence nobody had drawn on a whiteboard, and no test had ever simulated. That's the whole move: the checker doesn't just enforce the fence, it tells you which fence to build — which forbidden combination is actually reachable, so you know to go add the CHECK for it.

Types delete bad states. Constraints delete bad combinations. But you can only constrain the states you know exist — and neither touches the order events arrive in.

This is Part 2. Part 1 covered the two cheapest cuts — a discriminated union, a database constraint — that make illegal states unspellable. Here we handle the states you can't delete, only verify.

But some bugs don't fit in a type or a CHECK. Maybe the state space is genuinely huge. Sometimes the business won't let you shrink it. Or the rule is about the order moves happen in, not any single row. So you keep cutting, with a small ladder of verification tools that each cost more and buy a stronger guarantee than the last. They look unrelated — a state machine, a pairwise generator, a model checker — but every one is asking the same question with more rigor: can a bad state happen, and how sure are we? You don't run all four. You climb as far as your bugs make you and stop at the first rung that catches them. Cheapest first:

A four-step ladder labelled bottom to top — state machines, property tests, combinatorial, model checking — with blobs climbing and pruning dead twigs that thin out toward the leafy top: each step up is a stronger cut.

Step 1 — State machines

Someone sets isPublished to true. Forgets to clear isArchived. The type allows it, nothing checks it, and the "archived" banner appears over a live listing. The boolean got set; the invariant didn't.

A state machine makes that move unreachable before it happens — not by checking the value after the fact, but by declaring which moves exist at all. Draft can become published. Published can be archived. Archived can't go back to draft. Ever. Nothing in the codebase needs to "remember" the rule; the machine refuses anything not in the table, and there's no table entry for "published and archived at once."

The machine is more confident about this than the roadmap.

Step 1 in Ruby (AASM) and TypeScript (XState)

# Ruby with AASM
class Order < ApplicationRecord
  include AASM
  aasm do
    state :draft, initial: true
    state :published, :archived
 
    event(:publish) { transitions from: :draft, to: :published }
    event(:archive) { transitions from: [:draft, :published], to: :archived }
    # Note: no transition back from :archived. Once archived, terminal.
  end
end
 
order.publish!   # OK
order.archive!   # OK from draft or published
archived_order.publish!  # AASM::InvalidTransition — the machine said no

So now the machine is the documentation and the enforcement at once. The reachable states are finite, they live in one place, and anything not in the table just gets refused. You write the rules down once and the machine runs them for you. The graph is the type.
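For stacks without an AASM or XState equivalent, the same idea fits in a few lines. What follows is a minimal sketch in Python, assuming a hypothetical `Listing` class and a hand-written `TRANSITIONS` table (neither comes from a library). The table is the only place a move can be declared, so anything absent from it is refused.

```python
class InvalidTransition(Exception):
    """Raised when an event has no table entry for the current state."""


# (event, current state) -> next state. Anything not listed does not exist.
TRANSITIONS = {
    ("publish", "draft"): "published",
    ("archive", "draft"): "archived",
    ("archive", "published"): "archived",
    # No entry leaves "archived": it is terminal.
}


class Listing:
    def __init__(self):
        self.state = "draft"

    def fire(self, event):
        key = (event, self.state)
        if key not in TRANSITIONS:
            raise InvalidTransition(f"cannot {event} from {self.state}")
        self.state = TRANSITIONS[key]
        return self.state


listing = Listing()
listing.fire("publish")   # draft -> published
listing.fire("archive")   # published -> archived
try:
    listing.fire("publish")
except InvalidTransition as err:
    refusal = str(err)    # "cannot publish from archived"
```

Because there is a single `state` field instead of two booleans, "published and archived at once" is not a value the object can hold, whatever order the events arrive in.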

Step 2 — Sample the space

When the space is too big to list and you can't shrink it further, you sample it — two ways, one blunt and one sharp.

The blunt one is property-based testing, and I ignored it for years — writing good properties always felt like more work than just picking a few inputs by hand. That was a mistake. The pitch is simple: instead of "input X → output Y," you write down a property that should hold for any input, and the framework throws hundreds of random cases at it trying to break it — fast-check (TypeScript), Hypothesis (Python), PropCheck (Ruby); the stateful variants even walk random sequences of moves. It runs circles around the three cases you'd have picked yourself. But random has a blind spot: the one pair that sinks you is exactly the spot no throw ever lands.

I've shipped the other half of that blind spot, too: a bug that lived only in a state our test data never built. Our content came in two shapes with different time rules, and the seed scripts only ever generated one — so the other branch was never once exercised in CI, and an exception that could only happen there shipped green. The state space our tests explored was a strict subset of the one prod ran in.

The blob hurls a handful of knives at a big leafy tree — random test inputs. Several knives strike dead twigs and slice them off, but a few bare dead twigs are left untouched where no knife landed — the cases the random throws missed. Catches some, misses others: sampling's blind spot.
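The blunt approach can be sketched without any framework. The block below is a hand-rolled stand-in using only Python's standard library, not the API of fast-check, Hypothesis, or PropCheck (which add input shrinking and much smarter generation). `BooleanListing` is a hypothetical reconstruction of the two-boolean model from Step 1 with its bug left in, and the property is "never published and archived at once".

```python
import random


class BooleanListing:
    """The two-boolean model from Step 1, with its bug left in."""

    def __init__(self):
        self.is_published = False
        self.is_archived = False

    def publish(self):
        self.is_published = True   # bug: forgets to clear is_archived

    def archive(self):
        self.is_archived = True
        self.is_published = False


def invariant_holds(listing):
    # The property: never published and archived at the same time.
    return not (listing.is_published and listing.is_archived)


def find_counterexample(runs=200, max_steps=4, seed=0):
    """Replay random sequences of moves; return the first one that breaks the property."""
    rng = random.Random(seed)
    for _ in range(runs):
        listing = BooleanListing()
        sequence = []
        for _ in range(max_steps):
            move = rng.choice(["publish", "archive"])
            getattr(listing, move)()
            sequence.append(move)
            if not invariant_holds(listing):
                return sequence
    return None


counterexample = find_counterexample()
```

With only two moves the random walk trips over "archive, then publish" almost immediately. The blind spot appears when the breaking sequence is one narrow path among thousands of possible ones.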

The sharp alternative fixes that blind spot by construction — and the number is almost offensive.

Twenty feature flags is 1,048,576 possible configurations. Testing every pair of them takes ten test cases.

1,048,576 → 10. Every pair of twenty flags, covered in ten configs.

Not a thousand. Ten. That's not a trick — decades of fault data say failures cluster in low-order interactions. NIST's analysis of real bugs found that testing every pair catches 70–97% of bugs, and across every fault dataset they studied, none needed more than six factors interacting at once. So you don't cover a million configs; you cover the pairs, and the pairs fit in ten rows. A covering array is a small set of configs that guarantees every pair of values appears at least once. Need to catch a specific three-way combination? Bump the strength to t = 3 and it lands in the rows by construction.

Don't take it on faith — there's a real two-flag bug hiding below. Test six configs at random and watch it sometimes ship green; then run the six-row covering array and watch it get caught every time:

Covering arrays

32 configs → 6

dark-mode and pdf-export each work fine. Turn on both and the PDF renders white-on-white — invisible. The bug hides in one pair of flags, among 32 configs. Can you catch it?

#	dark-mode	pdf-export	new-nav	beta-banner	compact	Result
1	·	·	·	·	·	
2	1	1	1	1	·	🐛
3	1	1	·	·	1	🐛
4	·	·	1	1	1	
5	1	·	1	·	·	
6	·	1	·	1	·	

Legend: 1 = on, · = off, 🐛 = the bug

CAUGHT ✓ The covering array's 6 configs include every pair, so the bad pair is always in there.
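Both halves of that claim can be checked by hand. The sketch below copies the six rows shown above into Python, confirms that every pair of flag values appears in some row, and works out how often six random configs would miss the bug. The probability rests on an assumption about the demo: each random config is drawn independently with every flag a fair coin flip.

```python
from fractions import Fraction
from itertools import combinations

FLAGS = ["dark-mode", "pdf-export", "new-nav", "beta-banner", "compact"]

# The six-row covering array from the demo (1 = on, 0 = off).
ROWS = [
    (0, 0, 0, 0, 0),
    (1, 1, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1),
    (1, 0, 1, 0, 0),
    (0, 1, 0, 1, 0),
]


def uncovered_pairs(rows):
    """List every (flag, flag, value, value) combination that no row exercises."""
    missing = []
    for a, b in combinations(range(len(FLAGS)), 2):
        seen = {(row[a], row[b]) for row in rows}
        for va in (0, 1):
            for vb in (0, 1):
                if (va, vb) not in seen:
                    missing.append((FLAGS[a], FLAGS[b], va, vb))
    return missing


def bug_fires(row):
    # The bug from the demo: dark-mode and pdf-export both on.
    return row[0] == 1 and row[1] == 1


missing = uncovered_pairs(ROWS)                      # [] means all pairs covered
rows_hitting_bug = [i + 1 for i, row in enumerate(ROWS) if bug_fires(row)]

# Assumption: independent configs, each flag a fair coin flip.
# One config has both flags on with probability 1/4, so six in a row miss with (3/4)^6.
miss_probability = Fraction(3, 4) ** 6               # 729/4096, roughly 18%
```

Under that assumption, six random configs ship the bug green about one run in six, while the six-row array cannot miss it.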

Almost nobody does this: GitLab's open-source repo ships ~500 feature flags, tests them one at a time, with no combinatorial strategy at all — and they're more disciplined about flags than most of us. You're in good company. So is the bug.

Generators, constraints, and the one caveat

The generators are off-the-shelf — NIST's own ACTS and Microsoft's PICT on the command line, or a one-line library right inside your test runner.

Three booleans (eight combinations) need just four rows to hit all pairs:

Row	A	B	C
1	0	0	0
2	0	1	1
3	1	0	1
4	1	1	0

Check it: every pair of columns shows all four of 00, 01, 10, 11. The magic is how that scales. Covering-array size grows with the logarithm of the parameter count, and exponentially only in t — the interaction strength you're covering — not in the number of parameters. That's why twenty flags collapse to a handful of rows instead of a million.
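For on/off flags the scaling can be seen with a hand-built construction. This is a sketch, not what PICT or ACTS do internally. Row 0 turns every flag off; each flag is then switched on in its own distinct set of the remaining rows, with every set the same size and larger than half of those rows. Any two such sets overlap (both on) and each contains a row the other lacks (on/off and off/on), while row 0 supplies off/off. `pairwise_rows` and `covers_all_pairs` are hypothetical helper names.

```python
from itertools import combinations
from math import comb


def pairwise_rows(n_flags):
    """Build a pairwise covering array for n_flags on/off flags."""
    n_rows = 2
    # Grow until there are enough distinct "on" sets to give each flag its own.
    while comb(n_rows - 1, (n_rows + 1) // 2) < n_flags:
        n_rows += 1
    size = (n_rows + 1) // 2
    # Each flag is on in `size` of the rows 1..n_rows-1; row 0 stays all-off.
    on_sets = list(combinations(range(1, n_rows), size))[:n_flags]
    return [
        tuple(1 if row in on_sets[flag] else 0 for flag in range(n_flags))
        for row in range(n_rows)
    ]


def covers_all_pairs(rows):
    """True when every pair of columns shows all four of 00, 01, 10, 11."""
    width = len(rows[0])
    return all(
        len({(row[a], row[b]) for row in rows}) == 4
        for a, b in combinations(range(width), 2)
    )


sizes = {n: len(pairwise_rows(n)) for n in (3, 5, 20, 40)}
# sizes == {3: 4, 5: 6, 20: 8, 40: 9}
```

Doubling from twenty flags to forty adds a single row. Generators such as PICT use general-purpose heuristics that also handle multi-valued parameters and constraints, so the row count they print can differ from this hand-built one.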

You write one text file — PICT, Microsoft's generator, is the usual default:

dark-mode-for-real:   on, off
ai-everything:        on, off
temporary-fix-2019:   on, off
layout:               single-page, multi-step
 
IF [ai-everything] = "on" THEN [dark-mode-for-real] = "off";

pict model.txt prints the rows; each row is one test config. Constraints (the IF/THEN) keep it from generating combinations you've already made illegal. NIST's own ACTS tool does the same up to 6-way, with a GUI or CLI.

The easiest path skips the shell-out entirely: a covering-array library generates the rows in-process and hands them straight to your test runner's parametrize hook — no binary, no parsing. In Python with allpairspy:

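from allpairspy import AllPairs
import pytest
 
# One list of values per flag; AllPairs yields each row in this same order.
flags = {
    "dark_mode_for_real": [True, False],
    "ai_everything":      [True, False],
    "temporary_fix_2019": [True, False],
}
 
# Materialize the generator so pytest collects a plain list of rows.
@pytest.mark.parametrize("cfg", list(AllPairs(list(flags.values()))))
def test_checkout(cfg):
    dark_mode_for_real, ai_everything, temporary_fix_2019 = cfg
    # checkout_works is a placeholder for the code under test.
    assert checkout_works(dark_mode_for_real, ai_everything, temporary_fix_2019)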

Three flags, all pairs, in the handful of cases the table above predicts — just a decorator, no new infrastructure. Pick whichever fits your stack:

Where you run it	Tool
CLI, any language	PICT · NIST ACTS (≤ 6-way)
Python test runner	allpairspy
TypeScript / browser	covertable · pict-pairwise-testing

One mechanical caveat: pairwise is still sampling. It provably covers every pair, but a fault that only fires on a specific three- or four-way combination can slip through (a medical-device study found a real four-way one). For anything safety-critical, raise the order to 3 (PICT's /o:3 option); the cost is more rows, not a different tool.

And sampling is not proof: combinatorial testing is the consolation prize. If you can collapse those flags into a discriminated union so the nonsense combinations can't exist, do that; a bug that can't be represented needs no test. Covering arrays are for the irreducible remainder (the legacy config, the third-party surface you don't control), and even there a covering array only guarantees a fixed interaction order across a fixed set of parameters, not every state your system can reach. The four-way bug slips past a pairwise array; the bug that needs a specific sequence of transitions was never an "interaction" at all. Which raises the question: is there a way to stop sampling and prove the property holds across every reachable state?

Well. There is.

Step 3 — Model checking (formal verification)

One bug walks through every cut so far. It passes the type checker, survives the property test's few hundred rolls, and reads clean in review — because it isn't in any single line of code. It's in the order two events arrive in, the interleaving nobody drew on the whiteboard. It waits, then pages you at 3am, where it's most expensive to find. This is the bug that survives every test.

Model checking is built for exactly this bug. Property tests sample; a model checker walks the whole space. You hand it a precise description of your system, every state and every move, and it visits every reachable state, then drops the exact sequence that breaks your rule on your desk. There's a catch: it only ever explores the description you gave it, not your actual code, so it's only as good as how honestly that description matches what you shipped.

AWS has used TLA+ on S3 and DynamoDB since 2011. seL4, a kernel proved correct down to the assembly, flies in drones. AWS's 2025 follow-up is blunt that TLA+ alone didn't scale to everything: it now sits inside a whole toolkit, alongside P, property-based testing, and deterministic simulation. The lesson: pick the cheapest tool that exhausts your state space.

The smallest useful example fits on a napkin. Take a Todo with four states — active, done, archived, deleted — and one rule: once deleted, always deleted. TLC walks every reachable state and confirms it holds across all of them, not a handful of cases.

The payoff lands when a teammate adds an innocent-looking "undo delete" transition. It compiles fine in any language; a property test might miss it. TLC catches it instantly — and doesn't just say "failed." It hands back the exact shortest sequence that breaks the rule: active → deleted → active. A model checker doesn't return a red X; it returns a movie of how the system breaks. Play it below: void an invoice and a payment already in flight lands late, clawing it out of its grave stamped PAID. Whack the zombies as they pop — you'll miss the ones you weren't watching — then hit the X-ray and watch the model checker prove the one order that pays a dead invoice, every time:

Model checking

Whack-a-Ghost

Void an invoice and it's dead — buried. But a payment already in flight lands late, and the invoice claws out of its grave stamped PAID. Whack the zombies before they escape. (You can't watch all five graves at once…)


You don't need to read a line of TLA+ to get this.
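The core move is small enough to sketch in plain Python. This is a breadth-first search standing in for TLC, not TLA+ itself. The Todo's transition table is an assumption (the article names the four states but not every move), and `find_violation` is a hypothetical helper that returns the shortest trace whose last step leaves `deleted`.

```python
from collections import deque

# Assumed transition table for the Todo; "deleted" is meant to be terminal.
TRANSITIONS = {
    "active": ["done", "deleted"],
    "done": ["active", "archived", "deleted"],
    "archived": ["deleted"],
    "deleted": [],
}


def find_violation(transitions, initial="active"):
    """Explore every reachable state; return the shortest trace that leaves 'deleted'."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in transitions[state]:
            if state == "deleted" and nxt != "deleted":
                # Rule broken: rebuild the path from the initial state to this step.
                trace = [nxt]
                while state is not None:
                    trace.append(state)
                    state = parent[state]
                return trace[::-1]
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


safe = find_violation(TRANSITIONS)                       # None: the rule holds
with_undo = {**TRANSITIONS, "deleted": ["active"]}       # the innocent-looking change
broken = find_violation(with_undo)                       # ["active", "deleted", "active"]
```

Breadth-first order is what makes the returned trace the shortest one: states are visited nearest-first, so the first violation found is also the closest to the initial state.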

On a Todo this looks like overkill — for plain CRUD, a discriminated union already catches almost everything a checker would. That's exactly why it gets waved off as academic, and exactly the wrong lesson: the bugs that survive every cheaper cut don't live in CRUD. They live in the next section's territory.

The spec got cheap to write — not cheap to trust

The cost of formal methods was never the checker — TLC and Z3 are free and finish in seconds at this scale. The cost was the spec: somebody with rare knowledge had to write it. An LLM now drafts it, and a lightweight checker like TLC or Z3 replaces the years-long proof for everything short of a spacecraft.

A little robot blob on wheels hands a finished blueprint scroll labelled "SPEC" to the main blob, who looks pleasantly surprised; scissors and a tidy pile of pruned dead twigs rest nearby — the expensive half used to be writing the spec by hand, and now it arrives done.

That barrier just collapsed, from two directions at once.

From above: LLMs write passable TLA+ now, and there are off-the-shelf Claude Code skills for exactly this loop — write the spec, run TLC, debug the violation. One discipline makes it safe: the checker, not the model, is the ground truth. An LLM will happily emit a plausible spec that's secretly the textbook's Paxos rather than your system. A wrong spec that goes red costs minutes; the quiet risk is the one that goes green, because a checker only vouches for the spec it was given. So generation doesn't remove the human — it moves your job from writing the spec to reading it: minutes, not years.

From below: if your state machine already exists in code, the spec doesn't need to be written or generated — it can be extracted, mechanically, with no model in the loop at all. Your aasm block, your XState machine, your reducer already are the spec; a few hundred lines of extractor lift them into a checker's input format. That's the chore I automated — you can try the idea right now in the browser playground.
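To give a feel for how mechanical that lift is, here is a rough sketch that renders a transition table as TLA+ text. It assumes the transitions have already been read out of the code into `(from, to)` pairs. `to_tla` is a hypothetical helper, it is not how prune-states works, and the emitted module is shown as text only (it has not been run through TLC here).

```python
AND = "/\\"   # TLA+ conjunction
OR = "\\/"    # TLA+ disjunction


def to_tla(module, variable, initial, transitions):
    """Render (from, to) pairs as a small TLA+ module with Init, Next, and Spec."""
    steps = [
        f'    {OR} ({variable} = "{src}" {AND} {variable}\' = "{dst}")'
        for src, dst in transitions
    ]
    lines = [
        f"---- MODULE {module} ----",
        f"VARIABLE {variable}",
        f'Init == {variable} = "{initial}"',
        "Next ==",
        *steps,
        f"Spec == Init {AND} [][Next]_{variable}",
        "====",
    ]
    return "\n".join(lines)


spec = to_tla(
    module="Listing",
    variable="state",
    initial="draft",
    transitions=[
        ("draft", "published"),
        ("draft", "archived"),
        ("published", "archived"),
    ],
)
print(spec)
```

The invariant is deliberately absent from the output: the table says what the system can do, and deciding what it must never do remains a human call.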

Either way the bottleneck moved — it didn't vanish, it changed shape. The LLM drafts the spec, turns your code into valid TLA+, and can even float candidate invariants for a checker to accept or refute. What it won't do is tell you which property is worth asserting — and that's the design act, the actual job. A 2026 benchmark (Can LLMs Model Real-World Systems in TLA+?) of frontier models writing TLA+ from real system code draws the line cleanly: they nail the syntax nearly every time, but a human still has to hand them the invariant to check — and even then the spec faithfully matches the running system less than half the time. The machine writes it down; you still decide what "correct" means and read what came back. That is the part that went from rare expertise to a habit you reach for — an afternoon, not a PhD. (A PhD was always overkill for checking whether a voided invoice can still be paid.)

The branches that survived review

The bugs that actually cost money are cross-field: an invoice marked voided while its payment_status still reads succeeded — money captured against a cancelled bill, invisible to the ledger. No single line is wrong; the two status fields just have no rule forbidding the combination. Compose the two state machines and ask whether that forbidden joint state is reachable, and you get back the shortest path that leads there:

✗ VoidedIsNeverPaid — voided and payment=succeeded must not co-occur.
    counterexample (reachable from the initial state):
      status = draft,     payment_status = pending
      status = finalized, payment_status = pending
      status = voided,    payment_status = pending
      status = voided,    payment_status = succeeded   ← the bug, as a reachable path

That path is the whole point. You get the exact sequence of moves that lands on the illegal state, so the bug shows up with its reproduction steps already attached.
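The composition step can be sketched in a few lines of Python. The guards below are assumptions chosen to mirror the story, not the rules of any real billing engine: an invoice can only be voided while its payment is pending, and the late webhook marks the payment succeeded without looking at the invoice status. `next_states` and `shortest_path_to` are hypothetical names.

```python
from collections import deque

INITIAL = ("draft", "pending")   # (status, payment_status)


def next_states(state, guard_webhook=False):
    """All joint states reachable in one move from `state`."""
    status, payment = state
    moves = []
    if status == "draft":
        moves.append(("finalized", payment))
    if status == "finalized" and payment == "pending":
        moves.append(("voided", payment))
    # Payment webhook: may land at any time after the invoice is finalized.
    if payment == "pending" and status != "draft":
        if not (guard_webhook and status == "voided"):
            moves.append((status, "succeeded"))
    return moves


def shortest_path_to(is_bad, guard_webhook=False):
    """Breadth-first search over joint states; returns the shortest path or None."""
    parent = {INITIAL: None}
    queue = deque([INITIAL])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in next_states(state, guard_webhook):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def voided_and_paid(state):
    return state == ("voided", "succeeded")


path = shortest_path_to(voided_and_paid)                          # four joint states
fixed = shortest_path_to(voided_and_paid, guard_webhook=True)     # None
```

Setting `guard_webhook=True` models the one-line guard in the handler; with it the forbidden joint state drops out of the reachable set.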

I didn't invent that example. I pointed a small checker at two widely-deployed open-source money systems — a billing engine and a commerce platform — and it surfaced exactly these contradictions. A reachable path is a candidate, not a bug — the checker works from the declared state machine, so a guard it can't see (a callback, a conditional write) can make a "reachable" state unreachable in the running system. That's not hand-waving: it happened to me, and it's the wall this technique hits. So a candidate earns the word "bug" only when you reproduce it against the real thing — which both of these were, each at the deployed layer (a payment-webhook handler, an admin route), each closing with a one-line guard.

And the bug class isn't mine to claim — an ISSTA 2011 study found cross-field data-modeling bugs of exactly this shape in two real Rails apps, fifteen years before I pointed a solver at it. These are latent bugs — surfaced by bounded checking, not live outages. They're also real money paths.

"Isn't this just model checking, or a schema linter?"

Fair — and the answer is in the framing. Lifting a model from code and checking it is old: Rails data models were bounded-checked in 2011. But the verifiers you can install today either make you hand-write the spec, or they auto-extract a model only to hunt generic crashes — deadlocks, races, null derefs — not your business status fields. And the "drift" tools that sound similar (prisma migrate diff, active_record_doctor, Hibernate validate) check whether your schema matches the database — not whether your validations match your constraints, a different drift one layer up. The unoccupied bit: read the status machine out of your aasm/XState and prove a forbidden cross-field state (un)reachable.

The check itself is a small open-source tool, prune-states: extract the state machine you already have, then ask whether a bad state is reachable. But the tool is incidental; the point is to make these cuts land in your codebase before you write the next boolean.

Where the cheap cut stops

Every static technique here cuts one shape of bug: a state that can't be represented (the union) or can't be reached (the checker) — real, common, and in 2026 nearly free to fence off. But it sits on one side of a wall worth naming.

The dangerous bugs are usually dataflow: a field written on a path that skipped the guard, a callback that re-transitions behind the model's back, a balance recomputed wrong. prune-states hit that wall itself — an invariant came back a false positive because a callback mutated state the declaration-level model couldn't see. The bug isn't a reachable state; it's a write that reached a legal state by the wrong path. Tracking that statically — write-site and consumer analysis — is, in 2026, still a real build, not a one-liner.

So you don't reach for it — you change layers, from compile time to run time, where the path stops mattering. A database CHECK fires on the write itself, so a wrong path landing on an illegal row is caught however it got there — the constraint checks the write, not the path that produced it. What still slips is the wrong-but-legal write — a balance recomputed to a value the constraint accepts — and the concurrency that breeds it. Those aren't shapes you forbid, they're races you make safe: an idempotency key, a transactional outbox, the right isolation level.
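A small demonstration of "the constraint checks the write, not the path", using SQLite from Python's standard library as a stand-in for Postgres. The table layout, the constraint name, and the `late_webhook` function are illustrative assumptions, not taken from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        payment_status TEXT NOT NULL,
        CONSTRAINT voided_is_never_paid
            CHECK (NOT (status = 'voided' AND payment_status = 'succeeded'))
    )
    """
)
conn.execute("INSERT INTO invoices VALUES (1, 'finalized', 'pending')")
conn.execute("UPDATE invoices SET status = 'voided' WHERE id = 1")


def late_webhook(invoice_id):
    """A write path with no voided? guard, like the handler in the opening story."""
    try:
        conn.execute(
            "UPDATE invoices SET payment_status = 'succeeded' WHERE id = ?",
            (invoice_id,),
        )
        return "written"
    except sqlite3.IntegrityError:
        # The database refused the row; the handler never had to know why.
        return "rejected"


outcome = late_webhook(1)
row = conn.execute(
    "SELECT status, payment_status FROM invoices WHERE id = 1"
).fetchone()
```

The constraint never learns which handler issued the UPDATE; it only sees the row that write would produce.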

So you don't give up at dataflow. You switch tactics: stop trying to prove the path is safe, and start watching the write itself. It's the same job either way. Name the invariant, then hand it to something more reliable than a careful programmer.

The political problem

We were building a CRM, and someone in a product meeting asked: what if we let users configure everything through a visual UI — triggers, actions, even the table schemas? Not just the values, the whole data model. The case was reasonable — customers want flexibility, and letting PMs configure it visually meant non-engineers could own a surface without filing a ticket. Nobody said no, because nobody in the room was thinking about state spaces. They were thinking about user agency, which is a genuinely good thing.

In practice, "users configure the data model" meant every combination of trigger, action chain, and schema was a product state we had to handle — not a large space, an unbounded one, a new topology per customer. We spent months on infrastructure for flexibility nobody needed. And the whole no-code argument that justified it evaporated the moment the tool for non-engineers to write code got good: now a PM describes a schema to an LLM and has it in thirty seconds. The complexity outlasted the argument that built it.

That's the shape of the political problem — not a wrong decision, but a reasonable argument made before anyone pulled a number. Out of the Tar Pit put the cost precisely: every bit of state you add doubles the possible states; complexity compounds, it doesn't sum. But a one-time "this adds unbounded states" argument is spent capital against an incentive structure that rewards whoever shipped the feature, not whoever counted its states. You don't win by arguing harder — you win by making the cost ambient.

Give the state space a budget the way you budget bundle size: a CI bot that comments "this PR adds two booleans to CheckoutProps: 32 states → 128." Ship it as information, not a gate — a gate gets disabled the first time it blocks a roadmap feature; a number in the PR survives long enough to change what "normal" means. It's cheaper than what comes after: bugs you can't reproduce, and the team you eventually staff to answer "what is this customer's setup even supposed to look like?"
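The number in that comment is just a product of cardinalities. A minimal sketch, assuming `CheckoutProps` starts with five boolean fields (to match the 32) and that the field names are hypothetical; how a real bot would read the types out of a diff is left out.

```python
from math import prod


def state_count(fields):
    """Representable combinations: the product of each field's number of values."""
    return prod(fields.values())


# Field name -> how many values it can take. All names here are made up.
before = {
    "is_guest": 2,
    "has_coupon": 2,
    "is_gift": 2,
    "save_card": 2,
    "is_mobile": 2,
}
after = {**before, "is_express": 2, "is_preorder": 2}

added = len(after) - len(before)
comment = (
    f"this PR adds {added} booleans to CheckoutProps: "
    f"{state_count(before)} states → {state_count(after)}"
)
```

A three-valued enum would multiply the count by three instead of two, which is why the helper takes cardinalities and not just a number of flags.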

Start cutting

If you haven't made the static cuts from Part 1, start there. This is the verification layer.

This week. Take the feature flags your test suite covers one at a time. Generate a covering array. A handful of rows instead of a million configs; the pair that breaks you lands in there by construction. If you can collapse those flags into a discriminated union instead, do that first: a bug that can't be spelled needs no test.

This month. Pick one state machine you already have — AASM, XState, a reducer — and write down the cross-field invariant it assumes but never enforces. Check whether that state is reachable. By hand, or with a small checker. Give your state space a budget: a CI comment that surfaces "this PR adds two booleans: 32 states → 128" before anyone merges.

Each cut is small. Each deletes a class of bug for good. Stack a few and your weekends stop getting interrupted.

We opened Part 1 with a timeline forking out of control: yours, and now the model's. Two articles later, the bad states can't be spelled and the ones that survive can't be reached. A branch the type can't build can't ship; a state the checker can't reach can't break prod. What passes every test no longer gets to take down prod.

Prune the timeline. One cut at a time.

References

Sources & further reading

Murali Krishna Ramanathan et al., Piranha: Reducing Feature Flag Debt at Uber (ICSE 2020)
D. R. Kuhn, D. R. Wallace, A. M. Gallo, Software Fault Interactions and Implications for Software Testing (IEEE TSE 2004) — the NIST interaction rule behind the covering-array section
Unleash, When Feature Flags Interact
Hillel Wayne, Learn TLA+ — the modern on-ramp
Hillel Wayne, Why Don't People Use Formal Methods? (2019) — the spec-writing barrier the "spec got cheap to write" section argues just dropped
Chris Newcombe et al., How Amazon Web Services Uses Formal Methods (CACM 2015)
Marc Brooker et al., Systems Correctness Practices at Amazon Web Services (CACM 2025) — the ten-years-later follow-up: P, property-based testing, and deterministic simulation alongside TLA+
Leslie Lamport, The TLA+ Home Page
Can LLMs Model Real-World Systems in TLA+? (ACM SIGOPS, 2026) — where the "Etcd spec that was really the Raft paper's appendix" comes from; why the TLC check, not the model, is ground truth
Travis Hance, Marijn Heule, Ruben Martins, Bryan Parno, Finding Invariants of Distributed Systems: It's a Small (Enough) World After All (NSDI 2021) — the first automated safety proof of Paxos
Jaideep Nijjar, Tevfik Bultan, Bounded Verification of Ruby on Rails Data Models (ISSTA 2011) — mechanically extracts Active Record models into Alloy and finds real cross-field data-modeling bugs in two production Rails apps
Maysam Yabandeh, Abhishek Anand, Marco Canini, Dejan Kostić, Finding Almost-Invariants in Distributed Systems (SRDS 2011) — mines properties that almost always hold from live system traces
Ben Moseley, Peter Marks, Out of the Tar Pit (2006) — the canonical argument that state is the primary source of complexity
Chris Hawblitzel et al., IronFleet: Proving Practical Distributed Systems Correct (SOSP 2015) — the 85-line spec / 3.7-person-year proof

5 июня 2026 г.

Инвойс был мёртв. А потом списал твои деньги.

aitypescriptsoftware-engineeringstate-managementfeature-flagsformal-methodspostgresformal-verificationrails

Содержание

Когда тип не тянет подрезку
Писать спеку стало дёшево — доверять ей нет
Ветки, что пережили ревью
Где дешёвая подрезка упирается в стену
Политическая проблема
Начни подрезать
Источники

Когда тип не тянет подрезку

Инвойс аннулировали. Деньги при этом уже списали.

Тип был правильный. Все тесты зелёные. Не было ни одного ограничения в базе, которое бы это запрещало, — и не потому, что кто-то решил его не ставить, а потому что никто не знал, что такое состояние вообще возможно. Он жил во времени: аннулируешь инвойс — а потом с опозданием приходит платёжный вебхук и подтверждает списание, которое уже было в пути. Обработчик вебхука проверял payment_succeeded?, но не voided?. Запись прошла. В реестре — ноль к оплате; а реальное списание висело несведённым.

Я нашёл его, спросив чекер, достижимо ли это совместное состояние. Путь, который он выдал в ответ, — draft → finalized → voided → voided + paid — был последовательностью, которую никто не рисовал на доске и которую не проигрывал ни один тест. В этом весь приём: чекер не просто держит забор — он подсказывает, какой забор строить: какая именно запрещённая комбинация на самом деле достижима, чтобы ты знал, под какую из них добавить CHECK.

Типы удаляют плохие состояния. Ограничения удаляют плохие комбинации. Но ограничить можно только те состояния, о существовании которых ты знаешь, — и ни то ни другое не трогает порядок, в котором приходят события.

Это Часть 2. Часть 1 разобрала две самые дешёвые подрезки — размеченное объединение и ограничение в базе, — которые делают недопустимые состояния невыразимыми. Здесь мы берёмся за состояния, которые нельзя удалить, — только проверить.

Но не всё влезает в тип или CHECK. Пространство состояний бывает и правда огромным. Иногда бизнес не даёт его ужать. А иногда правило — про порядок переходов, а не про одну строку. Поэтому ты режешь дальше — небольшая лестница инструментов верификации, где каждый следующий дороже и покупает гарантию сильнее предыдущего. С виду они несвязанные — конечный автомат, генератор пар, чекер моделей, — но каждый задаёт один и тот же вопрос со всё большей строгостью: может ли случиться плохое состояние и насколько мы в этом уверены? Все четыре гонять не нужно: лезешь вверх ровно настолько, насколько заставляют твои баги, и останавливаешься на первой же ступени, которая их ловит. Сначала — самые дешёвые:

Лестница из четырёх ступеней с подписями снизу вверх — state machines, property tests, combinatorial, model checking — по ней карабкаются блобы и срезают сухие ветки, которые редеют к зелёной верхушке: каждый шаг вверх — подрезка посильнее.

Шаг 1 — Конечные автоматы

Кто-то выставляет isPublished в true. Забывает сбросить isArchived. Тип это позволяет, никто не проверяет — и над живым объявлением вылезает баннер «в архиве». Булево выставили; инвариант — нет.

Конечный автомат делает этот переход недостижимым ещё до того, как он случится, — не проверяя значение задним числом, а объявляя, какие переходы вообще существуют. Черновик может стать опубликованным. Опубликованный — уйти в архив. Архивный вернуться в черновики не может. Никогда. Ничему в кодовой базе не нужно «помнить» правило: автомат отвергает всё, чего нет в таблице, а строки «опубликован и в архиве разом» в таблице попросту нет.

Автомат в этом вопросе уверен куда больше, чем роадмап.

▶Шаг 1 на Ruby (AASM) и TypeScript (XState)

# Ruby с AASM
class Order < ApplicationRecord
  include AASM
  aasm do
    state :draft, initial: true
    state :published, :archived
 
    event(:publish) { transitions from: :draft, to: :published }
    event(:archive) { transitions from: [:draft, :published], to: :archived }
    # Обрати внимание: обратного перехода из :archived нет. Архив — терминальное состояние.
  end
end
 
order.publish!   # OK
order.archive!   # OK из draft или published
archived_order.publish!  # AASM::InvalidTransition — автомат сказал нет

И вот теперь автомат — это разом и документация, и обеспечение. Достижимые состояния конечны, лежат в одном месте, а всё, чего нет в таблице, просто отвергается. Правила ты записываешь один раз — дальше их гоняет за тебя автомат. Граф и есть тип.

Шаг 2 — Сэмплируй пространство

Когда пространство слишком велико, чтобы перечислить его целиком, а ужать сильнее уже нельзя, — его сэмплируют, и есть два способа: один грубый, другой точный.

Грубый — это property-based тестирование, и я сам годами на него забивал: написать хорошие свойства всегда казалось возни больше, чем просто подобрать пару входов руками. И зря. Фишка простая: вместо «вход X → выход Y» ты объявляешь свойство, которое должно держаться для любого входа, а фреймворк забрасывает его сотнями случайных кейсов, пытаясь сломать, — fast-check (TypeScript), Hypothesis (Python), PropCheck (Ruby); stateful-варианты ещё и бродят по случайным последовательностям переходов. Он играючи обходит те три кейса, что ты выбрал бы сам. Но у случайности есть слепое пятно: та самая пара, что тебя топит, — ровно то место, куда ни один бросок так и не попадает.

Другую половину этого слепого пятна я тоже отгружал: баг, который жил только в состоянии, которого наши тестовые данные ни разу не создавали. Контент приходил в двух видах с разными правилами по времени, а сид-скрипты генерировали только один — так что вторая ветка ни разу не прогонялась в CI, и исключение, возможное только в ней, уехало зелёным. Пространство состояний, которое обходили наши тесты, было строгим подмножеством того, в котором работает прод.

Блоб швыряет в большое лиственное дерево горсть ножей — случайные входные данные. Несколько ножей попали в сухие сучья и срезали их, но пара других голых сухих сучьев осталась нетронутой там, куда ни один нож не долетел, — случаи, которые случайные броски прозевали. Что-то срезал, что-то пропустил: слепое пятно сэмплирования.

Точный способ закрывает это слепое пятно по построению — и число почти неприличное.

Двадцать фича-флагов — это 1 048 576 возможных конфигураций. Проверить каждую их пару можно за десять тестов.

1 048 576 → 10. Каждая пара из двадцати флагов — в десяти конфигах.

Не тысяча. Десять. И это не фокус — десятилетия данных о сбоях говорят: сбои кучкуются во взаимодействиях немногих факторов. Разбор реальных багов от NIST показал, что проверка каждой пары ловит 70–97 % багов, и по всем изученным ими наборам данных о сбоях ни одному не требовалось больше шести взаимодействующих факторов разом. Так что покрываешь ты не миллион конфигов, а пары — и пары умещаются в десять строк. Покрывающий массив — это небольшой набор конфигураций, гарантирующий, что каждая пара значений встретится хотя бы раз. Нужно поймать конкретную тройную комбинацию? Подними силу до t = 3 — и она ляжет в строки по построению.

Don't take my word for it: a real two-flag bug is hidden below. Run six random configs and watch it sometimes ship green; then run the six-row covering array and it gets caught every time:

Covering arrays

32 configs → 6

dark-mode and pdf-export each work on their own. Turn both on and the PDF renders white on white, invisible. The bug hides in one pair of flags, among 32 configs. Can you catch it?

#	dark-mode	pdf-export	new-nav	beta-banner	compact	Result
1	·	·	·	·	·
2	1	1	1	1	·	🐛
3	1	1	·	·	1	🐛
4	·	·	1	1	1
5	1	·	1	·	·
6	·	1	·	1	·

1 = on, · = off, 🐛 = bug

CAUGHT ✓ The covering array's 6 configs include every pair, so the bad pair is always in there.
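The demo's two claims can be checked by hand. The sketch below re-types the six rows shown above, verifies that they cover every pair, and computes how often six random configs miss the bad pair. The independence assumption is mine: the demo may sample differently.

```python
from fractions import Fraction
from itertools import combinations, product

FLAGS = ["dark-mode", "pdf-export", "new-nav", "beta-banner", "compact"]

# The six rows from the demo above (1 = on, 0 = off).
COVERING_ARRAY = [
    (0, 0, 0, 0, 0),
    (1, 1, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (0, 0, 1, 1, 1),
    (1, 0, 1, 0, 0),
    (0, 1, 0, 1, 0),
]

def covers_all_pairs(rows) -> bool:
    """True if every pair of columns shows all four value pairs 00, 01, 10, 11."""
    wanted = set(product((0, 1), repeat=2))
    return all(
        {(row[i], row[j]) for row in rows} == wanted
        for i, j in combinations(range(len(FLAGS)), 2)
    )

def has_bug(row) -> bool:
    """The hidden bug: dark-mode and pdf-export both on."""
    return row[0] == 1 and row[1] == 1

# Assumption: each random config is an independent uniform draw, so it hits
# the bad pair with probability 1/4 and misses it with probability 3/4.
miss_probability = Fraction(3, 4) ** 6

print(covers_all_pairs(COVERING_ARRAY))            # True
print(any(has_bug(r) for r in COVERING_ARRAY))     # True
print(miss_probability, float(miss_probability))   # 729/4096 0.177978515625
```

Under that assumption, six random configs ship the bug green roughly one run in six, while the covering array cannot miss it.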

Almost nobody does this: GitLab's open repository carries around five hundred feature flags, tests them one at a time, and has no combinatorial strategy at all, and they are more careful with flags than most of us. You're in good company. So is the bug.

Generators, constraints, and one caveat

The generators are off the shelf: NIST's own ACTS and Microsoft's PICT on the command line, or a one-liner library right inside your test runner.

Three boolean flags (eight combinations) need only four rows to cover all pairs:

Check it: every pair of columns yields all four of 00, 01, 10, 11. The magic is in how this scales. The size of a covering array grows with the logarithm of the number of parameters and exponentially only in t, the interaction strength you are covering, not in the number of parameters. That is why twenty flags collapse to a handful of rows instead of a million.
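For the three-flag case the check is small enough to brute-force. The four rows below are one possible pairwise array, chosen for this sketch; the exact rows the article showed may differ.

```python
from itertools import combinations, product

ALL_CONFIGS = list(product((0, 1), repeat=3))   # the eight combinations of three flags

# One possible four-row pairwise array (an assumption, not necessarily the article's rows).
FOUR_ROWS = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def covers_all_pairs(rows) -> bool:
    """True if every pair of columns shows all four of 00, 01, 10, 11."""
    wanted = set(product((0, 1), repeat=2))
    return all(
        {(row[i], row[j]) for row in rows} == wanted
        for i, j in combinations(range(3), 2)
    )

# Exhaustive search: does any choice of only three configurations cover every pair?
three_row_solutions = [c for c in combinations(ALL_CONFIGS, 3) if covers_all_pairs(c)]

print(covers_all_pairs(FOUR_ROWS))   # True
print(len(three_row_solutions))      # 0
```

Four rows is also the floor: a single pair of columns already needs four distinct value pairs, so no three-row array can exist.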

You write one text file. PICT, Microsoft's generator, is the usual default:

dark-mode-for-real:   on, off
ai-everything:        on, off
temporary-fix-2019:   on, off
layout:               single-page, multi-step
 
IF [ai-everything] = "on" THEN [dark-mode-for-real] = "off";

pict model.txt prints the rows; each row is one test configuration. Constraints (that IF/THEN) keep it from generating combinations you have already declared illegal. NIST's own ACTS tool does the same up to 6-way, with a GUI or a CLI.

The simplest path skips the shell call entirely: a covering-array library generates the rows in-process and feeds them to your test runner's parametrize hook. No binary, no parsing. In Python with allpairspy:

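from allpairspy import AllPairs
import pytest

# Each flag and the values it can take; dict order fixes the column order.
flags = {
    "dark_mode_for_real": [True, False],
    "ai_everything":      [True, False],
    "temporary_fix_2019": [True, False],
}

# AllPairs yields one row per test configuration; every pair of values lands in some row.
@pytest.mark.parametrize("cfg", list(AllPairs(list(flags.values()))))
def test_checkout(cfg):
    dark_mode_for_real, ai_everything, temporary_fix_2019 = cfg
    # checkout_works is your own function under test; it is not defined here.
    assert checkout_works(dark_mode_for_real, ai_everything, temporary_fix_2019)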

Three flags, all pairs, in the same handful of cases the table above predicts: just a decorator, no new infrastructure. Pick what fits your stack:

Where you run it	Tool
CLI, any language	PICT · NIST ACTS (≤ 6-way)
Python test runner	allpairspy
TypeScript / browser	covertable · pict-pairwise-testing

One technical caveat: pairwise is still sampling. It provably covers every pair, but a failure that fires only on a specific three- or four-parameter combination can slip through (one study of a medical device found a real four-parameter one). For anything safety-critical, step up to 3-way with PICT's /o:3 option; the cost is extra rows, not a different tool.

And yet pairwise is sampling, not proof: combinatorial testing is a consolation prize. If the flags can be collapsed into a discriminated union so that the meaningless combinations simply cannot exist, do that: a bug that cannot be represented needs no test. Covering arrays are for the incompressible remainder: the legacy config, the third-party surface you don't control. Even there, a covering array only guarantees a fixed interaction strength over a fixed set of parameters, not every state your system can reach. A four-parameter bug slips past a pairwise array, and a bug that needs a specific sequence of transitions was never an "interaction" at all. Which raises the question: what if you could stop sampling altogether and prove that a property holds in every reachable state of your system?

Well. You can.

Step 3: Model checking (formal verification)

One bug passes through every previous cut. It type-checks, survives a couple hundred property-test throws, and reads clean in review, because it is not in any line of code. It is in the order two events arrive in, an interleaving nobody drew on a whiteboard. It waits, then wakes you at 3 a.m., when finding it costs the most. This is the bug that survives every test.

Model checking is built for exactly this bug. Property tests sample; a model checker walks the whole space. You give it a precise description of your system, every state and every transition, and it visits every reachable state, then puts on your desk the exact sequence that breaks your rule. There is a caveat, and we will come back to it: it walks only the description you gave it, not your real code, so it is only as good as that description is faithful to what you shipped.

AWS has run TLA+ on S3 and DynamoDB since 2011. seL4, a kernel proven correct down to assembly, flies in drones. AWS's 2025 follow-up is blunt: no single method scaled on its own, and what they ended up with is a whole toolbox (P, property-based testing, and deterministic simulation alongside TLA+). The lesson: reach for the cheapest tool that exhausts your state space.

The smallest useful example fits on a napkin. Take a Todo with four states (active, done, archived, deleted) and one rule: once deleted, deleted forever. TLC walks every reachable state and confirms the rule holds in all of them, not in a handful of cases.

The payoff shows when a colleague adds a harmless-looking "undo delete" transition. It compiles in any language, and a property test may well miss it. TLC catches it instantly, and it does not just say "failed". It hands back the exact shortest sequence that breaks the rule: active → deleted → active. A model checker returns not a red cross but a film of how the system breaks. Play with it below: void an invoice, and a payment that was already in flight arrives later and drags it out of the grave stamped PAID. Whack the zombies as they climb out (the ones you lose track of get away), then hit "X-ray" and watch the model checker prove the ordering that pays a dead invoice, every time:

Model checking

Whack the ghost (interactive demo)

Void an invoice: it is dead and buried. But a payment already in flight arrives later, and the invoice climbs out of the grave stamped PAID. Whack the zombies before they escape. (You can't watch all five graves at once…)

You don't need to read a single line of TLA+ to understand this.
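The Todo example can be sketched without TLA+ as a breadth-first search over the declared transitions. This is a toy stand-in for what TLC does, not TLC itself, and the transition table is an assumption made up for the sketch.

```python
from collections import deque

INITIAL = "active"

# Declared transitions of the Todo; the last entry is the colleague's "undo delete".
TRANSITIONS = {
    "active":   ["done", "archived", "deleted"],
    "done":     ["active", "archived", "deleted"],
    "archived": ["active", "deleted"],
    "deleted":  ["active"],
}

def violates(src: str, dst: str) -> bool:
    """The rule: once deleted, deleted forever."""
    return src == "deleted" and dst != "deleted"

def shortest_violation(transitions):
    """Breadth-first search over every reachable state.

    Returns the shortest rule-breaking path, or None if the rule holds everywhere.
    """
    paths = {INITIAL: [INITIAL]}
    queue = deque([INITIAL])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, []):
            if violates(state, nxt):
                return paths[state] + [nxt]
            if nxt not in paths:
                paths[nxt] = paths[state] + [nxt]
                queue.append(nxt)
    return None

print(" → ".join(shortest_violation(TRANSITIONS)))   # active → deleted → active

without_undo = {**TRANSITIONS, "deleted": []}
print(shortest_violation(without_undo))               # None
```

Breadth-first order is what makes the returned path the shortest one, and the exhaustive walk is what turns a `None` into "holds in every reachable state of this model" rather than "held in the cases we tried".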

On a Todo this looks like overkill: for ordinary CRUD, a discriminated union already catches almost everything the checker would find. That is exactly why it gets waved off as "academic", and that is exactly the mistake: the bugs that survive all the cheap cuts don't live in CRUD. They live in the territory of the next section.

The spec got cheap to write — not cheap to trust

The cost of formal methods was never in the checker: TLC and Z3 are free and, at this scale, finish in seconds. The cost was in the spec: someone with rare knowledge had to write it. Now an LLM drafts it, and a lightweight checker like TLC or Z3 replaces a multi-year proof for anything smaller than a spacecraft.

Illustration: a small robot blob on wheels hands the main blob a finished rolled-up blueprint labeled "SPEC", to its pleasant surprise; scissors and a neat pile of cut deadwood lie nearby. The expensive half was writing the spec by hand, and now it arrives ready-made.

That barrier just fell, from two sides at once.

From above: LLMs now write perfectly passable TLA+, and there are ready-made Claude Code skills for exactly this loop: write the spec, run TLC, dissect the violation. One discipline makes it safe: the checker is the source of truth, not the model. An LLM will happily produce a plausible spec that is quietly textbook Paxos rather than your system. A wrong spec that goes red costs minutes; the real risk is the one that goes green, because the checker vouches only for the spec it was fed. So generation does not remove the human. It shifts your work from writing the spec to reading it: minutes, not years.

From below: if your state machine already exists in code, the spec does not need to be written or generated. It can be extracted, mechanically, with no model in the loop at all. Your aasm block, your XState machine, your reducer already are the spec; a couple hundred lines of extractor lift them into the checker's input format. That is the chore I automated, and you can try the idea right now in the browser playground.

Either way, the bottleneck moved: it did not disappear, it changed shape. An LLM drafts the spec, turns your code into valid TLA+, and even proposes candidate invariants for the checker to accept or refute. What it won't do is tell you which property is worth asserting in the first place, and that is the design work, in fact the whole job. A 2026 benchmark (Can LLMs Model Real-World Systems in TLA+?), in which top models write TLA+ from the code of real systems, draws the line clearly: they nearly always get the syntax, but a human still supplies the invariant to check, and even then the spec faithfully matches the running system less than half the time. The machine writes it down; what "correct" means, and what actually came out, is still yours to decide and to proofread. That is the part that turned from rare expertise into a habit you reach for: an evening, not a PhD. (A PhD was always clear overkill for checking whether a voided invoice can still be paid.)

The branches that survived review

The bugs that actually cost money are cross-field: an invoice is marked voided while its payment_status is still succeeded, so money was captured on a cancelled invoice, off the ledger. No single line is written wrong; two status fields simply have no rule forbidding that combination. Compose the two state machines and ask whether the forbidden joint state is reachable, and you get the shortest path that leads there:

✗ VoidedIsNeverPaid — voided and payment=succeeded must not co-occur.
    counterexample (reachable from the initial state):
      status = draft,     payment_status = pending
      status = finalized, payment_status = pending
      status = voided,    payment_status = pending
      status = voided,    payment_status = succeeded   ← the bug, as a reachable path

That path is the whole point. You get the exact sequence of steps that leads to the illegal state, so the bug arrives with reproduction steps already attached.
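A counterexample of this shape can be reproduced with a small product-machine search. Everything in the sketch is an assumption for illustration: the guards, the flag `webhook_checks_voided`, and the transition rules are mine, not the logic of prune-states or of the systems that were checked.

```python
from collections import deque

INITIAL = ("draft", "pending")   # (status, payment_status)

def next_states(state, webhook_checks_voided: bool):
    """Interleave the two machines: in each step either the invoice or the payment moves."""
    status, payment = state
    if status == "draft":
        yield ("finalized", payment)
    if status == "finalized" and payment != "succeeded":
        yield ("voided", payment)   # voiding an already paid invoice is refused
    # Payment webhook: the capture started on a finalized invoice and may land late.
    if payment == "pending" and status != "draft":
        if not (webhook_checks_voided and status == "voided"):
            yield (status, "succeeded")

def voided_is_never_paid(state) -> bool:
    """The cross-field invariant neither machine states on its own."""
    return state != ("voided", "succeeded")

def counterexample(webhook_checks_voided: bool):
    """Breadth-first search of the joint state space; shortest path to a violation, or None."""
    paths = {INITIAL: [INITIAL]}
    queue = deque([INITIAL])
    while queue:
        state = queue.popleft()
        if not voided_is_never_paid(state):
            return paths[state]
        for nxt in next_states(state, webhook_checks_voided):
            if nxt not in paths:
                paths[nxt] = paths[state] + [nxt]
                queue.append(nxt)
    return None

for status, payment in counterexample(webhook_checks_voided=False):
    print(f"status = {status}, payment_status = {payment}")
print(counterexample(webhook_checks_voided=True))   # None
```

With the guard missing, the search prints the same four-step path as the report above; with the one-line guard in place, the forbidden joint state is unreachable in this model.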

I did not make this example up. I pointed a small checker at two widely deployed open-source money systems, a billing engine and a commerce platform, and it pulled out exactly these contradictions. A reachable path is a candidate, not a bug: the checker works from the declared state machine, so a guard it cannot see (a callback, a conditional write) can make a "reachable" state unreachable in the running system. That is not hypothetical: it happened to me, and it is the very wall the whole technique runs into. So a candidate earns the word "bug" only once you have reproduced it on the real system, and both of these did reproduce, each on a deployed layer (a payment webhook handler, an admin route), each closed by a one-line guard.

And this class of bug is not my discovery: an ISSTA 2011 study found cross-field data-model errors of exactly this shape in two real Rails applications, fifteen years before I pointed a solver at it. These are latent bugs, exposed by bounded checking rather than by live outages. And they are real money paths too.

"So isn't this just model checking, or a schema linter?"

Fair, and the whole point is in the framing. Lifting a model from code and checking it has been possible for a long time: Rails data models were bounded-verified back in 2011. But the verifiers you can actually install today either make you write the spec by hand or extract a model only to catch generic crashes (deadlocks, races, null dereferences), not your business status fields. And the look-alike "drift" tools (prisma migrate diff, active_record_doctor, Hibernate validate) check whether the schema matches the database, not whether your validations match your constraints; that is a different drift, one level up. The unoccupied spot: read the status machine out of your aasm/XState and prove that a forbidden cross-field state is (un)reachable.

The checker itself is a small open-source tool: extract the state machine you already have and ask whether the bad state is reachable (prune-states). But the tool is beside the point. The point is to get these cuts into your codebase before you write the next boolean flag.

Where the cheap cut stops

Every static technique so far cuts a bug of one shape: a state that cannot be represented (the union) or cannot be reached (the checker). That is real, it is common, and in 2026 fencing it off costs almost nothing. But all of it sits on one side of a wall worth naming.

The dangerous bugs are usually about dataflow: a field written on a path that skipped the guard; a callback that quietly flipped state behind the model's back; a balance recomputed wrong. prune-states ran into this wall itself: one invariant produced a false positive because a callback changed state invisible to a declaration-level model. The bug here is not a reachable state but a write that reached a legal state by the wrong path. Tracing that statically (an analysis of write sites and consumers) is still a whole project in 2026, not a one-liner.

So you don't reach for it. You change layers, from compile time to run time, where the path stops mattering. A database CHECK fires on the write itself, so a wrong path that led to an illegal row is caught however it got there: the constraint checks the row being written, not the path that produced it. What still slips through is the "wrong but legal" write: a balance recomputed to a value the constraint allows, and the concurrency that produces it. Those are no longer shapes you forbid; they are races you make safe: an idempotency key, a transactional outbox, the right isolation level.
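A minimal sketch of the constraint catching the write regardless of path. It uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere; the table layout and the `late_webhook` handler are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        payment_status TEXT NOT NULL,
        CHECK (NOT (status = 'voided' AND payment_status = 'succeeded'))
    )
""")
conn.execute("INSERT INTO invoices VALUES (1, 'voided', 'pending')")
conn.commit()  # persist the seed row so a later rollback cannot undo it

def late_webhook(invoice_id: int) -> str:
    """A handler that forgot to check voided?; the constraint rejects the write anyway."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE invoices SET payment_status = 'succeeded' WHERE id = ?",
                (invoice_id,),
            )
        return "written"
    except sqlite3.IntegrityError:
        return "rejected"

print(late_webhook(1))   # rejected
print(conn.execute("SELECT status, payment_status FROM invoices").fetchall())
# [('voided', 'pending')]
```

The handler never looks at the status, and it does not matter: the illegal row is refused at the moment it is written. A wrong amount that still satisfies the CHECK would sail through, which is the "wrong but legal" residue described above.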

So you don't give up on dataflow; you change tactics: you stop proving the path is safe and start watching the write itself. The job is the same: name the invariant, then hand it to something more reliable than an attentive programmer.

The political problem

We were building a CRM. At a product meeting someone asked: what if we let users configure everything through a visual interface: triggers, actions, even table schemas? Not just values, the whole data model. The argument was reasonable: customers want flexibility, and "PMs will configure it visually" meant non-engineers could own a piece of the product without filing a ticket. Nobody said no, because nobody in the room was thinking about state spaces. They were thinking about user choice, which genuinely is a good thing.

In practice, "the user configures the data model" meant that every combination of trigger, action chain, and schema is a product state we are obliged to handle. And the space is not large, it is unbounded: a separate topology per customer. We spent months building infrastructure for flexibility nobody needed. And the whole no-code argument that justified it evaporated the moment the "non-engineer writes code" tool got good: now a PM describes the schema to an LLM and gets it in thirty seconds. The complexity outlived the argument that built it.

That is the shape of the political problem: not a wrong decision, but a reasonable argument made before anyone pulled out a number. Out of the Tar Pit names the cost precisely: every added bit of state doubles the number of possible states; complexity multiplies rather than adds. But a one-off "this adds unboundedly many states" argument is capital spent against an incentive structure that rewards whoever shipped the feature, not whoever counted its states. You don't win that fight by arguing louder; you win it by making the cost ambient.

Give the state space a budget, the way you budget bundle size: a CI bot that comments on the PR, "this change adds two boolean fields to CheckoutProps: 32 states → 128". Ship it as information, not as a gate: a gate gets disabled the first time it blocks a roadmap feature, while a number in the PR lives long enough to change what "normal" means. It is cheaper than what comes later: bugs you can't reproduce, and the team you will end up hiring to answer "what is this customer's configuration even supposed to look like?"
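The number in that comment is a one-liner to compute. A rough sketch, assuming the fields are independent booleans (so the count is an upper bound) and that some other step has already counted them before and after the change; both function names are hypothetical.

```python
def state_count(boolean_fields: int) -> int:
    """Upper bound on distinct states of a type made of independent boolean fields."""
    return 2 ** boolean_fields

def budget_comment(type_name: str, before: int, after: int) -> str:
    """Format the informational PR comment; it reports a number and blocks nothing."""
    added = after - before
    noun = "field" if added == 1 else "fields"
    return (
        f"this change adds {added} boolean {noun} to {type_name}: "
        f"{state_count(before)} states → {state_count(after)}"
    )

print(budget_comment("CheckoutProps", before=5, after=7))
# this change adds 2 boolean fields to CheckoutProps: 32 states → 128
```

Counting the fields is the stack-specific part (a TypeScript AST walk, a schema diff); the arithmetic that makes the cost visible is just this doubling.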

Start cutting

You have already made the static cuts (Part 1), or start with those first. This is the verification layer.

This week. Take the feature flags your test suite runs one at a time. Generate a covering array. Ten rows instead of a million; the pair that breaks you lands in there by construction. And if those flags can be deleted into a discriminated union instead, do that first: a bug that cannot be written needs no test.

This month. Take one of your state machines (AASM, XState, a reducer) and write down a cross-field invariant it implies but nowhere enforces. Check whether that state is reachable, by hand or with a small checker. Give your state space a budget: a CI comment that shows, before any merge, "this change adds two boolean fields: 32 states → 128".

Every cut is small. Every one removes a class of bugs for good. Stack a few and your weekends stop getting interrupted.

We opened Part 1 with a timeline branching out of control: yours, and now the model's too. Two articles later: the bad states can't be written, and the ones that survived are unreachable. A branch the type can't build can't ship; a state the checker can't reach won't take down prod. And then nothing that passed every test takes down prod anymore.

Prune the timeline. One cut at a time.

References

Sources and further reading

Murali Krishna Ramanathan et al., Piranha: Reducing Feature Flag Debt at Uber (ICSE 2020)
D. R. Kuhn, D. R. Wallace, A. M. Gallo, Software Fault Interactions and Implications for Software Testing (IEEE TSE 2004): the NIST interaction rule behind the covering-arrays section
Unleash, When Feature Flags Interact
Hillel Wayne, Learn TLA+: the modern entry point
Hillel Wayne, Why Don't People Use Formal Methods? (2019): the spec-writing barrier that the "spec got cheap to write" section argues has just dropped
Chris Newcombe et al., How Amazon Web Services Uses Formal Methods (CACM 2015)
Marc Brooker et al., Systems Correctness Practices at Amazon Web Services (CACM 2025): the follow-up a decade later, with P, property-based testing, and deterministic simulation alongside TLA+
Leslie Lamport, The TLA+ Home Page
Can LLMs Model Real-World Systems in TLA+? (ACM SIGOPS, 2026): source of "the Etcd spec that was really the Raft paper's appendix"; why the source of truth is the TLC run, not the model
Travis Hance, Marijn Heule, Ruben Martins, Bryan Parno, Finding Invariants of Distributed Systems: It's a Small (Enough) World After All (NSDI 2021): the first automatic proof of Paxos safety
Jaideep Nijjar, Tevfik Bultan, Bounded Verification of Ruby on Rails Data Models (ISSTA 2011): mechanically extracts Active Record models into Alloy and finds real cross-field data-model bugs in two production Rails applications
Maysam Yabandeh, Abhishek Anand, Marco Canini, Dejan Kostić, Finding Almost-Invariants in Distributed Systems (SRDS 2011): mines properties that hold almost always from traces of a running system
Ben Moseley, Peter Marks, Out of the Tar Pit (2006): the canonical argument that state is the chief source of complexity
Chris Hawblitzel et al., IronFleet: Proving Practical Distributed Systems Correct (SOSP 2015): an 85-line spec and a 3.7 person-year proof
