Ernie

The Feedback Loop Is All You Need

ai · engineering · feedback-loops · developer-tools

Introduction

So Claude Code added CRON a few days ago. Recurring tasks, native, built right in. Schedule an agent, go to sleep, wake up to results. Your daily token limit actually getting used instead of going to waste.

(I use Claude Code, so the examples here come from that world — but the pattern is the same whether you're in Cursor, Copilot, Codex, or Devin.)

And I'm sitting here like… I can't even use this. Not on the real codebase. Not at work. I'm at Archive.com — we're five years old and already on our third design system. Started on Shopify's Polaris, switched to Ant Design when we outgrew Shopify, now migrating to shadcn/ui and Tailwind because Ant Design became its own kind of legacy. Five years, three UI frameworks, conventions that live in people's heads, business rules no one ever wrote down. You point an agent at that, it'll run. It'll produce code. Beautiful, idiomatic, unholy code — the kind that imports from all three design systems in one file and somehow passes every check you have.

So what do you do? You can't review everything. You can't slow down the agents. And you definitely can't trust them to just figure out which design system to use.

This article is about what actually works.


The old loop vs the new one

For most of my career, the developer loop looked like this: write or review code, spot smells by experience, leave comments explaining intent, and promise to fix things "later" — which usually meant never.

Agents break that loop completely. When code can be produced nonstop, manual review becomes the weakest link. So the loop has to change: encode rules once, let agents iterate against them, observe what fails, tighten the constraints. Less "remember this next time," more "this literally cannot happen."


The real enemy: silent drift

The most dangerous failure mode in agent-driven systems isn't obvious breakage — it's silent drift. Code that compiles, passes every test, looks perfectly reasonable in review — and quietly violates the architectural assumptions you thought were safe.

A trivial example: the agent adds a form to a page you migrated to shadcn/ui. It reaches for Ant Design's <Form.Item> — because that's what the other form on the page still uses. It compiles. It renders. Your migration just went backwards by one component, and nothing in your pipeline noticed.

Or watch it happen with CSS. The agent writes a new component using Tailwind utilities — correct, per your current standard. But it copies a padding value from an old Ant Design component next door: p-[24px] instead of your spacing scale p-6. One magic number won't kill you. Fifty will. Each commit looked fine in isolation. The drift was invisible until it wasn't.

Humans catch this through intuition. Agents don't. They need deterministic, immediate signals. Without them, you're just sending "still broken" for the fifteenth time.

[Meme: AI IDE after I send "Still broken" for the 15th time. Source: ProgrammerHumor.io]

The entire game is reducing the distance between wrong change and clear failure.


Skills are helpful — enforcement is mandatory

Most developers who use AI heavily are still in the old reality. CLAUDE.md. Skill libraries. Document annotations. Write it down clearly enough, they think, and the agent will follow. Even Vercel shipped a skill library — 40+ React performance rules, beautifully written, structured as SKILL.md files for AI agents. Sounds like it should work, right?

It won't work. Not reliably. You're shipping too much code for humans to catch all of it.

The code is more what you'd call guidelines than actual rules

CLAUDE.md is the pirate code. And we're betting the codebase on the hope that a probabilistic system will get lucky every single time. Sometimes it does. Sometimes you get beautiful, idiomatic code on the first try. And sometimes it quietly imports from the wrong design system and nobody notices for three weeks.

Here's the thing: we already solved this problem. We spent decades learning that "just write good code" doesn't scale. We invented unit tests because humans forget edge cases. We invented linters because humans disagree on style. We invented CI because "it works on my machine" stopped being funny after the third production outage. Every one of those tools exists because good intentions don't survive contact with a real codebase.

And now we're doing the exact same thing with LLMs. "Just write a really good CLAUDE.md." "Just add more skills." It's the same magical thinking, just with fancier technology. We already know how this ends.

CLAUDE.md explains the why and helps the agent get it right on the first try. A lint rule makes sure it can't get it wrong. Skills speed you up. Linters keep you honest. If you can only have one — take the linter.


Local guardrails

Here's what runs on every change:

  • ESLint — because the agent doesn't have ten years of muscle memory about your import conventions
  • SonarJS — entire bug classes, gone before they start
  • Strict TypeScript — if the types are loose, the agent will find every crack
  • Opinionated React constraints — no "creative" component patterns at 3 AM
  • Prettier — mandatory, non-negotiable, never think about formatting again

Each of these removes a decision the agent could get wrong. The stack doesn't matter — RuboCop does the same for Ruby, Ruff for Python, clippy for Rust. The principle is the same: every enforced rule is one less way the agent can drift. Think of it like functional programming — good code minimizes possible states, and good AI infrastructure does the same. But off-the-shelf linters only eliminate syntax-level ambiguity. Architecture-level decisions need something custom.


Custom lint rules: mindset shift

Every time you see something that should never happen again, ask: can this be a lint rule? That question is the real shift — it turns you from a reviewer into an architect who builds permanent guardrails, like having an architecture team that works 365 days a year and never forgets.

The workflow becomes: recurring PR comment → lint rule → never reviewed again. That's the migration path — every convention that lives in a wiki, a CLAUDE.md, or someone's head should be moving toward a lint rule. The documentation doesn't disappear, but it stops being the last line of defense. What lived in backlog now lives in CI. What depended on reviewer attention now fires on every commit.

The barrier is real — the first rule is the hardest. But once the pattern is established, rules accumulate and compound. We had agents adding console.log to production code instead of our custom logger that routes to Datadog. A 10-line lint rule fixed that — forbid console.log, suggest logger.error. Once it's in the linter, the problem is gone forever.

People balk at 50 custom rules. Good — that discomfort is the signal. And some of your rules will be suboptimal. That's fine. Rules improve the same way laws do — someone disagrees, proposes a change, and the system gets better through the argument. Someone hits a rule, gets annoyed, opens a PR to change it — and now you're having the architectural conversation you never had. A codebase with bad rules is in a shape you can improve. A codebase with no rules is just vibes. And when a rule requires migrating existing code, AI + codemods make the cleanup feasible in hours rather than quarters.

Rules catch what you've seen before. But what about failures you haven't imagined?


CI, screenshots, observability

GitHub Actions is the nervous system. Every push triggers the full check — not because I don't trust the agent, but because I don't trust anything that hasn't been verified.

Playwright screenshot tests validate that the UI matches intent — not just that tests pass. The kinds of things they catch are invisible to unit tests: a z-index regression that buries a modal behind an overlay, a layout shift from a refactored flex container, a button that renders but is completely unclickable. Chromatic does the same thing for Storybook-based workflows — visual diffs on every PR, no manual QA. Either way, the point is the same: if no one looks at the screen, the screen will break.
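The assertion itself is one line of Playwright. A sketch, assuming @playwright/test is already configured with a baseURL; the route and snapshot name are illustrative:

```javascript
// checkout.spec.js — screenshot-test sketch, assuming @playwright/test is set up.
const { test, expect } = require('@playwright/test');

test('checkout page matches the approved screenshot', async ({ page }) => {
  await page.goto('/checkout');
  // Fails the build on visual drift: buried modals, layout shifts,
  // rendered-but-unclickable buttons — the things unit tests can't see.
  await expect(page).toHaveScreenshot('checkout.png', { maxDiffPixelRatio: 0.01 });
});
```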

Sentry and Datadog feed production signals back into the task queue. When something breaks at 2 AM, it becomes a task, not a mystery.

Now you have write-time checks, integration-time checks, and runtime checks. So what happens when you connect them?


Tokens and ROI

Tokens cost money. But the question isn't "is compute expensive?" — it's "what does a missed bug cost?"

CodeRabbit analyzed 470 GitHub PRs and found AI-generated code has 1.7x more bugs than human-written code. 2.74x more security vulnerabilities. They put it bluntly: "We no longer have a creation problem. We have a confidence problem."

So yes, you're paying for tokens. And teams balk at it. They'll debate whether $200/month for CodeRabbit or a Codex seat is "worth it" — while shipping software with 80%+ gross margins. Think about that for a second. Construction runs on 5-10% margins and nobody argues about buying a level. Restaurants run on 3-5% and still pay for health inspections. Software is the most profitable industry in human history — where the entire means of production is a laptop and a chair you already own — and we're haggling over the cost of automated code review.

A senior engineer in the US costs $150-200/hour loaded. A production bug found by a customer costs days of investigation, emergency fixes, and trust you don't get back. A $200/month tool that catches even one of those per quarter has already paid for itself ten times over. The question was never "can we afford the tooling?" — it's "can we afford not to?"

Every tightening of the feedback loop multiplies what the agent can ship autonomously. That's not a cost center. That's leverage.

As Karpathy put it: "The goal is to claim the leverage from the use of agents but without any compromise on the quality of the software." That's the deal. You don't get the leverage for free — you get it by investing in the feedback infrastructure that makes leverage safe.


The organism

This is what the whole thing was building toward. Put it all together and the system becomes self-tightening:

    ┌─────────────────────────────────────────┐
    │                                         │
    ▼                                         │
  Agent ──▶ Rules ──▶ CI ──▶ Observability    │
                                  │           │
                                  ▼           │
                                Tasks ────────┘

Here's how it works in practice. An agent opens a PR. A custom lint rule catches a barrel-file violation — the agent fixes it. CI runs Playwright; a screenshot shows a layout shift — the agent adjusts the CSS. Sentry reports an increase in 404s on the staging deploy — a new task is created. The agent picks it up. Each failure tightened the system. No human typed a line of code.

Every bug that reaches CI becomes a rule that prevents the next one. The system doesn't just resist failure — it gets stronger from it. That's not a toolchain. That's an organism.

This isn't theoretical. Spotify built a background coding agent called Honk on top of feedback infrastructure they'd been investing in since 2022 — three years before the AI part. Result: 650+ agent-generated PRs merged to production per month. Devin's merge rate doubled from 34% to 67% when they improved codebase understanding — not the model, the context. The pattern is the same everywhere you look: the teams that win aren't running better models. They're running tighter loops.

Where are you?

  • Level 0 — Vibes: no custom linting, no CI, you review everything manually. The tell: "My eyes are the only thing between the agent and production."
  • Level 1 — Guardrails: standard linters + CI, but no custom rules. The tell: "The agent passes lint but still drifts architecturally."
  • Level 2 — Architecture as Code: custom lint rules encoding your team's conventions. The tell: "CLAUDE.md rules are migrating into the linter."
  • Level 3 — The Organism: a self-tightening loop, agent → rules → CI → observability → tasks → agent. The tell: "I schedule agents overnight and review diffs in the morning."

If you're at Level 0, no shame — that's where everyone starts. The point isn't to leap to Level 3 overnight. It's to know what you're building toward, and to start with the first custom rule that makes your specific codebase more honest.


Start here

You don't need to build the whole organism in a weekend. You need to start one layer and let it compound.

Today: Pick the PR comment your team keeps leaving — the one about import conventions, or barrel files, or console.log in production — and turn it into a custom lint rule. That's your first piece of architecture-as-code.

This week: Add Playwright screenshot tests for your three most critical pages. You'll be surprised what they catch that unit tests miss.

This month: Schedule a CRON task for something safe — dependency updates, test suite maintenance, stale branch cleanup. Let the agent run overnight. Review the PR in the morning. Start with Claude Code web or Codex web; when that's not enough, a cheap VPS gives you more power for the same idea.
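On a VPS, the schedule is just cron driving the CLI's headless print mode. A sketch, assuming the claude CLI is installed and authenticated on the box; the path, prompt, and time are illustrative:

```shell
# crontab entry (sketch): every night at 02:00, run a safe maintenance task.
# Assumes `claude` is on PATH and the repo's guardrails (lint, CI) are in place.
0 2 * * * cd /srv/myrepo && claude -p "Update dependencies, run the test suite, and open a PR if everything is green" >> /var/log/agent-cron.log 2>&1
```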

The test: If you can delegate a task from your phone, review the diff on a commute, and trust the result — your feedback loop is tight enough. The unlock isn't working less. It's being untethered from the machine entirely.

I've published a companion repo — agent-lint — with working examples of everything in this article: Claude Code skills that audit your feedback loop maturity and generate custom lint rules from PR comments, an ESLint plugin with the exact rules described above, and a GitHub Actions template for the CI nervous system.


Closing

Three design systems in five years. The agent doesn't know which one it should be using — unless you tell it, deterministically, on every commit.

That's the whole insight, really. The agent that felt idle wasn't waiting for a better model — it was waiting for better sensors. LLMs are probabilistic by nature — they'll get it right most of the time, which on a real codebase means not enough of the time. No amount of prompt engineering changes that. It's not a flaw you can fix; it's the architecture. So stop chasing clever prompts. Chase boring, deterministic, tedious feedback — the kind that fires whether or not anyone is watching, the kind that doesn't care how confident the model was. Linters don't sleep. They don't hallucinate. They don't drift. That's the point.