Ernie

The Feedback Loop Is All You Need

ai · engineering · feedback-loops · developer-tools

Introduction

So Claude Code added CRON a few days ago. Recurring tasks, native, built right in. Schedule an agent, go to sleep, wake up to results. Your daily token limit actually getting used instead of going to waste.

(I use Claude Code, so the examples here come from that world — but the pattern is the same whether you're in Cursor, Copilot, Codex, or Devin.)

And I'm sitting here like… I can't even use this. Not on the real codebase. Not at work. I'm at Archive.com — we're five years old and already on our third design system. Started on Shopify's Polaris, switched to Ant Design when we outgrew Shopify, now migrating to shadcn/ui and Tailwind because Ant Design became its own kind of legacy. Five years, three UI frameworks, conventions that live in people's heads, business rules no one ever wrote down. You point an agent at that, it'll run. It'll produce code. Beautiful, idiomatic, unholy code — the kind that imports from all three design systems in one file and somehow passes every check you have.

So what do you do? You can't review everything. You can't slow down the agents. And you definitely can't trust them to just figure out which design system to use.

This article is about what actually works.


The old loop vs the new one

For most of my career, the developer loop looked like this: write or review code, spot smells by experience, leave comments explaining intent, and promise to fix things "later" — which usually meant never.

Agents break that loop completely. When code can be produced nonstop, manual review becomes the weakest link. So the loop has to change: encode rules once, let agents iterate against them, observe what fails, tighten the constraints. Less "remember this next time," more "this literally cannot happen."


The real enemy: silent drift

The most dangerous failure mode in agent-driven systems isn't obvious breakage — it's silent drift. Code that compiles, passes every test, looks perfectly reasonable in review — and quietly violates the architectural assumptions you thought were safe.

A trivial example: the agent adds a form to a page you migrated to shadcn/ui. It reaches for Ant Design's <Form.Item> — because that's what the other form on the page still uses. It compiles. It renders. Your migration just went backwards by one component, and nothing in your pipeline noticed.

Or watch it happen with CSS. The agent writes a new component using Tailwind utilities — correct, per your current standard. But it copies a padding value from an old Ant Design component next door: p-[24px] instead of your spacing scale p-6. One magic number won't kill you. Fifty will. Each commit looked fine in isolation. The drift was invisible until it wasn't.

Humans catch this through intuition. Agents don't. They need deterministic, immediate signals. Without them, you're just sending "still broken" for the fifteenth time.

[Meme: AI IDE after I send "Still broken" for the 15th time. Source: ProgrammerHumor.io]

The entire game is reducing the distance between wrong change and clear failure.


Skills are helpful — enforcement is mandatory

Most developers who use AI heavily are still in the old reality. CLAUDE.md. Skill libraries. Document annotations. Write it down clearly enough, they think, and the agent will follow. Even Vercel shipped a skill library — 40+ React performance rules, beautifully written, structured as SKILL.md files for AI agents. Sounds like it should work, right?

It won't work. Not reliably. You're shipping too much code for humans to catch all of it.

The code is more what you'd call guidelines than actual rules

CLAUDE.md is the pirate code. And we're betting the codebase on the hope that a probabilistic system will get lucky every single time. Sometimes it does. Sometimes you get beautiful, idiomatic code on the first try. And sometimes it quietly imports from the wrong design system and nobody notices for three weeks.

Here's the thing: we already solved this problem. We spent decades learning that "just write good code" doesn't scale. We invented unit tests because humans forget edge cases. We invented linters because humans disagree on style. We invented CI because "it works on my machine" stopped being funny after the third production outage. Every one of those tools exists because good intentions don't survive contact with a real codebase.

And now we're doing the exact same thing with LLMs. "Just write a really good CLAUDE.md." "Just add more skills." It's the same magical thinking, just with fancier technology. We already know how this ends.

CLAUDE.md explains the why and helps the agent get it right on the first try. A lint rule makes sure it can't get it wrong. Skills speed you up. Linters keep you honest. If you can only have one — take the linter.


Local guardrails

Here's what runs on every change:

  • ESLint — because the agent doesn't have ten years of muscle memory about your import conventions
  • SonarJS — entire bug classes, gone before they start
  • Strict TypeScript — if the types are loose, the agent will find every crack
  • Opinionated React constraints — no "creative" component patterns at 3 AM
  • Prettier — mandatory, non-negotiable, never think about formatting again

Each of these removes a decision the agent could get wrong. The stack doesn't matter — RuboCop does the same for Ruby, Ruff for Python, clippy for Rust. The principle is the same: every enforced rule is one less way the agent can drift. Think of it like functional programming — good code minimizes possible states, and good AI infrastructure does the same. But off-the-shelf linters only eliminate syntax-level ambiguity. Architecture-level decisions need something custom.


Custom lint rules: mindset shift

Every time you see something that should never happen again, ask: can this be a lint rule? That question is the real shift — it turns you from a reviewer into an architect who builds permanent guardrails, like having an architecture team that works 365 days a year and never forgets.

The workflow becomes: recurring PR comment → lint rule → never reviewed again. That's the migration path — every convention that lives in a wiki, a CLAUDE.md, or someone's head should be moving toward a lint rule. The documentation doesn't disappear, but it stops being the last line of defense. What lived in backlog now lives in CI. What depended on reviewer attention now fires on every commit.

The barrier is real — the first rule is the hardest. But once the pattern is established, rules accumulate and compound. We had agents adding console.log to production code instead of our custom logger that routes to Datadog. A 10-line lint rule fixed that — forbid console.log, suggest logger.error. Once it's in the linter, the problem is gone forever.

People balk at 50 custom rules. Good — that discomfort is the signal. And some of your rules will be suboptimal. That's fine. Rules improve the same way laws do — someone disagrees, proposes a change, and the system gets better through the argument. Someone hits a rule, gets annoyed, opens a PR to change it — and now you're having the architectural conversation you never had. A codebase with bad rules is in a shape you can improve. A codebase with no rules is just vibes. And when a rule requires migrating existing code, AI + codemods make the cleanup feasible in hours rather than quarters.

Rules catch what you've seen before. But what about failures you haven't imagined?


CI, screenshots, observability

GitHub Actions is the nervous system. Every push triggers the full check — not because I don't trust the agent, but because I don't trust anything that hasn't been verified.

Playwright screenshot tests validate that the UI matches intent — not just that tests pass. The kinds of things they catch are invisible to unit tests: a z-index regression that buries a modal behind an overlay, a layout shift from a refactored flex container, a button that renders but is completely unclickable. Chromatic does the same thing for Storybook-based workflows — visual diffs on every PR, no manual QA. Either way, the point is the same: if no one looks at the screen, the screen will break.
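The assertion itself is one line of Playwright. A sketch, assuming @playwright/test is already configured with a baseURL; the route and snapshot name are illustrative:

```javascript
// checkout.spec.js — screenshot-test sketch, assuming @playwright/test is set up.
const { test, expect } = require('@playwright/test');

test('checkout page matches the approved screenshot', async ({ page }) => {
  await page.goto('/checkout');
  // Fails the build on visual drift: buried modals, layout shifts,
  // rendered-but-unclickable buttons — the things unit tests can't see.
  await expect(page).toHaveScreenshot('checkout.png', { maxDiffPixelRatio: 0.01 });
});
```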

Sentry and Datadog feed production signals back into the task queue. When something breaks at 2 AM, it becomes a task, not a mystery.

Now you have write-time checks, integration-time checks, and runtime checks. So what happens when you connect them?


Tokens and ROI

Tokens cost money. But the question isn't "is compute expensive?" — it's "what does a missed bug cost?"

CodeRabbit analyzed 470 GitHub PRs and found AI-generated code has 1.7x more bugs than human-written code. 2.74x more security vulnerabilities. They put it bluntly: "We no longer have a creation problem. We have a confidence problem."

So yes, you're paying for tokens. And teams balk at it. They'll debate whether $200/month for CodeRabbit or a Codex seat is "worth it" — while shipping software with 80%+ gross margins. Think about that for a second. Construction runs on 5-10% margins and nobody argues about buying a level. Restaurants run on 3-5% and still pay for health inspections. Software is the most profitable industry in human history — where the entire means of production is a laptop and a chair you already own — and we're haggling over the cost of automated code review.

A senior engineer in the US costs $150-200/hour loaded. A production bug found by a customer costs days of investigation, emergency fixes, and trust you don't get back. A $200/month tool that catches even one of those per quarter has already paid for itself ten times over. The question was never "can we afford the tooling?" — it's "can we afford not to?"

Every tightening of the feedback loop multiplies what the agent can ship autonomously. That's not a cost center. That's leverage.

As Karpathy put it: "The goal is to claim the leverage from the use of agents but without any compromise on the quality of the software." That's the deal. You don't get the leverage for free — you get it by investing in the feedback infrastructure that makes leverage safe.


The organism

This is what the whole thing was building toward. Put it all together and the system becomes self-tightening:

    ┌─────────────────────────────────────────┐
    │                                         │
    ▼                                         │
  Agent ──▶ Rules ──▶ CI ──▶ Observability    │
                                  │           │
                                  ▼           │
                                Tasks ────────┘

Here's how it works in practice. An agent opens a PR. A custom lint rule catches a barrel-file violation — the agent fixes it. CI runs Playwright; a screenshot shows a layout shift — the agent adjusts the CSS. Sentry reports an increase in 404s on the staging deploy — a new task is created. The agent picks it up. Each failure tightened the system. No human typed a line of code.

Every bug that reaches CI becomes a rule that prevents the next one. The system doesn't just resist failure — it gets stronger from it. That's not a toolchain. That's an organism.

This isn't theoretical. Spotify built a background coding agent called Honk on top of feedback infrastructure they'd been investing in since 2022 — three years before the AI part. Result: 650+ agent-generated PRs merged to production per month. Devin's merge rate doubled from 34% to 67% when they improved codebase understanding — not the model, the context. The pattern is the same everywhere you look: the teams that win aren't running better models. They're running tighter loops.

Where are you?

  • Level 0 — Vibes: no custom linting, no CI, you review everything manually. The tell: "My eyes are the only thing between the agent and production."
  • Level 1 — Guardrails: standard linters + CI, but no custom rules. The tell: "The agent passes lint but still drifts architecturally."
  • Level 2 — Architecture as Code: custom lint rules encoding your team's conventions. The tell: "CLAUDE.md rules are migrating into the linter."
  • Level 3 — The Organism: a self-tightening loop, agent → rules → CI → observability → tasks → agent. The tell: "I schedule agents overnight and review diffs in the morning."

If you're at Level 0, no shame — that's where everyone starts. The point isn't to leap to Level 3 overnight. It's to know what you're building toward, and to start with the first custom rule that makes your specific codebase more honest.


Start here

You don't need to build the whole organism in a weekend. You need to start one layer and let it compound.

Today: Pick the PR comment your team keeps leaving — the one about import conventions, or barrel files, or console.log in production — and turn it into a custom lint rule. That's your first piece of architecture-as-code.

This week: Add Playwright screenshot tests for your three most critical pages. You'll be surprised what they catch that unit tests miss.

This month: Schedule a CRON task for something safe — dependency updates, test suite maintenance, stale branch cleanup. Let the agent run overnight. Review the PR in the morning. Start with Claude Code web or Codex web; when that's not enough, a cheap VPS gives you more power for the same idea.
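On a VPS, the schedule is just cron driving the CLI's headless print mode. A sketch, assuming the claude CLI is installed and authenticated on the box; the path, prompt, and time are illustrative:

```shell
# crontab entry (sketch): every night at 02:00, run a safe maintenance task.
# Assumes `claude` is on PATH and the repo's guardrails (lint, CI) are in place.
0 2 * * * cd /srv/myrepo && claude -p "Update dependencies, run the test suite, and open a PR if everything is green" >> /var/log/agent-cron.log 2>&1
```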

The test: If you can delegate a task from your phone, review the diff on a commute, and trust the result — your feedback loop is tight enough. The unlock isn't working less. It's being untethered from the machine entirely.

I've published a companion repo — agent-lint — with working examples of everything in this article: Claude Code skills that audit your feedback loop maturity and generate custom lint rules from PR comments, an ESLint plugin with the exact rules described above, and a GitHub Actions template for the CI nervous system.


Closing

Three design systems in five years. The agent doesn't know which one it should be using — unless you tell it, deterministically, on every commit.

That's the whole insight, really. The agent that felt idle wasn't waiting for a better model — it was waiting for better sensors. LLMs are probabilistic by nature — they'll get it right most of the time, which on a real codebase means not enough of the time. No amount of prompt engineering changes that. It's not a flaw you can fix; it's the architecture. So stop chasing clever prompts. Chase boring, deterministic, tedious feedback — the kind that fires whether or not anyone is watching, the kind that doesn't care how confident the model was. Linters don't sleep. They don't hallucinate. They don't drift. That's the point.