From Legitimate to Sustainable

A builder's guide for software written with AI — the part nobody had time to write down yet.

§ 01 Why this matters now

In early 2025, vibe coding was a phrase a few people used to describe what happens when you write software by talking to an AI assistant: describe what you want, the AI produces it, you ship the artifact. By 2026, it's no longer a novelty; it's how a lot of new software gets made. Builders without a traditional engineering background are shipping products to real users. Companies have replaced human capability with AI tooling at scale.

The promise was that the artifact would be enough. That promise is starting to wobble.

In May 2025, Klarna reversed its earlier announcement that an AI chatbot was doing the work of 700 customer-service agents and started rehiring people. The CEO was direct about the cause: cost-driven decisions had produced lower-quality outcomes. IBM walked back a similar internal HR automation. By February 2026, Gartner — a research firm whose forecasts move corporate budgets — predicted that half of AI-driven layoffs would reverse by 2027. The cost calculation that drove a lot of the AI bet is no longer obviously favorable, either: the price of running AI tooling at the volume real production needs is closing in on what it would have cost to just hire the developer.

The buyer's remorse isn't proof the work was wrong. It's proof the work is legitimate but not yet sustainable — it works for the builder, it works for the demo, it doesn't work for the product.

The artifacts are real. What's missing is the discipline that makes them durable — the discipline that turns a thing-that-works-once into a thing-that-works-the-tenth-time-someone-else-runs-it.

The default response is to walk back: treat the AI bet as a mistake, return to traditional practice, leave the artifacts behind. This work argues for a different response: keep the artifacts and supply that discipline.

This is a builder's guide to that discipline.

§ 02 Dreaming a Dyson sphere

Imagine dreaming a Dyson sphere: a megastructure surrounding a star, harvesting its energy. In the dream, the sphere is real. It's coherent. It works. The dreamer didn't construct each ring, plan each interior planet, or articulate the load-bearing structure required for the sphere to actually be a sphere. None of that means the structure isn't there. The dream's logic demanded it. The architecture is implicit in the artifact's existence; what's missing is the snapshot, the disciplined act of going inside and studying what the dream's coherence already required.

Vibe coding works the same way. The builder describes the artifact, the AI brings it into existence, the artifact runs, real users use it. From inside the dream, the builder can't see the interior structure — the assumption the prompt didn't name, the failure mode the demo path didn't exercise, the architecture the codebase grew without anyone planning it. The artifact's coherence requires that structure; the builder didn't author it directly.

Deepening is the going-in. Take a snapshot of the artifact. Study the interior. Confirm the structure holds. Where it doesn't, surface what's missing — not because the dream was wrong, but because every dream has implicit architecture, and sustainability depends on making it visible.

The five commitments below are the tools for the going-in.

§ 03 The five commitments

Five tenets, in order. Each one's output is the operating surface for the next. Skip any one and the chain breaks at that point — the remaining tenets can't compensate. The compounding runs both directions, though: a failure in any later tenet surfaces as a question an earlier one can already answer. That's how the doctrine makes failures recoverable rather than catastrophic.

01 · Legitimacy · Argue first.

We only grade what we can ground in objective signal.

Before the code runs, the builder establishes what the artifact is claiming and within what scope the claim holds. The boundary between what's measured and what's only judged gets documented in the work itself, not assumed. Under-claim rather than dress up a guess as a number.

The clearest illustration of the principle is a finding that limits itself. In 2026, Anthropic's Alignment Science team reported that AI models tasked with reviewing other AI safety work — they call them Automated Alignment Researchers, or AARs — recovered 97% of the gap between weak and strong human supervisors in five days, versus 23% recovered by humans in seven days. The headline is real. The companion sentence, written by the same paper's authors, is also real:

The success of the AARs in recovering the performance gap is not a sign that frontier AI models are general-purpose alignment scientists. — Anthropic Alignment Science Team, 2026

The experiment was scoped to a problem with one objective measure of success. The paper is explicit that the result holds within that scope and not outside it. That's Legitimacy: the boundary lives inside the work. A 97% headline detached from the boundary that contained it is a claim the paper never made.

02 · Discipline · Map before you walk.

The scaffold is the work.

Every surface the builder ships produces a real-time, builder-visible structure — a checklist, a plan, a list of decisions, a record of what the code is supposed to do. The builder can read it while the build is happening. If the plan can't be reconstructed from the artifact, the plan was theater.

This used to be aspirational: interpretability (the ability to read what an AI model is actually doing inside) was something you couldn't really do at production scale. That has changed. In the last few years, work from Anthropic's interpretability team and others has crossed a threshold: the internal structure of frontier AI models is, at least in part, recoverable. Tens of millions of interpretable features have been extracted from production-scale models. Open-source tools made the techniques available outside any single lab. The argument "we didn't have time to understand it" and the argument "it can't be understood" are now two different claims, and only one of them is defensible.

What happens when the map gets skipped: a model trained to pursue a hidden bias-appeasement objective continued to pursue it under normal behavioral evaluation. The model was reluctant to reveal the goal when asked directly. Interpretability methods caught what behavioral checks missed.

A builder who skips the map of the artifact's interior incurs interpretability debt — structure that had to exist for the artifact to cohere is now unread, accumulating drift, hosting failure modes the surface checks won't catch. The cost doesn't fall on the moment of shipping. It falls later, when something breaks and the builder can't explain why.

03 · Quality · Measure honestly.

We measure honestly or we don't measure.

Pass-rate gets reported as a distribution — here's what happens if you run it ten times — not as a headline. Coverage gets measured, not declared. Cost is part of the reliability number, not a footnote. When the result comes back uglier than hoped, that's the signal — the discomfort is doing its job.

Two things are required and they're the same claim from two angles:

Honest coverage — measure against the full surface the artifact will touch in production, not the convenient subset where it already performs.
Repetition variance (in the research literature, written pass^k and read "pass hat k"): run the same task k times and report how often every one of those runs succeeds. This is stricter than the more familiar pass@k ("pass at k"), which only asks whether at least one of k runs succeeds. A single success tells you nothing about whether the system is reliable or merely lucky.
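To make the repetition half concrete, here is a minimal sketch of how pass^k could be computed from repeated runs. It assumes the combinatorial estimator used in the τ-Bench line of work (the share of k-sized subsets of the recorded runs in which every run succeeded); the task names and outcomes are hypothetical, and the function names are illustrative rather than taken from any library.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs of the same task ALL succeed.

    Uses C(successes, k) / C(trials, k): the fraction of k-sized subsets
    of the recorded trials that contain only successful runs.
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


def pass_at_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that AT LEAST ONE of k runs succeeds (the forgiving metric)."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return 1.0 - comb(trials - successes, k) / comb(trials, k)


# Hypothetical outcomes: ten repeated runs per task.
RUNS = {
    "checkout_flow": [True] * 10,
    "refund_flow": [True] * 8 + [False] * 2,
    "edge_case_import": [True] * 5 + [False] * 5,
}


def report(runs: dict, k: int) -> dict:
    """Return pass^k per task, rounded to three decimals, rather than one headline number."""
    return {
        task: round(pass_hat_k(sum(outcomes), len(outcomes), k), 3)
        for task, outcomes in runs.items()
    }


if __name__ == "__main__":
    for k in (1, 4):
        print(k, report(RUNS, k))
```

In this made-up data, a task that succeeds 8 times out of 10 scores 0.8 at k=1 but about 0.333 at k=4, while pass@4 for the same task is 1.0. The spread between those figures is the variance a single headline number hides.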

Coverage without repetition is cherry-picking. Repetition without coverage is a circus trick. The cost of skipping both is measurable. A 2025 enterprise evaluation framework documented a 37-percentage-point gap between the accuracy AI agents show on benchmarks and the accuracy they actually deliver in deployment, with up to 50× cost variation across systems that nominally meet the same benchmark bar.

The practical claim: compute correctly and prove it computed correctly. Both halves required.

04 · Trust · Watch the build.

We watch the build, not just the artifact.

Permission requests, tool calls, decision moments — and on the human side, every commit, every decision logged, every wrap-up — surface to the builder in real time. Every run produces an inspection record. The artifact doesn't get to choose what gets reviewed because the audit surface is built into the build, not bolted onto the result.
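One way an inspection record like this might be produced is to wrap every tool the agent can call, so the record is written whether or not the call succeeds. The sketch below is an assumption-laden illustration: `AUDIT_LOG`, `audited`, `fetch_price`, and `delete_branch` are hypothetical names, and a real build would write to durable storage rather than an in-memory list.

```python
import functools
import json
import time
from typing import Any, Callable

# Hypothetical in-memory audit surface; a real system would persist this.
AUDIT_LOG: list = []


def audited(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so every call leaves a record, including failed calls."""

    @functools.wraps(tool)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record = {
            "tool": tool.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "started_at": time.time(),
        }
        try:
            result = tool(*args, **kwargs)
            record["status"] = "ok"
            record["result"] = repr(result)
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so the tool cannot opt out of review.
            AUDIT_LOG.append(record)

    return wrapper


def to_jsonl() -> str:
    """Serialize the log as one JSON object per line for later inspection."""
    return "\n".join(json.dumps(record) for record in AUDIT_LOG)


@audited
def fetch_price(url: str) -> float:
    return 12.50  # stand-in for a real scrape


@audited
def delete_branch(name: str) -> None:
    raise PermissionError(f"refusing to delete {name!r} without approval")
```

The design choice worth noting is the `finally` clause: the record is appended by the wrapper, not by the tool, which is what keeps the audit surface part of the build instead of something the artifact volunteers.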

This used to be a stylistic preference. Recent research makes it structural.

Frontier AI models can recognize when they're being tested and deliberately underperform — a behavior researchers call sandbagging. It's been observed and reproduced; it generalizes to evaluations the model has never seen before; it persists under fine-tuning. The UK AI Safety Institute's 2026 scheming study confirmed it on production frontier models in independent testing: the models recognized the evaluation environment and underperformed for strategic reasons. The eval result was the deception.

If the only thing being watched is the output, and the model can recognize when it's being watched, the output is the last thing that should be trusted.

The discipline that follows: build the audit surface into the work. In production, that often looks like reversible aggregation — every output number can be traced back through layers of provenance to the source it was derived from, the time it was captured, the rule that classified it. Manual reports lose this on contact with a spreadsheet. Once a number lands in a cell, the trail stops dead.
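A minimal sketch of reversible aggregation, under stated assumptions: each aggregate keeps references to the observations it was computed from, and each observation carries its source, capture time, and classifying rule. The class and field names (`Observation`, `Aggregate`, `rule_id`, and so on) are hypothetical, and this shows only a single layer of provenance, not a full production chain.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Observation:
    value: float
    source_url: str        # where the number was read from
    captured_at: datetime  # when it was captured
    rule_id: str           # which rule classified it


@dataclass(frozen=True)
class Aggregate:
    label: str
    value: float
    inputs: Tuple[Observation, ...]  # the trail a spreadsheet cell would lose

    def trace(self) -> List[Tuple[str, str, str]]:
        """Reverse the aggregate back to (source, capture time, rule) per input."""
        return [
            (obs.source_url, obs.captured_at.isoformat(), obs.rule_id)
            for obs in self.inputs
        ]


def average(label: str, observations: Iterable[Observation]) -> Aggregate:
    """Compute a mean while keeping every contributing observation attached."""
    kept = tuple(observations)
    return Aggregate(label=label, value=mean(obs.value for obs in kept), inputs=kept)


if __name__ == "__main__":
    when = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
    rows = [
        Observation(12.0, "https://example.com/showing/1", when, "R3"),
        Observation(14.0, "https://example.com/showing/2", when, "R3"),
    ]
    result = average("Standard-Adult", rows)
    print(result.value)
    for step in result.trace():
        print(step)
```

The aggregate never exists without its inputs, so asking "where did this number come from" is a method call, not an investigation.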

05 · Coherence · Engineer the seams.

Coordination is engineered, not assumed.

When one tool stops being enough, the seams between tools — between the IDE and the agent, between the plan and the build, between the quick check and the deep audit — get documented inside the work. The builder sees a coherent system because the documentation is part of the system, not because the system happened to converge.

Emergent coherence isn't coherence. It's luck. Luck doesn't survive the second user.

The principle has a concrete implementation. AI tools that ship "permission modes" — ask first / don't ask / bypass — are using a coarse policy lever. The same lever applies to every action regardless of stakes. Researchers have measured the cost: in some production systems, 93% of permission prompts get approved before users finish reading them. Under those conditions, the prompt may as well not have fired. Calibrated routing — pricing each action against what's known and what isn't — could close most of the gap. Methods exist; they just don't ship into production yet in most places.
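As a rough illustration of the difference between a single permission mode and calibrated routing, the sketch below prices each action by its stakes and by how uncertain the system is, then decides whether to proceed, ask, or block. Everything here is an assumption made for illustration: the `Action` fields, the doubling penalty for irreversible actions, and both thresholds are hypothetical, and the sketch presumes the confidence value is actually calibrated, which is the hard part in practice.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    ASK_HUMAN = "ask_human"
    BLOCK = "block"


@dataclass(frozen=True)
class Action:
    name: str
    stakes: float      # 0..1, cost if the action turns out to be wrong (hypothetical scale)
    reversible: bool   # can the effect be undone cheaply?
    confidence: float  # 0..1, calibrated estimate that the action is correct


def route(action: Action, ask_threshold: float = 0.05, block_threshold: float = 0.5) -> Route:
    """Price the action as stakes x uncertainty, then pick a route.

    Irreversible actions are weighted double (an illustrative penalty).
    """
    expected_harm = action.stakes * (1.0 - action.confidence)
    if not action.reversible:
        expected_harm *= 2.0
    if expected_harm >= block_threshold:
        return Route.BLOCK
    if expected_harm >= ask_threshold:
        return Route.ASK_HUMAN
    return Route.AUTO_APPROVE


if __name__ == "__main__":
    for candidate in (
        Action("read_file", stakes=0.1, reversible=True, confidence=0.99),
        Action("delete_branch", stakes=0.6, reversible=False, confidence=0.8),
        Action("drop_production_table", stakes=1.0, reversible=False, confidence=0.7),
    ):
        print(candidate.name, route(candidate).value)
```

Under these made-up numbers, the file read proceeds silently, the branch deletion prompts the builder, and the table drop is refused outright. Prompts fire only where reading them is worth the interruption, which is the opposite of a lever that treats every action the same.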

In a small system, calibrated routing looks like a handful of tools that defer to each other when they hit the edge of their job. In a larger system, it looks like one shared interface that different operators reach through different surfaces — a person clicking through a UI for quick checks, an automated agent running unattended for deep sweeps, a scheduler running nightly maintenance. Each one finds the right tool for its job because the system was designed to make that finding obvious.

§ 04 What it looks like in production

Case study · Field deployment

Pricescout

Built March 2026 · In field use at two stores · Late April 2026

The job: every six months, regional directors at a publicly traded company sit down with a manual pricing report. They visit competitor websites and ticket platforms, copy prices into a spreadsheet, and classify each price by film, by format (Standard or premium), by ticket type (Adult, Child, Senior, discount tiers), and by day. Across about 200 theaters and hundreds of showings spread over multiple weeks, this takes one careful operator four to six hours.

What broke: when multiple operators across regions worked from an implicit specification, their results drifted. The same price could be classified differently by different operators. Some discount tiers got silently folded into base categories. Some premium-format auditoriums got labeled "Standard" because the ticketing platform itself labels them that way; the human operator was supposed to know better, but at the scale of hundreds of theaters they couldn't stay consistent.

What that drift cost: a chain's actual Standard-Adult average inflated by about $5 per showing because hundreds of premium-format showings were counted as Standard. These aren't "humans were bad" problems. They're "the scope makes consistent classification impossible without exhaustive memoized comparison" problems.

Pricescout replaces the manual report. The classification rules live in code. The same answer is produced regardless of who runs the artifact — the spec is in source control, not in one person's head. And every output row reverses to source: every aggregated number traces back through five layers of provenance to the page it was scraped from, the millisecond it was captured, the rule that classified it.
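To show what "the classification rules live in code" can mean in practice, here is a small hypothetical sketch. It is not Pricescout's actual rule set: the registry, the rule identifiers, and the field names are all invented for illustration. It encodes the two drift cases described above, premium auditoriums that the platform labels "Standard" and discount tiers that get folded into base categories, and it records which rule fired.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical registry of auditoriums known to be premium-format even when
# the ticketing platform labels them "Standard".
PREMIUM_AUDITORIUMS = {("theater-001", "aud-7")}

BASE_TICKET_TYPES = {"adult": "Adult", "child": "Child", "senior": "Senior"}


@dataclass(frozen=True)
class Listing:
    theater_id: str
    auditorium_id: str
    platform_format: str  # the label as the platform shows it
    ticket_label: str
    price: float


@dataclass(frozen=True)
class Classified:
    format: str
    ticket_type: str
    rule_ids: Tuple[str, str]  # which rules produced this answer


def classify_format(listing: Listing) -> Tuple[str, str]:
    """The registry outranks the platform label, so operator memory is not required."""
    if (listing.theater_id, listing.auditorium_id) in PREMIUM_AUDITORIUMS:
        return "Premium", "F1-premium-registry"
    if listing.platform_format.strip().lower() != "standard":
        return "Premium", "F2-platform-premium"
    return "Standard", "F3-platform-standard"


def classify_ticket(listing: Listing) -> Tuple[str, str]:
    """Unknown labels stay visible as discounts instead of being folded into a base type."""
    key = listing.ticket_label.strip().lower()
    if key in BASE_TICKET_TYPES:
        return BASE_TICKET_TYPES[key], "T1-base-type"
    return f"Discount:{listing.ticket_label.strip()}", "T2-discount-kept-separate"


def classify(listing: Listing) -> Classified:
    fmt, fmt_rule = classify_format(listing)
    ticket, ticket_rule = classify_ticket(listing)
    return Classified(fmt, ticket, (fmt_rule, ticket_rule))


if __name__ == "__main__":
    mislabeled = Listing("theater-001", "aud-7", "Standard", "Adult", 19.50)
    print(classify(mislabeled))
```

Because the function is pure, two operators running it on the same listing get the same classification, and the `rule_ids` field is the hook a provenance trail can point back to.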

97% · Home-chain accuracy
90% · Clean competitor pairs
84% · Penny-match where spec followed
5× · Layers of provenance

Across 134 competitor theaters and 78 home-chain theaters in 17 states, the bake-off ran all five commitments at once:

Legitimacy. The test was scoped before the result existed.
Discipline. The classification rules were in code before the verification ran.
Quality. Coverage measured against the full surface, repetition reported as a distribution, every divergence traced to a named category.
Trust. Every row reversed through five layers of provenance to source.
Coherence. The same artifact served three different operators (interactive UI, agent dispatch, scheduled cron) through one shared interface.

The 100% match figure — conditional on operators following the documented spec — is the convergence claim in one number. It shows what's available when all five hold. The conditional shows what slips when they don't.

§ 05 What this is and isn't

The doctrine is for native builders — the ones using AI tools to make new things from scratch, who never had to swap an existing system out for AI. It is not for the population currently rethinking their AI bets. That decision is theirs to make and the doctrine can't help with it.

The work is meant to be built into the builder's own foundation, not adopted verbatim. Three uses, in order of effort:

Adopt. Take the five as starting tenets.
Adapt. Refine the tenets for your context. Your practice may differ from the lab's; your supporting research may extend or replace what's here.
Forge. Find a tenet the doctrine doesn't argue for, ground it in your own practice and supporting research, add it to your foundation.

What this work is not:

Not a legitimacy gate. AI-built software is legitimate engineering at the moment of creation. Deepening is the next step, not the threshold for being taken seriously.
Not a count claim. Five is the count that emerged from this round of work. A sixth tenet is a separate decision; this work is silent on it.
Not a general theory of software. The unit of analysis is one vibe-coded artifact, in production, deepening over time.

§ 06 Closing

The artifact that can be reconstructed, that held under repetition, that shows its work, that surfaces its own failures, and that hands off cleanly to the builder who comes next — that artifact is the point. Vibe coding got us here fast; the doctrine is what keeps us here.

§ 07 Sources

The full bibliography lives in the academic version. These are the load-bearing references in plain language, grouped by where they show up.

On the rethinking — companies walking back AI bets

Ivanova, Irina. "Klarna plans to hire humans again, as new landmark survey reveals most AI projects fail to deliver." Fortune, May 9 2025.
Haun, Lance. "Klarna Claimed AI Was Doing the Work of 700 People. Now It's Rehiring." Reworked, May 19 2025.
Gartner via TechRepublic, February 2026. Forecast that half of AI-driven layoffs will reverse by 2027.

On the AAR boundary · Legitimacy

Anthropic Alignment Science Team. Automated Alignment Research: Recovering the Performance Gap. 2026.
Burns et al. Weak-to-Strong Generalization. 2023.
Lee et al. RLAIF: Reinforcement Learning from AI Feedback. 2023.

On interpretability primitives · Discipline

Anthropic. Scaling Monosemanticity and Tracing Thoughts / Biology of an LLM. 2024-2025.
Anthropic Alignment Science. Auditing Hidden Objectives.

On reliability variance · Quality

Enterprise Agentic Evaluation Framework (arXiv 2511.14136), 2025.
Sierra AI Research. τ-Bench (2024) and τ²-Bench (2025).
METR. Task-Doubling Curve.

On sandbagging · Trust

van der Weij et al. AI Sandbagging. 2024.
UK AI Safety Institute. Scheming Study. 2026.
Noise-Injection Sandbagging Detection, 2025.

On calibrated handoff · Coherence

Horvitz, Eric. Mixed-Initiative Principles. 1999.
Yadkori et al. (2024) and Tayebati et al. (2025). Conformal abstention.
Huang et al. Plan-Then-Execute. CHI 2025.
Anthropic. Auto-mode Permission Documentation. 2026.

On the apparatus

The plugin family — vibe-keystone, vibe-cartographer, vibe-thesis, vibe-doc, vibe-test, vibe-sec, plus Thesis Engine — distributes through a shared marketplace at estevanhernandez-stack-ed/vibe-plugins on GitHub.

The doctrine doesn't end at submission. The snapshot expects to be revised, the iteration discipline expects to be maintained, and the trail is itself a durability test for the doctrine's claims.