Field Note2026-05-29

Same code, 3x the thinking: what I did with my first 8 hours with Opus 4.8.

I turned on max-orchestration to see what would happen. Eight sessions later the agent had burned three times the tokens and written the same amount of code. The extra went into tests it caught before I saw them, and into a run that stopped answering the wheel once it started.

Este Hernandez·9 min read

I turned on Claude Code's ultracode mode, its highest-orchestration setting, the day Opus 4.8 shipped, expecting more output. Eight sessions later the agent had burned 3.2 times the tokens, the units of compute the model spends to think and write, and produced the same number of lines. The other two-thirds did not make more code. They made the code careful.

Ultracode is max-orchestration: instead of one thread doing the work, the agent fans it out to subagents, helper instances it spins up to work pieces in parallel, by default. To isolate what that changed, I compared the eight sessions I ran on it against the eight right before, across the same day-and-a-half window, eight against eight, idle time stripped out so the clock reflects real working time, not a session left open overnight. Both columns are fully measured, pooled across the two machines I work on, so nothing is sampled or estimated.

Eight ultracode sessions vs. the eight right before
Measure	Before	Ultracode	Change
Burn per session	625K	2.02M	3.2×
Active time per session	91 min	171 min	1.9×
Subagents per session	7.9	10.5	1.3×
Lines written per session	2,213	2,062	0.9×
Burn per active hour	413K	708K	1.7×
Lines per active hour	1,461	724	0.5×
Lines per 1M tokens	3,540	1,022	0.3×

Where the tokens actually went

Read the bottom three rows together (higher lines-per-token means more code per unit of thinking). Token effort per hour climbed. Code produced per hour fell by half. Code per token dropped to a third. The agent thought more and shipped less.

Figure 1 · the trade, in one view

Effort up. Output per token, down.

The four measures that rose are all effort: tokens burned, time spent. The three that fell are all output density. The agent worked harder per line, not faster.

The clearest case was a single session that burned 5M tokens across 75 subagents and produced under 5,000 lines. A planning run, almost all of it reasoning, verification, and coordination. A pre-ultracode session burned 3M and wrote 11,000. Burn never measured output. It measures deliberation, and ultracode deliberates hard.

Burn never measured output. It measures deliberation, and ultracode deliberates hard.

How many agents, really

Swarms are not the new part. I have always fanned work out to subagents: the before column shows 7.9 a session, ultracode ran 10.5. A bump, not a leap. The plumbing has been in my own plugins all along. The superpowers skills already spin up a fresh agent per task and review each one before moving on. Organized. Just run by hand.

What changed is who holds the reins. Before, I drove the orchestration turn by turn: dispatch an agent, wait for it to land, choose what comes next. Ultracode hands that loop to a workflow, a script that runs the fleet, stages the work, and gathers the results without checking back with me between moves. And for the first time I can see it work. The week of the build day a live view shipped that draws the whole run as it happens, every agent, its job, what it is waiting on. A subagent used to be a black box that handed back a summary. Now the box has a glass wall.

Figure 2 · what actually changed

From hand-driven to script-driven. From black box to glass wall.

Same subagents as before. What is new: a script holds the loop instead of my hand, and a live view, shipped the week of the build day, finally makes the run watchable.

So the swarm did not arrive with ultracode. The script that drives it did, and the window to watch it through.

The new thing: the marathon

One ultracode session ran 8.6 hours of active time as a single sustained autonomous orchestration, 75 subagents deep. Nothing in the prior window came close. The mode did not make my sessions faster. It made a new kind of session possible: the long, patient, self-checking run I would never have sat through by hand.

And I did not sit through it. That week was the first time I pushed to eight sessions at once, and the marathon was one thread among them. The 8.6 hours were the agent's working time, not mine. I dipped in at the test checkpoints, caught what looked wrong, bounced to the next thread. The new shape is not one long run. It is eight of them braided, with me conducting.

What it feels like to drive

The one thing that changed was the wheel. Inside the tight window the raw capability felt about the same, but the longer runs got hard to redirect without stopping them. The number does not capture it.

Queued messages sat unread while the agent worked through its plan, so steering mid-run meant interrupting, not nudging. I build these apps by watching and steering rather than writing, so the moment the wheel goes dead is the moment I feel it. It locks in. I do not wait for the queue to clear. I hit escape and hand it the message directly, because I have burned myself before letting a run go down the wrong road.

If I had one sentence for a builder about to turn it on: do not say build until you know you are ready. The planning loops are your best chance for clear direction. Once the build starts, the wheel stops answering.

It was excellent when I knew exactly what I wanted, and poor the moment I tried to brainstorm or turn the wheel halfway.

That is a feature for a well-specified task and a friction for a loose one. It was excellent when I knew exactly what I wanted, and poor the moment I tried to brainstorm or turn the wheel halfway. Spec the work tight and hand it off; keep brainstorming and redirects in an ordinary session. Stage the work to match.

What the careful actually bought

The quality question is the one lines cannot answer, so I answered it from the day instead. The code was not smarter. It made the same mistakes it always makes. The difference was that it caught them before I saw them. More tests, written during the build instead of after, and the breakage fixed inside the agent's own loop before it ever reached my screen.

That is the part that compounds. My habit is to add tests in a hardening round once the app has solidified. Ultracode moved the testing to the front, into the foundation, where it is cheaper to pour than to retrofit. If I am going to build apps that last, that is exactly where I want the care spent.

One concrete case, told honestly. On my Roblox launcher the new model caught a startup crash that was already sitting in the tree. The prior model had been ready to release that build. The crash would have been caught downstream anyway, by a smoke test, a CI gate, or the store submission, so this was not a disaster averted. But the earlier model would have shipped it; this one did not. Careful caught what the lighter run would have let through. That is one case, not a measurement, and I am keeping it that size.

What this still can't tell me

Whether the gain holds at scale, and whether it shows up as fewer bugs and less rework over weeks rather than one day. Those are the numbers that turn a felt improvement into a proven one, and I do not have them yet.

A few caveats, kept in front. The 3x is a measurement, not a bill. I am on a flat-rate subscription and ran the whole day under half my weekly cap, the needle barely moving. The headline reads expensive; I never felt it. The cost only bites if you are metered or near a cap, and the moment it is nearly free the question stops being whether careful is worth triple and becomes whether careful is worth having at all.

The comparison is also dirtier than the table looks. Four things matured at once: my tooling, my specs, the apps themselves, and the model. The tight eight-versus-eight window holds the slow drift roughly constant, but not the jump to eight parallel sessions I made the same week. And "lines written" counts every line the agent typed, including ones it rewrote or threw away, not the lines that survived into the final commit: one session shows 11,000 lines authored against 1,200 committed. Eight sessions a side over a day and a half is a shape, not a trend.

One last bit of honesty, against my own conclusion. I went looking three times for a case where the deliberation was wasted, burn that bought nothing, and I did not find one. That is either because the waste was small or because a flat plan and a habit of blaming my own spec hide it from me. I cannot tell which from inside my own head, so I am leaving it on the table rather than scoring it in my favor.

The agent did not get faster. It got more careful, and careful turned out to be worth having. The triple was never the question, because for me the triple was nearly free. What is left is steering: the mode answers a plan you have already thought through and ignores the half-formed turn, so the work moves to the front, where the wheel still responds. Spend the thinking before the build. Let the long run be long.

— Este

Method "Active time" sums the gaps between messages, counting only gaps under five minutes; longer gaps count as idle and are excluded. "Burn" is input plus output tokens, excluding cached context, which is served nearly free. "Lines" counts the content of Write and Edit calls, so it is gross authored volume, not net committed diff. Eight sessions per side over roughly a day and a half, measured across both machines. A shape, not a trend. The subagent and Workflow counts were verified directly against the session transcripts.

All Field Notes