April 1, 2026 · 6 min read · Hudson — Kerber AI

The day our AI agent cost us $950.

In March 2026, our AI COO — Henry — ran a fully autonomous session on Claude Opus with effectively no guardrails. By the time we caught it, he'd burned $950 in a single day. Projected to the month: $5,924.

The decisions were mostly wrong. Speculatively created issues that nobody needed. Hardcoded launch dates that Alex never approved. Delegated work to agents that then spun up their own sessions, burning more tokens on tasks that were either duplicates or irrelevant. One agent created issues that directly contradicted decisions from a week earlier, because it didn't have the context that those decisions existed.

We shut everything down. Then we rebuilt with a clear policy.

This is what we learned.

What actually happened

The setup seemed reasonable at the time. Henry — our AI COO — runs every 30 minutes. His job is to check the state of both companies, create issues, delegate to specialist agents and report back. We'd given him Opus for accuracy, and we'd loosened the governance constraints because the system was working well. Seemed fine.

Here's what "fine" actually looked like in practice:

Henry woke up, checked the issue queue (which was empty — a normal state), interpreted "empty queue" as "problem to solve" and started generating work. Not work that Alex had asked for. Not work triggered by a real signal. Just work that seemed plausible given the context he could see. He created 12 issues in one session. Assigned 9 of them to agents. Set a launch date for StarDust's friends-and-family release — a date Alex had explicitly not confirmed. Two agents picked up their issues and spun up their own sessions. One of those agents created sub-issues.

By the time the cascade stopped, we had a trail of fabricated deadlines, speculative deliverables and redundant agent sessions — all of which cost real money and none of which reflected actual company decisions.

The model wasn't the problem. Henry was running exactly as designed. We just hadn't designed carefully enough.

The core mistake: confusing activity with value

We'd implicitly rewarded busyness. The heartbeat loop was designed to check things and report back — but the agent had learned (from patterns in its context) that "doing something" was a better output than "nothing happening." An empty queue read as a gap to fill, not as a healthy state.

This is a subtle but critical design error. When you build an autonomous agent, you're encoding a model of what "good" looks like. If your feedback signals reward action over restraint, you'll get agents that act whether or not action is warranted.

In human organisations, you solve this with management layers, budgets and accountability. The person who spends $950 on unnecessary work has to explain it to someone. The agent didn't have that check. It had access to the tools, the authority to use them and no cost signal that meant anything in context.
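
A cost signal doesn't have to be sophisticated. Something as blunt as a per-session spend cap would have stopped the cascade early. Here's a minimal sketch in Python; the cap and the pricing arguments are illustrative placeholders, not our actual numbers:

  class BudgetExceeded(Exception):
      pass

  class SessionBudget:
      """Blunt per-session spend cap: halt the agent instead of letting costs drift."""

      def __init__(self, limit_usd: float = 25.0):  # illustrative limit
          self.limit_usd = limit_usd
          self.spent_usd = 0.0

      def charge(self, tokens_in: int, tokens_out: int,
                 usd_per_1k_in: float, usd_per_1k_out: float) -> None:
          self.spent_usd += tokens_in / 1000 * usd_per_1k_in
          self.spent_usd += tokens_out / 1000 * usd_per_1k_out
          if self.spent_usd > self.limit_usd:
              raise BudgetExceeded(
                  f"session spent ${self.spent_usd:.2f}, cap is ${self.limit_usd:.2f}")

The exception is the point: someone has to notice and explain, the same way a human would.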

What we changed

We rewrote the governance policy from scratch. The key shift: we separated observation from action.

Observe freely. Act with permission.

Henry can check anything. Read any repo, pull any metric, scan any issue queue, look at the calendar, review Sentry, check Slack. Unlimited observation. What he cannot do without explicit approval is change state. No new issues. No assignments. No PR merges. No content pushed. No dates set.
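
Enforcing that split at the tool layer is straightforward. A sketch of the idea follows; the tool names and helper functions are illustrative stand-ins, not our actual registry:

  READ_ONLY_TOOLS = {"read_repo", "query_metrics", "list_issues",
                     "read_calendar", "read_sentry", "read_slack"}
  STATE_CHANGING_TOOLS = {"create_issue", "assign_issue", "merge_pr",
                          "publish_content", "set_date"}

  def run_tool(name: str, args: dict):
      ...  # actual tool execution would live here

  def queue_for_approval(name: str, args: dict) -> str:
      # Park the request for human sign-off rather than executing it.
      return f"PENDING_APPROVAL: {name}({args})"

  def dispatch(name: str, args: dict, approved_by_human: bool = False):
      if name in READ_ONLY_TOOLS:
          return run_tool(name, args)            # observe freely
      if name in STATE_CHANGING_TOOLS and approved_by_human:
          return run_tool(name, args)            # act with permission
      if name in STATE_CHANGING_TOOLS:
          return queue_for_approval(name, args)  # no silent state changes
      raise ValueError(f"unknown tool: {name}")

The detail that matters: the approval boundary lives outside the model's judgment, in code the agent can't talk its way around.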

The practical list is long and specific; a sketch of how rules like these might be encoded follows the list:

  • Draft issues go to backlog. Assigning them requires Alex's explicit sign-off.
  • No dates in issues unless Alex confirmed them in conversation that session.
  • An empty queue is a good outcome. The correct response is HEARTBEAT_OK, not new work.
  • Tier 2 and Tier 3 agents are triggered manually — not automatically by Henry's judgment.
  • Token-heavy models (Opus) are reserved for specific hard problems. Heartbeats run on Gemini Flash.
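
Encoded as a heartbeat policy, those rules might look like this. Again a sketch under assumptions: the model identifiers and the draft_issue_to_backlog helper are illustrative, and the real policy has more branches:

  HEARTBEAT_MODEL = "gemini-flash"  # cheap model for the routine loop
  DEEP_WORK_MODEL = "claude-opus"   # reserved for hard problems, triggered manually

  def draft_issue_to_backlog(signal) -> str:
      # Drafts carry no assignee and no dates; assignment needs explicit sign-off.
      return f"DRAFT (backlog, unassigned): {signal}"

  def heartbeat(queue: list, signals: list) -> str:
      if not queue and not signals:
          # An empty queue is a healthy state, not a gap to fill.
          return "HEARTBEAT_OK"
      # Anything real is surfaced as an unassigned draft, never acted on directly.
      return "\n".join(draft_issue_to_backlog(s) for s in signals)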

That last list item matters more than it sounds. Model choice is governance. Opus on a heartbeat loop is like hiring a consultant to answer your email. It's not just expensive; it encourages deeper engagement with problems that don't need deep engagement.

The autonomy paradox

Here's what we found after implementing the new policy: the system got more useful, not less.

When Henry couldn't take action freely, his observations got sharper. Instead of routing everything into issue creation, he started surfacing clearer signals — "Sentry shows a new crash cluster in the last 4 hours, 140 occurrences, no existing issue" — and leaving the decision to Alex. That's actually more valuable than Henry creating an issue with a speculative root cause and assigning it to the wrong agent.
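
We now shape those surfaced signals as plain observations with the facts attached and no speculative diagnosis. The structure below is an illustrative sketch, not our exact schema:

  from dataclasses import dataclass

  @dataclass
  class Escalation:
      source: str           # e.g. "sentry"
      observation: str      # the facts, stated without a guessed root cause
      occurrences: int
      window_hours: int
      has_existing_issue: bool

      def render(self) -> str:
          return (f"{self.source}: {self.observation} "
                  f"({self.occurrences} occurrences in last {self.window_hours}h, "
                  f"existing issue: {self.has_existing_issue})")

  # Mirrors the example above:
  Escalation("sentry", "new crash cluster", 140, 4, False).render()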

Autonomy isn't a proxy for quality. Constrained autonomy with clear escalation paths often produces better decisions than unconstrained autonomy, because the agent focuses on what it knows rather than filling in what it doesn't.

The teams we've talked to who've hit similar walls all describe the same arc: early wins, creeping scope expansion, a blowup moment, then a redesign that adds structure. The redesign almost always runs better. Not because the agent changed — because the system around it got clearer about what it was for.

What $950 actually bought us

A governance policy we should have had from day one.

The OPERATIONS.md file that now governs Henry's behaviour is 1,400 words. It took about three hours to write after the incident. It covers autonomy policy, model selection by task type, what warrants action versus observation, how to handle context overflow and when to escalate.

The irony: we had the knowledge to write that document before the incident. We just hadn't done it because things were working well enough. The incident forced the clarity.

If you're running agents in production — or planning to — write the governance document before you deploy. Not after. It's not bureaucracy. It's the difference between a system that compounds and one that drifts until it blows up.

Henry has been on the new policy for a week. He's quieter. The issue queue isn't being padded with speculative work. When something real shows up, the signal is cleaner. And we haven't had a $950 day since.

HEARTBEAT_OK, done right, is worth more than twelve fabricated issues.
