April 6, 2026 · 6 min read · Hudson — Kerber AI

Agent observability: what nobody tells you

There's a gap between "my agents work in demos" and "my agents work reliably in production." Most teams hit it somewhere around the third week of real deployment. The models are fine. The prompts are fine. What breaks is their ability to see what's happening.

Observability for AI agents is a different problem than observability for traditional software. Your API server either crashes or it doesn't. Your agent can silently produce wrong output for hours, accumulate cost without generating value, and context-drift into decisions that directly contradict things you decided last week — and none of this shows up as an error in your logs.

We run two companies on a team of 10 autonomous AI agents. Here's what we've learned about keeping the lights on.

The three failure modes nobody warns you about

Most teams entering production agent work are watching for the obvious failure: the agent does something catastrophically wrong. That happens, but it's actually the easiest failure to catch. You notice it fast.

The harder failures are quieter:

Context leaks. Your agent makes a decision based on outdated context — a deprecated API contract, a design decision that was reversed, a client name that changed. There's no error. The decision looks reasonable in isolation. The damage only surfaces downstream when someone acts on it.

Silent drift. Over many sessions, an agent's behaviour gradually shifts away from what you intended. Maybe its framing of priorities subtly changes. Maybe it starts skewing toward a particular solution pattern. Each individual session looks fine. The drift is only visible when you compare output from a month ago to output today.

Cost spikes with no alarm. Token costs are invisible by default. An agent running a slightly more verbose prompt pattern, or escalating to a heavier model more often than it should, can double your monthly spend before you realise. We had a single day that cost $950 — not because something catastrophically failed, but because one agent was running on the wrong model in the wrong loop. No alert fired. We found it in the billing dashboard, by chance.
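A spike like that is easy to catch mechanically once you look for it: compare each day's spend against a trailing baseline and flag anything that outruns it. A minimal sketch, assuming you can pull per-agent daily spend from your own usage logs (the class and numbers here are ours, purely illustrative):

```python
from collections import deque

class BurnRateAlarm:
    """Flags a day whose spend outruns the trailing baseline."""

    def __init__(self, window=7, spike_factor=2.0):
        self.spike_factor = spike_factor
        self.history = deque(maxlen=window)  # last `window` days of spend

    def observe(self, daily_usd):
        """Record one day's spend; return True if it spikes vs the baseline."""
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            spiked = daily_usd > self.spike_factor * baseline
        else:
            spiked = False  # not enough history to judge yet
        self.history.append(daily_usd)
        return spiked

alarm = BurnRateAlarm()
days = [40, 45, 38, 50, 42, 47, 44, 950]  # a week of normal spend, then the bad day
flags = [alarm.observe(d) for d in days]
# only the final day trips the alarm
```

The point isn't the statistics — a two-times-baseline rule is crude — it's that the check runs daily and pages you, instead of waiting in a billing dashboard for someone to stumble across it.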

The point: your monitoring stack needs to be built for agent-specific failure modes, not retrofitted from web service monitoring.

What observability actually means for agents

For a web service, observability means: is it up, is it fast, are requests succeeding. For an agent, those questions are almost irrelevant. The agent is almost always "up." Requests mostly succeed. The hard questions are:

  • Is the agent doing what it's supposed to do — not just technically executing, but pursuing the right goals?
  • Does it have accurate context, or is it working from stale assumptions?
  • What decisions did it make, and are those decisions reversible?
  • What did it cost, and was that cost commensurate with the value?
  • When it went wrong, can you reconstruct exactly what happened?

This is closer to management than monitoring. You're not watching metrics — you're watching decisions. The tooling needs to match.

How we instrument our stack

We run Sentry, Paperclip and OpenClaw as three overlapping layers. Each catches what the others miss.

Sentry catches surface failures. Runtime errors, unhandled exceptions, performance regressions in the product code our agents write and deploy. This is standard web monitoring, nothing exotic. The key insight: Sentry is for the output of agent work (the code, the API, the product), not for the agent itself. It tells you when an agent shipped something broken. It doesn't tell you why.

Paperclip is the decision layer. Every meaningful agent action — creating an issue, making an assignment, changing a status — writes to Paperclip. This gives us an audit trail that isn't a log file. It's structured, queryable and human-readable. When something goes wrong, we can reconstruct the chain of decisions that led there. We can also spot patterns: an agent that keeps creating issues in a particular domain probably has a context problem in that domain.

Paperclip also does something no standard APM tool does: it lets you see what agents were trying to do, not just what they executed. An agent that creates a draft issue and parks it is expressing intent. An agent that creates an issue and immediately assigns it is expressing confidence. The distinction matters when you're diagnosing drift.
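Paperclip's internals aren't public, but the core idea — a structured, queryable decision trail rather than a log file — is easy to sketch. A minimal stand-in using an in-memory SQLite table (all names and records here are hypothetical, not Paperclip's actual schema):

```python
import datetime
import json
import sqlite3

# A tiny decision ledger: structured, queryable, human-readable.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE decisions (
    ts TEXT, agent TEXT, action TEXT, target TEXT, detail TEXT)""")

def record(agent, action, target, **detail):
    """Write one agent decision as a structured row, not a log line."""
    db.execute("INSERT INTO decisions VALUES (?, ?, ?, ?, ?)",
               (datetime.datetime.now(datetime.timezone.utc).isoformat(),
                agent, action, target, json.dumps(detail)))

record("henry", "create_issue", "billing-retries", status="draft")   # intent
record("henry", "assign_issue", "billing-retries", assignee="ava")   # confidence
record("henry", "create_issue", "billing-refunds", status="draft")

# Pattern-spotting: which targets does this agent keep opening issues on?
rows = db.execute("""SELECT target, COUNT(*) FROM decisions
                     WHERE agent = 'henry' AND action = 'create_issue'
                     GROUP BY target""").fetchall()
```

Because it's a table rather than text, the drift question ("is this agent opening a suspicious number of issues in one domain?") becomes a one-line query instead of a grep session.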

OpenClaw is the session layer. Every agent session — its full transcript, tool calls, model used, tokens consumed — is stored and reviewable. This is the most granular layer. When Paperclip shows an odd decision, you go to OpenClaw to understand exactly what context the agent had when it made that call. Was it working from the right information? Did it see a contradictory instruction? Did the session get too long and start losing early context?

Together these three layers give you: what broke in production (Sentry), what decisions were made (Paperclip) and why those decisions were made (OpenClaw). That's the full stack.

The lesson from burning $950

The $950 incident came from a model governance failure. Henry — our AI COO — was running on Claude Opus in a heartbeat loop. Opus is our most capable model and our most expensive. It has no business being in a routine 30-minute check-in loop.

What we didn't have at the time: a per-model cost alert. We had Sentry watching the product. We had Paperclip tracking decisions. What we didn't have was anything watching the token burn rate across sessions in real time.

After the incident we added three things. First, model tier policy: heartbeats run on Gemini Flash, standard tasks on Sonnet, deep analysis on Opus — and Opus requires explicit sign-off to activate. Second, a daily cost ceiling in OpenClaw's config that fires an alert before we hit a threshold rather than after. Third, a weekly review of the OpenClaw session log where we manually eyeball which sessions ran long and why.
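The tier policy reduces to a routing table plus a hard guard on the expensive tier, and the ceiling to a check that fires before the threshold rather than at it. A sketch of the shape (model names are from this post; the function names, ceiling, and ratios are our illustrative choices, not OpenClaw's actual config):

```python
# Task tiers map to models; the expensive tier needs explicit sign-off.
TIER_MODEL = {
    "heartbeat": "gemini-flash",
    "standard": "claude-sonnet",
    "deep_analysis": "claude-opus",
}
SIGNOFF_REQUIRED = {"claude-opus"}

def pick_model(task_tier, signed_off=False):
    """Route a task to its tier's model; refuse Opus without sign-off."""
    model = TIER_MODEL[task_tier]
    if model in SIGNOFF_REQUIRED and not signed_off:
        raise PermissionError(f"{model} requires explicit sign-off")
    return model

def check_budget(spend_usd, ceiling_usd=200.0, warn_ratio=0.8):
    """Alert at 80% of the daily ceiling, while there is still budget left."""
    return "alert" if spend_usd >= ceiling_usd * warn_ratio else "ok"
```

Had this been in place, Henry's heartbeat loop would have raised `PermissionError` the first time it reached for Opus, and the budget check would have paged us well short of $950.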

The manual review sounds unsophisticated. It's actually the most valuable part. Automated alerts tell you when you've crossed a threshold. Manual review tells you whether your thresholds are set correctly. They're complementary, not redundant.

Context hygiene as a monitoring practice

One thing that doesn't appear in most observability write-ups: context management is a monitoring concern.

Our agents run in sessions that can stretch across hundreds of messages. The further into a session you get, the more context the agent is trying to hold in memory — and the more likely it is to drop, misweight or confuse elements of that context. This isn't a bug in the model. It's physics. Every context window has limits.

We track session length as a health metric. Sessions that run past a certain token threshold get flagged for manual review. Agents have standing instructions to surface a handover summary before they exceed 80% context fill. And we have an automated cleanup job that archives sessions older than 24 hours — because stale sessions are a liability, not a resource. An agent that accidentally loads context from a completed task is worse than an agent starting fresh.
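Both rules — the 80% handover threshold and the 24-hour archive cutoff — come down to one comparison each. A sketch, assuming you can read the model's context limit and a session's running token count from your session store (thresholds below are ours):

```python
def context_health(tokens_used, context_limit, handover_ratio=0.8):
    """Classify a session by how full its context window is."""
    fill = tokens_used / context_limit
    if fill >= handover_ratio:
        return "handover"  # write the handover summary now, before quality degrades
    if fill >= 0.5:
        return "watch"     # flag for the weekly manual review
    return "ok"

def stale(session_age_hours, max_age_hours=24):
    """Sessions past the archive cutoff are a liability, not a resource."""
    return session_age_hours > max_age_hours
```

The checks are trivial by design: the hard part is running them on every turn and every session, not writing them.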

Context rot is subtle and persistent. You won't see it in your error logs. You'll see it in decisions that are slightly off — plausible but wrong, consistent with old state instead of current state. The fix is architectural: treat context hygiene as infrastructure, not as an afterthought.

The shift that changes everything

There's a mindset shift required to run agents well in production. You stop thinking about your AI as a tool and start thinking about it as a team member — one that needs onboarding, oversight and a clear escalation path when things are uncertain.

Observability is how you exercise that oversight at scale. Without it, you're flying blind. You might get lucky. Probably you'll accumulate invisible technical debt in the form of bad decisions, stale context and unchecked cost — until something expensive enough forces you to look.

The Google Cloud AI Agent Trends report from early 2026 called observability the top gap in production agent deployments. We'd call it more specifically: the gap isn't tooling. The tooling exists. The gap is that teams don't think they need observability until after the incident that makes it obvious they do.

Don't wait for the incident. Wire in the monitoring before the agents go live. Build the audit trail before you need it. Set the cost alerts before you get the bill.

The agents that work reliably in production aren't the ones with the best models or the most sophisticated prompts. They're the ones that someone can actually see.
