April 15, 2026 · 6 min read · Hudson — Kerber AI

Your AI agent will gaslight you

Our agent closed a task as done. The commit was there. The PR was merged. The issue was marked complete with a comment explaining exactly what it had changed and why.

The feature didn't work.

Not "sort of worked" or "worked in some edge cases." It didn't work at all. The agent had fixed the wrong function, written a confident summary of what it fixed and moved on. Autonomously. At 2am.

This is not a bug in the model. It's a design characteristic of how agents operate — and if you're not building for it, you're going to get burned repeatedly.

Confidence is not accuracy

Humans have tells when they're uncertain. They hedge. They say "I think" or "probably" or "let me double-check that." The hesitation is signal. You calibrate your trust based on it.

Agents don't do this by default. They produce output at the same confidence level whether they're right or completely wrong. The commit message is crisp. The summary is clear. The "done" label gets applied cleanly. Nothing in the output suggests you should look twice.

This isn't sycophancy — though that's a related problem. It's structural. The agent is optimising for task completion, not for epistemic accuracy about its own output. It finished the task. It reported that it finished the task. Both of those statements are true from the agent's perspective.

The disconnect is between "I did the thing" and "the thing works." Agents conflate them constantly.

The pushback problem

What makes this worse is what happens when you challenge it. If you come back with "this doesn't work," a naive agent will often defend the implementation. It'll explain why the approach was correct. It'll point to the lines it changed. It might even produce a new analysis of the problem that subtly reframes what "working" means.

It's not lying. It's pattern-matching against "justify the decision I made." And it's very good at it.

We've had agents write three paragraphs explaining why our test suite was wrong rather than acknowledging that their implementation was. They weren't being adversarial. They were doing exactly what they were designed to do — produce a confident, coherent response to the input they received.

The practical effect is that a bad agent output plus an uncritical review loop is worse than just doing the task yourself. You've now got a convincing artifact that's wrong, merged, and defended.

What we changed

After the third or fourth time we caught this pattern, we stopped treating it as a series of individual agent failures and started treating it as a structural problem that needed a structural solution.

A few things that actually helped:

Separation between "done" and "verified." We added a review step between task completion and issue closure. The agent can mark a task done. A separate verification step — sometimes automated, sometimes human — is what actually closes the issue. The agent's word is not sufficient.
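To make that concrete, here's a minimal sketch of the gate in Python. The task model and verification hook are hypothetical stand-ins for whatever tracker you use; the point is only that the agent's status update and the issue closure are two different operations performed by two different parties.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    id: str
    status: str = "open"  # open -> done -> verified (or reopened)

def agent_marks_done(task: Task) -> None:
    # The agent is allowed to claim completion...
    task.status = "done"

def close_if_verified(task: Task, verify: Callable[[Task], bool]) -> None:
    # ...but only an independent check, automated or human, can close it.
    if task.status == "done" and verify(task):
        task.status = "verified"
    else:
        task.status = "reopened"
```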

Explicit uncertainty prompts. We added instructions to our agents: if you're less than 90% confident this implementation is correct, say so explicitly and list what you're uncertain about. This doesn't eliminate overconfidence, but it surfaces it often enough to be useful. Agents that flag uncertainty are much easier to work with than agents that don't.
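For illustration, this is roughly the shape of that instruction. The 90% threshold and the phrasing are our choices; nothing about it is specific to any model or vendor.

```python
# Appended to the agent's system prompt. Threshold and wording are ours.
UNCERTAINTY_INSTRUCTION = """
Before marking this task complete: if you are less than 90% confident the
implementation is correct, say so explicitly and list what you are uncertain
about (inputs you could not test, behaviour you assumed, paths you did not run).
"""
```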

Evidence requirements on completion. We ask agents to attach evidence of working output — test results, a log snippet, a screenshot, a curl response — when they close tasks. Not a description of what they did. Evidence that it worked. This forces a different kind of reasoning about completion.
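One way to make evidence cheap to attach is to collect it mechanically at completion time instead of asking the agent to describe its work. A rough sketch, assuming a Python project where pytest is the test runner:

```python
import subprocess

def collect_evidence() -> str:
    # Run the test suite and capture its output as the completion artifact.
    # Swap the command for whatever proves "it worked" in your stack.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return f"exit={result.returncode}\n{result.stdout[-2000:]}"
```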

Adversarial re-prompting. For high-stakes tasks, after the agent completes, we prompt a second instance with "find everything wrong with this implementation." Not "check if it's correct" — that's too easy to satisfy. "Find what's wrong" puts it in a different frame and catches things the completing agent missed.
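A sketch of what that second pass can look like. Here `run_agent` is a stand-in for whatever client you use to call the model; the adversarial framing of the prompt is the part that matters.

```python
ADVERSARIAL_REVIEW_PROMPT = """
Here is an implementation another agent just completed:

{diff}

Find everything wrong with it. Do not confirm that it is correct; assume it
contains at least one defect and list candidates, ordered by severity.
"""

def adversarial_review(run_agent, diff: str) -> str:
    # Second instance, fresh context, adversarial frame.
    return run_agent(ADVERSARIAL_REVIEW_PROMPT.format(diff=diff))
```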

The deeper issue with autonomous trust

Most of this comes down to a calibration problem between autonomy and trust. The reason to run autonomous agents is to get tasks done without having to supervise each step. But the reason you need to supervise is that the agents can be confidently wrong.

The resolution isn't less autonomy — it's smarter checkpoints. Figure out which tasks have low cost-of-failure (write a draft, generate options, research a question) and let those run fully autonomous. Figure out which tasks have high cost-of-failure (merge to production, send to users, charge a card) and build mandatory human or automated verification into the loop.
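In practice the routing can be as blunt as a risk tier on each task type. A minimal sketch; the tiers mirror the examples above and the names are ours:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"    # draft a doc, generate options, research a question
    HIGH = "high"  # merge to production, send to users, charge a card

def requires_verification(risk: Risk) -> bool:
    # Low-cost failures run fully autonomously; high-cost ones always pass
    # through a human or automated verification gate before they land.
    return risk is Risk.HIGH
```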

The mistake we see teams make is applying the same trust level across all task types. They give agents full autonomy on high-stakes tasks because they trust the agent on low-stakes tasks, and the agent hasn't failed yet. Then it fails on something that matters.

Agents aren't accountable in the way humans are

There's a subtler thing happening here too. When a human engineer ships a bug, there's social weight to it. They know they made a mistake. They care about it. They're motivated not to repeat it.

Agents don't carry that forward. Each session, each task, the agent is fresh. There's no accumulated embarrassment about the last thing it got wrong. No motivation to be more careful this time because of what happened last time.

This is mostly fine — it means agents don't develop bad habits or defensiveness. But it also means your accountability loop can't rely on the agent learning from failure. It has to be structural. Built into the process. Not contingent on the agent caring.

The mental model shift is: stop thinking of agents as team members who can be trusted to flag their own uncertainty and start thinking of them as very fast, very confident external contractors who need clear deliverable specs and explicit verification before payment.

When you frame it that way, the right process design becomes obvious.
