April 14, 2026 · 6 min read · Henry — Kerber AI

Your fallback chain just killed your system.

Resilience engineering says: build fallbacks. If your primary system fails, fall back to something that works. Redundancy is good. Layers of safety are good.

Unless your fallback is what kills you.

We just lived through the exact scenario every distributed systems paper warns about and most teams assume won't happen to them. Twelve AI agents, a local model, a cloud fallback and a cascade that turned a minor spike into a total system collapse. The fix was counterintuitive: we removed the fallbacks entirely.

What happened, and why we now run with zero fallbacks.

The setup

We run twelve AI agents across product, engineering, marketing and operations. Every 30 minutes, each agent gets a heartbeat: a wake-up call to check its inbox, look for blockers and drive work forward.
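For a concrete picture, here's roughly what that naive schedule looks like in Python. The agent names, the run_heartbeat body and the threading details are illustrative placeholders, not our actual scheduler; the point is simply that every agent wakes up at the same instant.

import threading

AGENTS = [f"agent-{i}" for i in range(12)]   # twelve agents, placeholder names
HEARTBEAT_INTERVAL = 30 * 60                 # seconds between heartbeats

def run_heartbeat(agent: str) -> None:
    # placeholder: check inbox, look for blockers, drive work forward
    print(f"{agent}: heartbeat")

def schedule_all() -> None:
    # the naive version: every agent fires at the same instant, every 30 minutes
    for agent in AGENTS:
        threading.Thread(target=run_heartbeat, args=(agent,)).start()
    threading.Timer(HEARTBEAT_INTERVAL, schedule_all).start()

schedule_all()   # kick off the loop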

The agents run through Ollama on a local GPU. GLM-5.1, free, fast enough for routine checks. When the local model is overloaded or slow, we fall back to a cloud model (GPT-5.4) through OpenClaw's routing. The theory is simple: local-first for cost, cloud as safety net for reliability.
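Here's a minimal sketch of that routing logic, assuming Ollama's default local endpoint. The model tag is illustrative, and call_cloud_model is a hypothetical stand-in for the cloud route, not OpenClaw's actual API.

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # default Ollama endpoint
LOCAL_MODEL = "glm-5.1"                               # illustrative model tag
TIMEOUT_SECONDS = 30                                  # the timeout that mattered

def call_cloud_model(prompt: str) -> str:
    # hypothetical stand-in for the cloud route (GPT-5.4 via OpenClaw)
    raise NotImplementedError

def generate(prompt: str) -> str:
    # local-first: try Ollama, fall back to the cloud on timeout or error
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": LOCAL_MODEL, "prompt": prompt, "stream": False},
            timeout=TIMEOUT_SECONDS,
        )
        resp.raise_for_status()
        return resp.json()["response"]
    except requests.RequestException:
        # the line that couples the two systems: every local timeout
        # silently becomes a cloud request
        return call_cloud_model(prompt)

Keep an eye on that except branch. It's the whole story of what follows.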

This is the architecture that every "run AI locally" guide recommends. Local model for cheap routine work. Cloud model for when you need quality or availability. Best of both worlds.

Until it isn't.

The cascade

What happened: all twelve agents triggered their heartbeats at the same time. Not staggered. Not spread across the 30-minute window. All of them, within the same few seconds.

Twelve concurrent requests hit Ollama. The local model, running on a single GPU, couldn't process them all. Queue times spiked. Requests started timing out at 30 seconds.

And that's where the fallback made everything worse.

Timed-out local requests didn't just fail. They escalated to the cloud. So instead of twelve requests hitting Ollama slowly, we now had twelve requests hitting GPT-5.4 all at once. The cloud model has rate limits. Twelve simultaneous requests from the same account blew through them immediately.

Now both systems were failing. Ollama was overloaded. The cloud fallback was rate-limited. Agents started erroring out. The error state triggered retry logic. More requests. More timeouts. More fallbacks. More rate limits.

The system didn't degrade gracefully. It cascaded from "slightly slow" to "completely dead" in under two minutes.

Why fallbacks made it worse

The standard mental model for fallbacks assumes failures are independent. If the local model is down, the cloud is up. If the cloud is down, the local model is up. The fallback gives you a second shot.

But in this case, the fallback explicitly connected the two failure modes. Ollama failing didn't just mean "we need cloud." Ollama failing meant "we dump twelve concurrent requests onto the cloud all at once." The fallback turned a local capacity problem into a cloud capacity problem.

And because the heartbeat system automatically retries on failure, the problem fed itself. Failed request → fallback → rate limit → error → retry → more load on both systems → more failures. A feedback loop held open by the very mechanism designed to rescue us.

Same pattern that takes down power grids. A failure in one part of the grid causes automatic load redistribution to other parts. Those parts overload. More redistribution. More overloads. Cascading failure doesn't happen despite the safety mechanisms. It happens because of them.

The fix: remove the fallback

Our fix was a single line in the config:

fallbacks: []

No fallback. If the local model is busy, the request waits. If it times out, it fails. The agent goes to error state and gets retried next heartbeat cycle, 30 minutes later, when the load has cleared.
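In code, the change amounts to deleting the fallback branch from the earlier sketch (reusing the same hypothetical names, requests import and constants):

def generate_local_only(prompt: str) -> str | None:
    # no fallback: a timeout is just a failure, retried on the next heartbeat
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": LOCAL_MODEL, "prompt": prompt, "stream": False},
            timeout=TIMEOUT_SECONDS,
        )
        resp.raise_for_status()
        return resp.json()["response"]
    except requests.RequestException:
        return None   # the agent records the error and waits for the next cycle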

This sounds worse than the fallback approach. And in isolation, for a single request, it is. A failed request is worse than a request served by the cloud.

But in aggregate, for the system as a whole, it's dramatically better. A failed request takes zero resources. It doesn't cascade. It doesn't blow through rate limits on a second system. It sits there, politely, waiting to be retried when capacity exists.

We also staggered the heartbeats, spreading the twelve agents across the 30-minute window instead of firing them all at once. And we capped concurrent runs at one. Two more changes that reduce peak load at the cost of slightly longer average latency.
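A rough sketch of both changes, reusing HEARTBEAT_INTERVAL and run_heartbeat from the heartbeat sketch above; the exact offsets and the semaphore are illustrative, not our scheduler's real implementation.

import threading

MAX_CONCURRENT_RUNS = 1
run_slot = threading.Semaphore(MAX_CONCURRENT_RUNS)

def offset_for(agent_index: int, total_agents: int = 12) -> float:
    # spread the agents evenly across the 30-minute window:
    # agent 0 at :00, agent 1 at :02:30, agent 2 at :05:00, and so on
    return agent_index * (HEARTBEAT_INTERVAL / total_agents)

def guarded_heartbeat(agent: str) -> None:
    # cap concurrent runs so the single GPU never sees a thundering herd
    with run_slot:
        run_heartbeat(agent)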

Zero cascades since. System uptime went from intermittent to rock solid.

When fallbacks work and when they don't

Fallbacks aren't inherently bad. They work when failures are independent (the primary and fallback don't share capacity constraints), when load is bounded (you know the maximum that could fall back at once) and when retries are bounded (a failed fallback doesn't trigger another fallback or retry).
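To make "bounded" concrete, here's one hedged way to do it, building on the sketches above: a small budget on how many requests may fall back at once, and no retries inside the fallback path. The budget size is arbitrary here.

import threading

FALLBACK_BUDGET = threading.Semaphore(2)   # illustrative cap: at most two in-flight fallbacks

def generate_bounded(prompt: str) -> str | None:
    result = generate_local_only(prompt)
    if result is not None:
        return result
    # bounded fallback: if the budget is spent, fail instead of piling onto the cloud
    if FALLBACK_BUDGET.acquire(blocking=False):
        try:
            return call_cloud_model(prompt)
        except Exception:
            return None        # a failed fallback never triggers another fallback or retry
        finally:
            FALLBACK_BUDGET.release()
    return None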

Fallbacks kill you when failures are correlated: the same spike that overwhelms the primary also overwhelms the fallback, or, worse, the fallback connects two otherwise independent systems so the failure spreads. When peak fallback load exceeds what you designed for. When retries create feedback loops that add more load, creating more failures, triggering more retries.

In a system with automatic retries and shared capacity, a fallback isn't a safety net. It's a pathway for failure to spread.

The broader lesson for AI infrastructure

Local-first agent stacks have become standard over the past year, and most of them ship with cloud fallbacks built in by default. Ollama plus local models for routine work, cloud APIs for when you need quality. Same stack we run. Same fallback pattern that burned us.

Local models are getting good enough for most routine agent work. GLM-5.1, DeepSeek, Llama. You can run serious agent workloads on consumer hardware now. But consumer hardware has hard capacity limits. A single GPU serving twelve agents at once is a different problem entirely than serving one at a time.

If you're building a local-first AI stack (and you probably should be, given where pricing is going), design for the failure mode, not just the happy path. Stagger your workloads. Cap your concurrency. Think carefully before adding a cloud fallback that can turn a local capacity spike into a cloud rate limit violation.

Sometimes the most resilient thing you can build is a system that just fails. Fast, local, quiet. No dragging every connected system down with it.
