Why Agents Fail in Production
Most agent failures aren't model failures — they're orchestration failures. The model generates reasonable outputs. The problem is what happens between generations: state management, error recovery, context window overflow.
I spent 90 days testing three agent architectures in production. Here's what actually happened.
The orchestration problem
When you run multiple agents in sequence — one generating a plan, another executing it, a third verifying the result — state management becomes the dominant engineering challenge. Not because it's conceptually hard, but because the failure modes are invisible until production.
Consider a three-agent pipeline:
- Planner generates a task decomposition
- Executor runs each subtask against the codebase
- Verifier checks the output against the original intent
The Executor modifies shared state as it works. If step 2.3 of 5 fails, you need to roll back steps 2.1 and 2.2 before retrying. Without checkpoints, you restart from step 1 — regenerating the entire plan, which may now differ because the model is stochastic.
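The three roles can be pinned down as interfaces before worrying about orchestration. A minimal sketch, where the `Step`/`Plan` dataclasses and the method signatures are assumptions for illustration, not from any specific framework:

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Step:
    description: str


@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)


class Planner(Protocol):
    async def generate(self, task: str) -> Plan: ...


class Executor(Protocol):
    async def run(self, step: Step) -> str: ...


class Verifier(Protocol):
    async def check(self, result: str, step: Step) -> bool: ...
```

Keeping the interfaces this small is what makes the agents swappable later; everything stateful lives in the orchestration layer, not in the agents.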
The naive approach
```python
async def run_pipeline(task):
    plan = await planner.generate(task)
    for step in plan.steps:
        result = await executor.run(step)
        if not await verifier.check(result, step):
            return await run_pipeline(task)  # full restart
```

This works in demos. In production with real codebases, a full restart costs 30–90 seconds. Over 1,000 pipeline runs, we saw 340 full restarts — that's ~5 hours of wasted compute on restarts alone.
The fix: explicit state checkpoints
Snapshot the codebase state before each step. Roll back to the last good snapshot on failure:
```python
async def run_pipeline(task):
    plan = await planner.generate(task)
    for i, step in enumerate(plan.steps):
        checkpoint = state.snapshot()
        result = await executor.run(step)
        if not await verifier.check(result, step):
            state.restore(checkpoint)
            result = await executor.run(step, retry=True)
        if i % 3 == 0:
            state.persist()  # durable checkpoint every 3 steps
```

The cost of a checkpoint is ~200ms. The cost of a full restart is 30–90 seconds.
Results
| Metric | Without checkpoints | With checkpoints | Change |
|---|---|---|---|
| Mean completion | 45s | 38s | −16% |
| p95 completion | 180s | 52s | −71% |
| Full restarts | 340 / 1,000 | 12 / 1,000 | −96% |
| Compute cost (per 1,000 runs) | $47.20 | $31.80 | −33% |
The p95 improvement matters most. 180 seconds to 52 seconds — the difference between "the agent seems stuck" and "the agent recovered gracefully."
Checkpoints are the cheapest insurance in agent orchestration. The 200ms overhead per step is invisible. The 30–90s restart cost when you don't have them is not.
What we got wrong
Three things we didn't anticipate:
- Checkpoint size grows. After 50 steps, accumulated snapshots consumed 2GB. We added a sliding window — keep the last 5, evict older ones. Memory dropped 80%.
- Stochastic plans break deterministic rollback. When you restore state and retry, the model may generate a different plan. Fix: retry with temperature 0.
- The verifier is the bottleneck. Verification took 40% of total pipeline time. We moved to async verification — verify step N while executing step N+1. Throughput doubled.
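The sliding-window eviction from the first point fits in a few lines over `collections.deque`. A sketch — the `SlidingWindow` name matches the class used in the architecture below, but this implementation is an assumption:

```python
from collections import deque


class SlidingWindow:
    """Keep the most recent max_size checkpoints; evict the oldest."""

    def __init__(self, max_size=5):
        # deque with maxlen drops the oldest entry automatically on append
        self._items = deque(maxlen=max_size)

    def save(self, checkpoint):
        self._items.append(checkpoint)

    def latest(self):
        return self._items[-1]

    def __len__(self):
        return len(self._items)
```

With `max_size=5` and per-step snapshots, memory is bounded regardless of plan length, which is what eliminated the 2GB accumulation.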
The architecture that worked
```python
class CheckpointedPipeline:
    def __init__(self, planner, executor, verifier, state):
        self.planner = planner
        self.executor = executor
        self.verifier = verifier
        self.state = state
        self.checkpoints = SlidingWindow(max_size=5)

    async def run(self, task: str) -> Result:
        plan = await self.planner.generate(task)
        pending, prev_checkpoint = None, None
        for i, step in enumerate(plan.steps):
            checkpoint = self.state.snapshot()
            self.checkpoints.save(checkpoint)
            result = await self.executor.run(step)
            # step i-1's verification ran while step i executed; await it now
            if pending is not None and not await pending:
                # roll back to the snapshot taken before the failed step,
                # redo it deterministically, then redo the current step
                self.state.restore(prev_checkpoint)
                await self.executor.run(plan.steps[i - 1], temp=0)
                result = await self.executor.run(step)
            pending = self.verifier.check_async(result, step)
            prev_checkpoint = checkpoint
        return await pending
```

Under 50 lines. It handles checkpointing, sliding-window eviction, async verification, and deterministic retry. The entire orchestration layer is smaller than most configuration files.
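The execution/verification overlap can be demonstrated in isolation: verification of step N runs while step N+1 executes, so its latency disappears for every step after the first. A toy sketch with simulated latencies; the stub agents and timings here are assumptions:

```python
import asyncio
import time


async def execute(step):
    await asyncio.sleep(0.1)  # simulated executor latency
    return f"result-{step}"


async def verify(result):
    await asyncio.sleep(0.1)  # simulated verifier latency
    return True


async def sequential(steps):
    for s in steps:
        r = await execute(s)
        await verify(r)  # executor idles during every verification


async def overlapped(steps):
    pending = None
    for s in steps:
        exec_task = asyncio.ensure_future(execute(s))
        if pending is not None:
            await pending  # verify step N while step N+1 executes
        r = await exec_task
        pending = asyncio.ensure_future(verify(r))
    if pending is not None:
        await pending


async def main():
    steps = list(range(5))
    t0 = time.perf_counter()
    await sequential(steps)
    t1 = time.perf_counter()
    await overlapped(steps)
    t2 = time.perf_counter()
    print(f"sequential {t1 - t0:.2f}s, overlapped {t2 - t1:.2f}s")


asyncio.run(main())
```

With equal execute and verify latencies the speedup approaches 2×, which matches the throughput doubling reported above; with a cheaper verifier the gain is proportionally smaller.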
Good infrastructure disappears. If your orchestration code is longer than your prompt engineering, something is wrong.
Next in this series: evaluation without ground truth.