Why Agents Fail in Production
Most agent failures aren't model failures — they're orchestration failures. The model generates reasonable outputs. The problem is what happens between generations: state management, error recovery, context window overflow.
I spent 90 days testing three agent architectures in production. Here's what actually happened.
The orchestration problem
When you run multiple agents in sequence — one generating a plan, another executing it, a third verifying the result — state management becomes the dominant engineering challenge. Not because it's conceptually hard, but because the failure modes are invisible until production.
Consider a three-agent pipeline:
- Planner generates a task decomposition
- Executor runs each subtask against the codebase
- Verifier checks the output against the original intent
The Executor modifies shared state as it works. If step 2.3 of 5 fails, you need to roll back steps 2.1 and 2.2 before retrying. Without checkpoints, you restart from step 1 — regenerating the entire plan, which may now differ because the model is stochastic.
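The three roles can be pinned down as interfaces before worrying about orchestration. A minimal sketch, where the `Step`/`Plan` dataclasses and the method signatures are assumptions for illustration, not from any specific framework:

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Step:
    description: str


@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)


class Planner(Protocol):
    async def generate(self, task: str) -> Plan: ...


class Executor(Protocol):
    async def run(self, step: Step) -> str: ...


class Verifier(Protocol):
    async def check(self, result: str, step: Step) -> bool: ...
```

Keeping the interfaces this small is what makes the agents swappable later; everything stateful lives in the orchestration layer, not in the agents.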
The naive approach
```python
async def run_pipeline(task):
    plan = await planner.generate(task)
    for step in plan.steps:
        result = await executor.run(step)
        if not await verifier.check(result, step):
            return await run_pipeline(task)  # full restart
```

This works in demos. In production with real codebases, a full restart costs 30–90 seconds. Over 1,000 pipeline runs, we saw 340 full restarts — that's ~5 hours of wasted compute on restarts alone.
The fix: explicit state checkpoints
Snapshot the codebase state before each step. Roll back to the last good snapshot on failure:
```python
async def run_pipeline(task):
    plan = await planner.generate(task)
    for i, step in enumerate(plan.steps):
        checkpoint = state.snapshot()
        result = await executor.run(step)
        if not await verifier.check(result, step):
            state.restore(checkpoint)
            result = await executor.run(step, retry=True)
        if i % 3 == 0:
            state.persist()  # durable checkpoint every 3 steps
```

The cost of a checkpoint is ~200ms. The cost of a full restart is 30–90 seconds.
Results
| Metric | Without checkpoints | With checkpoints | Change |
|---|---|---|---|
| Mean completion | 45s | 38s | −16% |
| p95 completion | 180s | 52s | −71% |
| Full restarts | 340 / 1,000 | 12 / 1,000 | −96% |
| Compute cost (per 1,000 runs) | $47.20 | $31.80 | −33% |
The p95 improvement matters most. 180 seconds to 52 seconds — the difference between "the agent seems stuck" and "the agent recovered gracefully."
Checkpoints are the cheapest insurance in agent orchestration. The 200ms overhead per step is invisible. The 30–90s restart cost when you don't have them is not.
What we got wrong
Three things we didn't anticipate:
- Checkpoint size grows. After 50 steps, accumulated snapshots consumed 2GB. We added a sliding window — keep the last 5, evict older ones. Memory dropped 80%.
- Stochastic plans break deterministic rollback. When you restore state and retry, the model may generate a different plan. Fix: retry with temperature 0.
- The verifier is the bottleneck. Verification took 40% of total pipeline time. We moved to async verification — verify step N while executing step N+1. Throughput doubled.
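The sliding-window eviction from the first point fits in a few lines over `collections.deque`. A sketch — the `SlidingWindow` name matches the class used in the architecture below, but this implementation is an assumption:

```python
from collections import deque


class SlidingWindow:
    """Keep the most recent max_size checkpoints; evict the oldest."""

    def __init__(self, max_size=5):
        # deque with maxlen drops the oldest entry automatically on append
        self._items = deque(maxlen=max_size)

    def save(self, checkpoint):
        self._items.append(checkpoint)

    def latest(self):
        return self._items[-1]

    def __len__(self):
        return len(self._items)
```

With `max_size=5` and per-step snapshots, memory is bounded regardless of plan length, which is what eliminated the 2GB accumulation.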
The architecture that worked
```python
class CheckpointedPipeline:
    def __init__(self, planner, executor, verifier, state):
        self.planner = planner
        self.executor = executor
        self.verifier = verifier
        self.state = state
        self.checkpoints = SlidingWindow(max_size=5)

    async def run(self, task: str) -> Result:
        plan = await self.planner.generate(task)
        pending, prev_checkpoint = None, None
        for i, step in enumerate(plan.steps):
            checkpoint = self.state.snapshot()
            self.checkpoints.save(checkpoint)
            result = await self.executor.run(step)
            # step i-1's verification ran while step i executed; await it now
            if pending is not None and not await pending:
                # roll back to the snapshot taken before the failed step,
                # redo it deterministically, then redo the current step
                self.state.restore(prev_checkpoint)
                await self.executor.run(plan.steps[i - 1], temp=0)
                result = await self.executor.run(step)
            pending = self.verifier.check_async(result, step)
            prev_checkpoint = checkpoint
        return await pending
```

Under 50 lines. It handles checkpointing, sliding-window eviction, async verification, and deterministic retry. The entire orchestration layer is smaller than most configuration files.
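The execution/verification overlap can be demonstrated in isolation: verification of step N runs while step N+1 executes, so its latency disappears for every step after the first. A toy sketch with simulated latencies; the stub agents and timings here are assumptions:

```python
import asyncio
import time


async def execute(step):
    await asyncio.sleep(0.1)  # simulated executor latency
    return f"result-{step}"


async def verify(result):
    await asyncio.sleep(0.1)  # simulated verifier latency
    return True


async def sequential(steps):
    for s in steps:
        r = await execute(s)
        await verify(r)  # executor idles during every verification


async def overlapped(steps):
    pending = None
    for s in steps:
        exec_task = asyncio.ensure_future(execute(s))
        if pending is not None:
            await pending  # verify step N while step N+1 executes
        r = await exec_task
        pending = asyncio.ensure_future(verify(r))
    if pending is not None:
        await pending


async def main():
    steps = list(range(5))
    t0 = time.perf_counter()
    await sequential(steps)
    t1 = time.perf_counter()
    await overlapped(steps)
    t2 = time.perf_counter()
    print(f"sequential {t1 - t0:.2f}s, overlapped {t2 - t1:.2f}s")


asyncio.run(main())
```

With equal execute and verify latencies the speedup approaches 2×, which matches the throughput doubling reported above; with a cheaper verifier the gain is proportionally smaller.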
Good infrastructure disappears. If your orchestration code is longer than your prompt engineering, something is wrong.
Next in this series: evaluation without ground truth.