How to Build AI Systems That Correct Themselves
You have built multi-step AI workflows. You know they work — when you are there to steer them. The question is: when can you let go, and what breaks when you do? This path teaches you how to design loops that catch their own mistakes, how to write the evaluations that make autonomy possible, and how to decide when a human still needs to be in the chair.
Who this is for
You have used AI agents for complex, multi-step tasks — not just single prompts, but chains of operations that transform, structure, and validate information across multiple passes. Maybe you have built workflows that process documents, generate structured outputs, or orchestrate several AI calls in sequence. You have noticed that the results get better with each iteration when you are guiding the process. But you have also noticed that when you step away, things drift.
What this path assumes
You understand the basics of llm-pipelines, prompt-chaining, and orchestration. You have read (or could read) How to Think About AI Systems and How to Talk to AI. This path builds on those foundations and goes deeper into iteration, evaluation, and autonomy.
Part 1 — Your instinct is right (and also wrong)
If you have built a complex AI workflow and thought “this would fall apart without me watching it” — your instinct is grounded in mathematics.
The compound reliability problem
Every step in a multi-step system has a probability of success. When steps execute in sequence, the overall success rate is the product of every individual rate. This is known as Lusser’s Law, originally from reliability engineering:1
graph LR S1[Step 1<br/>95%] --> S2[Step 2<br/>95%] S2 --> S3[Step 3<br/>95%] S3 --> S4[Step 4<br/>95%] S4 --> S5[Step 5<br/>95%] S5 --> R[Overall<br/>77%] style R fill:#d9534f,color:#fff
Five steps at 95% reliability each give you 77% overall. Ten steps give you 60%. Twenty steps give you 36%. A multi-agent study analysing 1,642 execution traces across seven frameworks found failure rates between 41% and 87%, with coordination breakdowns accounting for 37% of all failures.2
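To see the arithmetic for yourself, here is a minimal sketch of the compound-reliability calculation. The 95% per-step figure is the same illustrative value used above.

```python
# Compound reliability (Lusser's Law): overall success is the product
# of the per-step success rates.
def pipeline_reliability(step_rates):
    overall = 1.0
    for rate in step_rates:
        overall *= rate
    return overall

for n_steps in (5, 10, 20):
    r = pipeline_reliability([0.95] * n_steps)
    print(f"{n_steps:>2} steps at 95% each -> {r:.0%} overall")
# Prints roughly: 77%, 60%, 36%
```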
This is why your diploma-processing workflow — with its organisation extraction, competency breakdown, exam analysis, module generation, and backpropagation — feels fragile without you. Each step introduces a small probability of drift, and those probabilities multiply.
But the answer is not “always supervise”
The compound reliability problem does not mean autonomy is impossible. It means unchecked autonomy is dangerous. The solution is not to stay in the loop forever — it is to build loops that check themselves. That is what evaluations are for.
The core insight
An autonomous agent is not one that runs without oversight. It is one that carries its own oversight with it — in the form of evaluations that detect when things go wrong, criteria that define what “right” looks like, and stopping conditions that prevent runaway drift.
Part 2 — The anatomy of an agentic loop
A loop is what separates a pipeline from an agent. A pipeline moves forward: step 1 to step 2 to step 3, never looking back. An agent cycles: it acts, observes the result, judges whether the result is good enough, and either moves forward or tries again.3
graph TD D[Directive] --> P[Plan] P --> E[Execute] E --> O[Observe Result] O --> J{Evaluate} J -->|Pass| N[Next Task] J -->|Fail| R[Revise] R --> E N --> P style J fill:#4a9ede,color:#fff style R fill:#f0ad4e,color:#fff
The four phases
Every agentic loop, regardless of implementation, follows this cycle:
| Phase | What happens | Key question |
|---|---|---|
| Plan | The agent decides what to do next | What is the highest-priority task? |
| Execute | The agent takes action (generates text, calls tools, writes files) | Did the action complete? |
| Evaluate | The agent (or a separate evaluator) judges the output | Does this meet the acceptance criteria? |
| Revise or advance | Based on the evaluation, loop back or move forward | Try again, adjust approach, or proceed? |
The evaluator-optimiser pattern is the simplest form of this: one LLM generates, another evaluates, the generator revises. But real-world loops are more nuanced. The evaluation might be code-based (did the JSON parse?), model-based (does this summary capture the key claims?), or environment-based (did the tests pass?).4
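To make the cycle concrete, here is a minimal sketch of an evaluator-optimiser loop. The generate and evaluate functions are placeholders for your own model calls and graders, not the API of any particular library.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    feedback: str

def generate(task: str, feedback: str | None = None) -> str:
    """Placeholder: call your generator model, optionally with revision feedback."""
    raise NotImplementedError

def evaluate(task: str, output: str) -> Verdict:
    """Placeholder: code-based check, LLM judge, or test run against the acceptance criteria."""
    raise NotImplementedError

def refine(task: str, max_attempts: int = 3) -> str:
    feedback = None
    for _ in range(max_attempts):
        output = generate(task, feedback)   # Execute
        verdict = evaluate(task, output)    # Evaluate
        if verdict.passed:
            return output                   # Advance
        feedback = verdict.feedback         # Revise: feed the critique back in
    raise RuntimeError("Eval not satisfied within budget -- escalate to a human")
```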
Inner loops and outer loops
Complex systems often have nested loops. An inner loop refines a single output (rewrite this paragraph until it meets the rubric). An outer loop manages the overall workflow (process each module, then validate the full set for consistency).5
graph TD OL[Outer Loop: Process All Modules] --> M1[Module 1] OL --> M2[Module 2] OL --> M3[Module N] M1 --> IL1[Inner Loop: Refine Until Pass] M2 --> IL2[Inner Loop: Refine Until Pass] M3 --> IL3[Inner Loop: Refine Until Pass] IL1 --> V[Cross-Module Validation] IL2 --> V IL3 --> V V -->|Inconsistency| OL style V fill:#4a9ede,color:#fff style OL fill:#5cb85c,color:#fff
Think of it like...
An inner loop is a writer revising a single chapter until it reads well. An outer loop is an editor reviewing the entire book for consistency after all chapters are written. The editor might send a chapter back for revision if it contradicts another — triggering the inner loop again.
Your diploma workflow is a natural fit for this pattern. Each module sheet could be refined in an inner loop (does it align with the directive? does it cover the competencies?). The outer loop validates cross-module consistency (do all modules together cover the full directive? are there gaps or overlaps?).
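A rough sketch of that nesting, assuming hypothetical refine_module and check_cross_module_consistency helpers you would implement yourself:

```python
def refine_module(spec, extra_feedback=None):
    """Placeholder inner loop: refine one module sheet until its own evals pass."""
    raise NotImplementedError

def check_cross_module_consistency(modules):
    """Placeholder outer-loop eval: return {module_name: issue} for gaps or overlaps, else {}."""
    raise NotImplementedError

def process_all_modules(module_specs, max_outer_passes=2):
    # Inner loops: refine each module independently until it passes
    modules = {name: refine_module(spec) for name, spec in module_specs.items()}
    # Outer loop: validate the full set, sending offenders back to the inner loop
    for _ in range(max_outer_passes):
        issues = check_cross_module_consistency(modules)
        if not issues:
            return modules
        for name, issue in issues.items():
            modules[name] = refine_module(module_specs[name], extra_feedback=issue)
    raise RuntimeError("Cross-module inconsistencies remain -- escalate to a human")
```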
Part 3 — Why loops fail: the three types of drift
A loop without good evaluations is worse than no loop at all — it iterates confidently in the wrong direction. Research identifies three types of drift that degrade agent performance over extended interactions:6
Semantic drift
The agent’s output gradually deviates from the original intent. Each iteration is only slightly different from the last, but over many cycles the cumulative shift is significant. This is the most common form of drift in text-heavy workflows.
Consider your competency extraction step. The directive says “the candidate demonstrates mastery of regulatory frameworks.” After several refinement passes, the agent might soften this to “the candidate is familiar with regulatory considerations” — each pass slightly broadening the language, none flagged as wrong in isolation, but the final output no longer means what the source said.
Coordination drift
In multi-agent systems, different agents develop inconsistent internal models. The agent processing organisation documents uses one interpretation of a term; the agent generating exam questions uses another. Without a shared reference, their outputs diverge.2
Behavioural drift
The agent develops unintended strategies through iteration. A common example: an agent tasked with “improve this text” learns that adding qualifiers and hedging language reduces negative feedback from the evaluator — so it adds more with each pass, producing increasingly verbose and vague output.6
The key takeaway
Drift is not a bug in any single step. It is an emergent property of iteration without anchoring. The antidote is evaluation — but not just any evaluation. The eval must measure the right thing, at the right granularity, against a stable reference.
Part 4 — Evaluations: the foundation of autonomy
Here is the uncomfortable truth about agentic systems: the eval is more important than the agent. A mediocre agent with excellent evals will improve. An excellent agent with poor evals will drift. Hamel Husain, one of the leading practitioners in this space, puts it bluntly: if you are not investing in evals, you are not doing AI engineering.7
What is an eval?
An eval is a test for AI output. Just as software engineers write unit tests to verify code behaviour, AI engineers write evals to verify model output. The difference is that AI output is probabilistic — the same input can produce different outputs — so evals must be designed for variability.8
graph LR I[Input] --> A[Agent] A --> O[Output] O --> E{Eval} E -->|Pass| S[Ship / Proceed] E -->|Fail| F[Diagnose & Fix] R[Reference / Rubric] --> E style E fill:#4a9ede,color:#fff style R fill:#5cb85c,color:#fff
The three grading methods
Anthropic’s guide to agent evaluations identifies three approaches, each with different trade-offs:4
| Method | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Code-based | String matching, regex, JSON validation, binary pass/fail | Fast, cheap, reproducible, objective | Only works for structured or verifiable outputs |
| Model-based | A second LLM scores the output against a rubric | Flexible, handles nuance and open-ended tasks | Non-deterministic, requires calibration |
| Human-based | A domain expert reviews and grades | Gold standard for quality | Expensive, slow, does not scale |
The best eval suites combine all three. Use code-based graders for everything you can (did the JSON parse? are all required fields present? does the word count fall within range?). Use model-based graders for semantic quality (does this summary accurately reflect the source?). Reserve human grading for calibrating the model-based graders.4
Think of it like...
Code-based evals are a spell checker — fast, mechanical, catches obvious errors. Model-based evals are a peer review — subjective but informed. Human evals are the editor-in-chief — the final authority, but you cannot afford to have them review every paragraph.
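Here is a minimal sketch of a code-based grader for a structured output. The required fields and limits are illustrative assumptions, not a standard schema.

```python
import json

REQUIRED_FIELDS = {"organisation", "competencies", "modules"}  # illustrative schema

def code_based_grade(raw_output: str) -> dict:
    # Binary, reproducible checks: did it parse, are the fields there, is the count sane?
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"json_parses": False, "passed": False}
    checks = {
        "json_parses": True,
        "required_fields_present": REQUIRED_FIELDS <= set(data),
        "competency_count_in_range": 1 <= len(data.get("competencies", [])) <= 10,
    }
    checks["passed"] = all(checks.values())
    return checks

print(code_based_grade('{"organisation": "X", "competencies": ["a", "b"], "modules": []}'))
# {'json_parses': True, 'required_fields_present': True,
#  'competency_count_in_range': True, 'passed': True}
```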
Writing good evals: the practical checklist
The most common mistake is writing evals that are too vague. “Is this output good?” is not an eval. “Does this output contain exactly three competency areas, each with a description of at least two sentences, that directly map to sections 4.1-4.3 of the source directive?” — that is an eval.7
Start with failures. Collect 20-50 examples of real outputs that went wrong. For each, write an eval that would have caught it. This gives you an eval suite grounded in actual failure modes, not hypothetical ones.4
Grade outputs, not paths. Check what the agent produced, not how it got there. An eval that verifies the agent called specific tools in a specific order is too rigid and will break when the agent finds a better path.4
Give judges an escape hatch. When using model-based grading, include an “Unknown / Cannot determine” option. This prevents the judge from guessing when the answer is genuinely ambiguous.4
Calibrate with two experts. Before deploying a model-based grader, have two domain experts independently grade the same set of outputs. If they disagree frequently, the eval criteria are too ambiguous — tighten them before automating.7
Use pass@k and pass^k metrics. Run the agent k times on the same input. pass@k (succeeds at least once in k attempts) measures capability. pass^k (succeeds every time) measures reliability. The gap between them tells you how much variance your system has.4
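A minimal sketch of how both metrics fall out of k repeated runs; run_once is a placeholder for executing the agent once and grading the result.

```python
def run_once(task) -> bool:
    """Placeholder: run the agent on `task` once and return whether the eval passed."""
    raise NotImplementedError

def pass_metrics(task, k: int = 5) -> dict:
    results = [run_once(task) for _ in range(k)]
    return {
        "pass@k": any(results),          # capability: succeeded at least once
        "pass^k": all(results),          # reliability: succeeded every single time
        "pass_rate": sum(results) / k,   # the gap between the two is your variance signal
    }
```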
Part 5 — Eval-driven development: evals before agents
There is a growing practice called eval-driven development (EDD) that inverts the typical build sequence. Instead of building the agent first and testing it later, you write the evals first and build the agent to pass them — the same philosophy as test-driven development, adapted for probabilistic systems.9
graph TD D[Define Success Criteria] --> W[Write Evals] W --> B[Build Agent] B --> R[Run Evals] R -->|Fail| I[Improve Agent or Prompt] I --> R R -->|Pass| S[Ship] S --> M[Monitor in Production] M -->|New Failure| W style W fill:#4a9ede,color:#fff style D fill:#5cb85c,color:#fff
Why evals first?
When you write evals before the agent, you are forced to define what “good” looks like before you start building. This prevents the most common failure mode in AI projects: building something that seems to work on three examples and then discovering it fails on the fourth.9
Red Hat describes an eight-stage maturity model for eval-driven development, from manual testing with predefined conversations to full CI/CD integration with continuous monitoring. The critical insight: teams that start with evals ship faster, because every improvement is measurable.10
Applying EDD to your workflow
Consider your diploma processing system. Before building (or refactoring) the competency extraction step, you would:
- Define success: “Given directive section 4, the output must list exactly the competency domains named in the text, with no additions, omissions, or paraphrasing of domain names.”
- Write the eval: A code-based check (correct count, exact domain names) plus a model-based check (does each description accurately reflect the source paragraph?).
- Build the agent step to pass these evals.
- Run on 20+ directives to verify consistency.
If you later change the prompt or the model, the evals catch regressions immediately. This is what makes refactoring safe in AI systems.7
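A sketch of the code-based half of that eval. The expected domain names below are invented for illustration; in practice they would be parsed from the directive itself.

```python
def grade_competency_extraction(extracted: list[dict], expected_domains: list[str]) -> dict:
    names = [item["domain"] for item in extracted]
    return {
        "correct_count": len(names) == len(expected_domains),
        "exact_names_no_paraphrase": sorted(names) == sorted(expected_domains),
        "no_duplicates": len(set(names)) == len(names),
    }

# Illustrative only -- in practice, expected_domains comes from directive section 4
expected = ["Regulatory frameworks", "Risk assessment", "Stakeholder communication"]
output = [{"domain": "Regulatory frameworks", "description": "..."},
          {"domain": "Risk assessment", "description": "..."},
          {"domain": "Stakeholder communication", "description": "..."}]
print(grade_competency_extraction(output, expected))  # every check True
```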
Rule of thumb
If you cannot write an eval for a step, you do not understand the step well enough to automate it. The act of writing the eval forces you to articulate what “right” looks like — and that articulation is often the hardest and most valuable part of the work.
Part 6 — The autonomy spectrum: when to let go
Not every loop should be autonomous. The decision depends on the cost of errors, the quality of your evals, and the complexity of the judgement required. Research on AI agent autonomy identifies five levels, calibrated by the human’s role:11
graph LR L1[Level 1<br/>Human as<br/>Operator] --> L2[Level 2<br/>Human as<br/>Collaborator] L2 --> L3[Level 3<br/>Human as<br/>Consultant] L3 --> L4[Level 4<br/>Human as<br/>Approver] L4 --> L5[Level 5<br/>Human as<br/>Observer] style L1 fill:#d9534f,color:#fff style L2 fill:#f0ad4e,color:#fff style L3 fill:#4a9ede,color:#fff style L4 fill:#5cb85c,color:#fff style L5 fill:#9b59b6,color:#fff
| Level | Human role | Agent role | When to use |
|---|---|---|---|
| 1. Operator | Drives every step | Provides on-demand support | Learning the task; no evals yet |
| 2. Collaborator | Plans and reviews together | Executes and proposes | Building evals; understanding failure modes |
| 3. Consultant | Reviews when asked | Leads planning and execution | Evals cover most cases; human handles edge cases |
| 4. Approver | Approves only at checkpoints | Runs autonomously between checkpoints | Strong evals; low-risk intermediate steps |
| 5. Observer | Monitors dashboards | Fully autonomous | Mature evals; reversible actions; proven track record |
The practical progression looks like this: you start at Level 1 (operating the workflow manually). As you identify patterns, you encode them as evals and move to Level 2 (collaborating with the agent). As your eval suite matures, you grant more autonomy — but only for steps where the evals are strong enough to catch failures.11
The human-on-the-loop pattern
The transition from human-in-the-loop (approving every decision) to human-on-the-loop (monitoring and intervening on exceptions) is where most productive agentic systems live. The agent runs autonomously, but the human watches a dashboard of eval results and steps in when something is flagged.12
For your diploma workflow, this might mean:
- Autonomous (Level 4-5): Extracting organisation metadata, formatting past exam questions, generating JSON structures — steps where the eval is binary (correct fields or not)
- Supervised (Level 2-3): Competency extraction, module syllabus generation, backpropagation — steps where semantic accuracy matters and context is course-specific
- Manual (Level 1): Final cross-validation of the complete vault, checking that the system as a whole is coherent
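One way to make this explicit is a per-step autonomy map that the orchestrator checks before each step runs. The step names and level assignments below are illustrative, not prescriptive.

```python
# Autonomy is granted per step, conditioned on which evals exist for that step.
AUTONOMY_MAP = {
    "extract_org_metadata":    {"level": 5, "evals": ["code"]},
    "format_exam_questions":   {"level": 4, "evals": ["code"]},
    "extract_competencies":    {"level": 3, "evals": ["code", "model"]},
    "generate_module_syllabi": {"level": 2, "evals": ["model"]},
    "final_cross_validation":  {"level": 1, "evals": ["human"]},
}

def needs_human_checkpoint(step: str) -> bool:
    # Below "Approver" (level 4), a human reviews the output before the workflow advances
    return AUTONOMY_MAP[step]["level"] < 4
```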
Progressive autonomy is not all-or-nothing
The mistake is thinking you must choose between “I do everything” and “the agent does everything.” The right design gives different levels of autonomy to different steps, based on the quality of evals available for each step. Autonomy is earned, step by step, eval by eval.
Part 7 — Designing your loop: the scaffolding decisions
Building a good agentic loop is an exercise in decomposition. You are not asking “can an agent do this entire workflow?” You are asking, for each step: “what would it take to let an agent do this unsupervised?”3
Decision framework
For each step in your workflow, answer these questions:
graph TD Q1{Can I write a<br/>code-based eval?} -->|Yes| A1[Automate with<br/>code-based check] Q1 -->|No| Q2{Can I write a<br/>model-based eval?} Q2 -->|Yes| Q3{Is the cost of<br/>a false pass low?} Q3 -->|Yes| A2[Automate with<br/>model-based eval] Q3 -->|No| A3[Human-on-the-loop<br/>with model eval as filter] Q2 -->|No| A4[Keep human<br/>in the loop] style Q1 fill:#4a9ede,color:#fff style Q2 fill:#4a9ede,color:#fff style Q3 fill:#4a9ede,color:#fff
Anchoring against drift
Every loop needs an anchor — a stable reference that prevents semantic drift across iterations. Without an anchor, each revision uses the previous revision as its reference, and small shifts accumulate.6
Effective anchoring strategies:
- Source documents as context: Always include the original directive (or relevant section) in every evaluation prompt, not just the agent’s previous output
- Frozen criteria: Define acceptance criteria once, store them in a file, and never let the agent modify them
- Episodic checkpoints: After every N iterations, compare the current output to the original input, not to the previous iteration
- context-cascading: Structure your context so that foundational rules appear at the top of every prompt and are never compressed away
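A minimal sketch of the first two strategies (source-as-context and frozen criteria) applied to an evaluation prompt; the wording and rubric are illustrative assumptions.

```python
FROZEN_CRITERIA = """\
- Domain names must match the directive exactly (no paraphrasing).
- Every claim must be traceable to a named directive section.
"""

def build_eval_prompt(directive_section: str, current_output: str) -> str:
    # The original source is included in full on every pass -- the anchor --
    # so each evaluation compares against the directive, not the previous revision.
    return (
        "You are grading an extraction against its source.\n\n"
        f"SOURCE DIRECTIVE (authoritative, never reinterpret):\n{directive_section}\n\n"
        f"ACCEPTANCE CRITERIA (frozen):\n{FROZEN_CRITERIA}\n"
        f"CURRENT OUTPUT:\n{current_output}\n\n"
        "Answer Pass, Fail, or Unknown, with a one-sentence justification."
    )
```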
Stop conditions
Every loop must have a hard stop. Without one, a loop that cannot satisfy its eval will iterate forever, consuming tokens and potentially degrading output. Anthropic’s guidance is explicit: always define maximum iterations and have the agent escalate to a human when the limit is reached.3
Typical stop conditions:
- Maximum 3 revision attempts per output
- If eval score decreases between iterations, stop and escalate
- If token budget exceeds threshold, checkpoint and pause
- If eval passes, advance immediately (do not over-refine)
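A sketch of those stop conditions wrapped around a revision loop. The revise and score functions are placeholders for your generator and eval, and the threshold is illustrative; a token-budget guard would follow the same shape.

```python
PASS_THRESHOLD = 0.8  # illustrative

def revise(task, previous_output):
    """Placeholder: generate (or regenerate) the output, given the prior attempt."""
    raise NotImplementedError

def score(task, output) -> float:
    """Placeholder: your eval, returning a score in [0, 1]."""
    raise NotImplementedError

def refine_with_stops(task, max_attempts=3):
    best, output = None, None
    for _ in range(max_attempts):
        output = revise(task, output)
        current = score(task, output)
        if current >= PASS_THRESHOLD:
            return output                  # pass: advance immediately, do not over-refine
        if best is not None and current < best:
            raise RuntimeError("Eval score decreased between iterations -- stop and escalate")
        best = current
    raise RuntimeError("Maximum revision attempts reached -- escalate to a human")
```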
The shift-worker pattern
For long-running, complex workflows (like your diploma vault), Anthropic recommends treating agents like shift workers. Each session starts by reading a progress file (what has been done, what is next), works on a single well-scoped task, commits its work with descriptive metadata, and updates the progress file for the next session.13
This pattern prevents the context degradation that occurs in very long sessions (65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning).14 It also makes the workflow resumable and auditable — every shift’s work is a discrete, reviewable unit.
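A minimal sketch of the shift-worker handoff, assuming a simple JSON progress file; the file name and schema are illustrative.

```python
import json
from pathlib import Path

PROGRESS_FILE = Path("progress.json")  # illustrative name and schema

def start_shift() -> dict:
    # Read what previous sessions have done; start fresh if this is the first shift
    if PROGRESS_FILE.exists():
        return json.loads(PROGRESS_FILE.read_text())
    return {"completed": [], "notes": []}

def end_shift(state: dict, task: str, summary: str) -> None:
    # Record this shift's single, well-scoped unit of work for the next session
    state["completed"].append({"task": task, "summary": summary})
    state["notes"].append(summary)
    PROGRESS_FILE.write_text(json.dumps(state, indent=2))
```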
The full map
graph TD subgraph Foundation CR[Compound Reliability<br/>Why drift happens] DT[Drift Types<br/>Semantic, coordination,<br/>behavioural] end subgraph Loop Design AL[Agentic Loops<br/>Plan-Execute-Evaluate-Revise] IL[Inner and Outer Loops<br/>Nesting for complexity] end subgraph Evaluation EV[Evals<br/>Code, model, human] EDD[Eval-Driven Development<br/>Evals before agents] end subgraph Autonomy PA[Progressive Autonomy<br/>5 levels] HOTL[Human-on-the-Loop<br/>Monitor, not approve] end CR --> AL DT --> EV AL --> IL EV --> EDD EDD --> PA IL --> PA PA --> HOTL style EV fill:#4a9ede,color:#fff style EDD fill:#5cb85c,color:#fff style AL fill:#f0ad4e,color:#fff
What you should understand now
After reading this path, you should be able to:
- Explain why multi-step AI systems degrade without evaluations, and calculate the compound effect
- Describe the four phases of an agentic loop and distinguish inner loops from outer loops
- Identify the three types of drift (semantic, coordination, behavioural) and name anchoring strategies for each
- Choose the right grading method (code-based, model-based, human) for a given step in your workflow
- Write an eval that catches real failure modes, not hypothetical ones
- Apply eval-driven development: define success criteria before building the agent
- Place each step of a workflow on the autonomy spectrum, granting independence only where evals are strong enough
Check your understanding
Test yourself
- Calculate the overall reliability of a 12-step pipeline where each step has 92% reliability. What does this tell you about the need for evaluation checkpoints?
- Distinguish between semantic drift and behavioural drift. Give a concrete example of each in a document-processing workflow.
- Design an eval suite for a step that extracts exam questions from a PDF. Which grading methods would you use and why?
- Explain why eval-driven development inverts the typical build sequence, and why this matters more for AI systems than for traditional software.
- Evaluate this claim: “A fully autonomous agent is always better than a human-in-the-loop system.” Under what conditions is this true, and when is it dangerous?
Where to go next
Path A — Build your first eval suite
If you want to start writing evals for your existing workflows, begin with Hamel Husain’s Your AI Product Needs Evals and Anthropic’s Demystifying Evals for AI Agents. Then explore evaluator-optimiser for the simplest loop pattern to implement.
Path B — Understand agent architecture more deeply
If you want to understand how production agent systems are structured, read How to Think About AI Systems and follow with Anthropic’s Building Effective Agents. Then explore orchestration and playbooks-as-programs.
Path C — Explore progressive autonomy in practice
If you want to design a system that moves from supervised to autonomous, study Anthropic’s Measuring AI Agent Autonomy and explore the human-in-the-loop and guardrails concept cards.
Sources
Further reading
Resources
- Building Effective Agents (Anthropic) — The foundational guide to agentic patterns from Anthropic’s research team
- Demystifying Evals for AI Agents (Anthropic) — Practical guide to grading methods, metrics, and eval design
- Your AI Product Needs Evals (Hamel Husain) — The most practical introduction to building eval infrastructure
- LLM-as-Judge Complete Guide (Hamel Husain) — Seven-step methodology for calibrating model-based evaluators
- Product Evals in Three Simple Steps (Eugene Yan) — Concise three-step process for shipping evals
- Eval-Driven Development (Red Hat) — Eight-stage maturity model for eval-driven agent development
- Measuring AI Agent Autonomy in Practice (Anthropic) — How real-world autonomy adoption patterns emerge from millions of sessions
- Effective Harnesses for Long-Running Agents (Anthropic) — The shift-worker pattern for complex, multi-session workflows
Footnotes
1. Artiquare. (2025). Why Multi-Agent AI Fails: The 0.95^10 Problem. Artiquare. Analysis of compound reliability decay in multi-step agent systems, applying Lusser's Law.
2. Towards Data Science. (2025). The Multi-Agent Trap. Towards Data Science. Analysis of 1,642 execution traces showing failure rates and coordination breakdowns.
3. Anthropic. (2025). Building Effective Agents. Anthropic Research. Foundational guide to agentic patterns, distinguishing workflows from agents.
4. Anthropic. (2026). Demystifying Evals for AI Agents. Anthropic Engineering. Comprehensive guide to grading methods, eval design, and metrics.
5. Harrison Chase. (2025). In the Loop. LangChain Blog. Series on agent loop patterns, inner/outer loops, and LangGraph architecture.
6. Li, Z. et al. (2026). Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions. arXiv. Research paper introducing the Agent Stability Index and three-type drift taxonomy.
7. Husain, H. (2025). Your AI Product Needs Evals. hamel.dev. Practical guide to building eval infrastructure as the foundation for AI product development.
8. Husain, H. (2025). LLM-as-Judge Complete Guide. hamel.dev. Seven-step methodology for building model-based evaluators calibrated against human experts.
9. Eval-Driven Development. (2026). Eval-Driven Development Manifesto. evaldriven.org. Ten core tenets of the eval-first development methodology.
10. Red Hat. (2026). Eval-Driven Development: Build & Evaluate AI Agents. Red Hat Developers. Eight-stage maturity model for eval-driven development.
11. Huang, Y. et al. (2025). Levels of Autonomy for AI Agents. arXiv. Academic framework defining five autonomy levels by user role.
12. ByteBridge. (2025). From Human-in-the-Loop to Human-on-the-Loop. Medium. Practical spectrum from HITL to HOTL with permission scopes and fallback mechanisms.
13. Anthropic. (2026). Effective Harnesses for Long-Running Agents. Anthropic Engineering. Patterns for long-running agent sessions, including the shift-worker model.
14. CIO. (2026). Agentic AI Systems Don't Fail Suddenly — They Drift Over Time. CIO. Enterprise perspective on context drift as the leading cause of agent failure.
