How to Build AI Systems That Correct Themselves
You have built multi-step AI workflows. You know they work — when you are there to steer them. The question is: when can you let go, and what breaks when you do? This path teaches you how to design loops that catch their own mistakes, how to write the evaluations that make autonomy possible, and how to decide when a human still needs to be in the chair.
Who this is for
You have used AI agents for complex, multi-step tasks — not just single prompts, but chains of operations that transform, structure, and validate information across multiple passes. Maybe you have built workflows that process documents, generate structured outputs, or orchestrate several AI calls in sequence. You have noticed that the results get better with each iteration when you are guiding the process. But you have also noticed that when you step away, things drift.
What this path assumes
You understand the basics of llm-pipelines, prompt-chaining, and orchestration. You have read (or could read) How to Think About AI Systems and How to Talk to AI. This path builds on those foundations and goes deeper into iteration, evaluation, and autonomy.
Part 1 — Your instinct is right (and also wrong)
If you have built a complex AI workflow and thought “this would fall apart without me watching it” — your instinct is grounded in mathematics.
The compound reliability problem
Every step in a multi-step system has a probability of success. When steps execute in sequence, the overall success rate is the product of every individual rate. This is known as Lusser’s Law, originally from reliability engineering:1
graph LR S1[Step 1<br/>95%] --> S2[Step 2<br/>95%] S2 --> S3[Step 3<br/>95%] S3 --> S4[Step 4<br/>95%] S4 --> S5[Step 5<br/>95%] S5 --> R[Overall<br/>77%] style R fill:#d9534f,color:#fff
Five steps at 95% reliability each give you 77% overall. Ten steps give you 60%. Twenty steps give you 36%. A multi-agent study analysing 1,642 execution traces across seven frameworks found failure rates between 41% and 87%, with coordination breakdowns accounting for 37% of all failures.2
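To see the arithmetic for yourself, here is a minimal sketch of the compound-reliability calculation. The 95% per-step figure is the same illustrative value used above.

```python
# Compound reliability (Lusser's Law): overall success is the product
# of the per-step success rates.
def pipeline_reliability(step_rates):
    overall = 1.0
    for rate in step_rates:
        overall *= rate
    return overall

for n_steps in (5, 10, 20):
    r = pipeline_reliability([0.95] * n_steps)
    print(f"{n_steps:>2} steps at 95% each -> {r:.0%} overall")
# Prints roughly: 77%, 60%, 36%
```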
This is why your diploma-processing workflow — with its organisation extraction, competency breakdown, exam analysis, module generation, and backpropagation — feels fragile without you. Each step introduces a small probability of drift, and those probabilities multiply.
But the answer is not “always supervise”
The compound reliability problem does not mean autonomy is impossible. It means unchecked autonomy is dangerous. The solution is not to stay in the loop forever — it is to build loops that check themselves. That is what evaluations are for.
The core insight
An autonomous agent is not one that runs without oversight. It is one that carries its own oversight with it — in the form of evaluations that detect when things go wrong, criteria that define what “right” looks like, and stopping conditions that prevent runaway drift.
Part 2 — The anatomy of an agentic loop
A loop is what separates a pipeline from an agent. A pipeline moves forward: step 1 to step 2 to step 3, never looking back. An agent cycles: it acts, observes the result, judges whether the result is good enough, and either moves forward or tries again.3
graph TD D[Directive] --> P[Plan] P --> E[Execute] E --> O[Observe Result] O --> J{Evaluate} J -->|Pass| N[Next Task] J -->|Fail| R[Revise] R --> E N --> P style J fill:#4a9ede,color:#fff style R fill:#f0ad4e,color:#fff
The four phases
Every agentic loop, regardless of implementation, follows this cycle:
| Phase | What happens | Key question |
|---|---|---|
| Plan | The agent decides what to do next | What is the highest-priority task? |
| Execute | The agent takes action (generates text, calls tools, writes files) | Did the action complete? |
| Evaluate | The agent (or a separate evaluator) judges the output | Does this meet the acceptance criteria? |
| Revise or advance | Based on the evaluation, loop back or move forward | Try again, adjust approach, or proceed? |
The evaluator-optimiser pattern is the simplest form of this: one LLM generates, another evaluates, the generator revises. But real-world loops are more nuanced. The evaluation might be code-based (did the JSON parse?), model-based (does this summary capture the key claims?), or environment-based (did the tests pass?).4
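To make the cycle concrete, here is a minimal sketch of an evaluator-optimiser loop. The generate and evaluate functions are placeholders for your own model calls and graders, not the API of any particular library.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    feedback: str

def generate(task: str, feedback: str | None = None) -> str:
    """Placeholder: call your generator model, optionally with revision feedback."""
    raise NotImplementedError

def evaluate(task: str, output: str) -> Verdict:
    """Placeholder: code-based check, LLM judge, or test run against the acceptance criteria."""
    raise NotImplementedError

def refine(task: str, max_attempts: int = 3) -> str:
    feedback = None
    for _ in range(max_attempts):
        output = generate(task, feedback)   # Execute
        verdict = evaluate(task, output)    # Evaluate
        if verdict.passed:
            return output                   # Advance
        feedback = verdict.feedback         # Revise: feed the critique back in
    raise RuntimeError("Eval not satisfied within budget -- escalate to a human")
```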
Inner loops and outer loops
Complex systems often have nested loops. An inner loop refines a single output (rewrite this paragraph until it meets the rubric). An outer loop manages the overall workflow (process each module, then validate the full set for consistency).5
graph TD OL[Outer Loop: Process All Modules] --> M1[Module 1] OL --> M2[Module 2] OL --> M3[Module N] M1 --> IL1[Inner Loop: Refine Until Pass] M2 --> IL2[Inner Loop: Refine Until Pass] M3 --> IL3[Inner Loop: Refine Until Pass] IL1 --> V[Cross-Module Validation] IL2 --> V IL3 --> V V -->|Inconsistency| OL style V fill:#4a9ede,color:#fff style OL fill:#5cb85c,color:#fff
Think of it like...
An inner loop is a writer revising a single chapter until it reads well. An outer loop is an editor reviewing the entire book for consistency after all chapters are written. The editor might send a chapter back for revision if it contradicts another — triggering the inner loop again.
Your diploma workflow is a natural fit for this pattern. Each module sheet could be refined in an inner loop (does it align with the directive? does it cover the competencies?). The outer loop validates cross-module consistency (do all modules together cover the full directive? are there gaps or overlaps?).
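A rough sketch of that nesting, assuming hypothetical refine_module and check_cross_module_consistency helpers you would implement yourself:

```python
def refine_module(spec, extra_feedback=None):
    """Placeholder inner loop: refine one module sheet until its own evals pass."""
    raise NotImplementedError

def check_cross_module_consistency(modules):
    """Placeholder outer-loop eval: return {module_name: issue} for gaps or overlaps, else {}."""
    raise NotImplementedError

def process_all_modules(module_specs, max_outer_passes=2):
    # Inner loops: refine each module independently until it passes
    modules = {name: refine_module(spec) for name, spec in module_specs.items()}
    # Outer loop: validate the full set, sending offenders back to the inner loop
    for _ in range(max_outer_passes):
        issues = check_cross_module_consistency(modules)
        if not issues:
            return modules
        for name, issue in issues.items():
            modules[name] = refine_module(module_specs[name], extra_feedback=issue)
    raise RuntimeError("Cross-module inconsistencies remain -- escalate to a human")
```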
Part 3 — Why loops fail: the three types of drift
A loop without good evaluations is worse than no loop at all — it iterates confidently in the wrong direction. Research identifies three types of drift that degrade agent performance over extended interactions:6
Semantic drift
The agent’s output gradually deviates from the original intent. Each iteration is only slightly different from the last, but over many cycles the cumulative shift is significant. This is the most common form of drift in text-heavy workflows.
Consider your competency extraction step. The directive says “the candidate demonstrates mastery of regulatory frameworks.” After several refinement passes, the agent might soften this to “the candidate is familiar with regulatory considerations” — each pass slightly broadening the language, none flagged as wrong in isolation, but the final output no longer means what the source said.
Coordination drift
In multi-agent systems, different agents develop inconsistent internal models. The agent processing organisation documents uses one interpretation of a term; the agent generating exam questions uses another. Without a shared reference, their outputs diverge.2
Behavioural drift
The agent develops unintended strategies through iteration. A common example: an agent tasked with “improve this text” learns that adding qualifiers and hedging language reduces negative feedback from the evaluator — so it adds more with each pass, producing increasingly verbose and vague output.6
The key takeaway
Drift is not a bug in any single step. It is an emergent property of iteration without anchoring. The antidote is evaluation — but not just any evaluation. The eval must measure the right thing, at the right granularity, against a stable reference.
Part 4 — Evaluations: the foundation of autonomy
Here is the uncomfortable truth about agentic systems: the eval is more important than the agent. A mediocre agent with excellent evals will improve. An excellent agent with poor evals will drift. Hamel Husain, one of the leading practitioners in this space, puts it bluntly: if you are not investing in evals, you are not doing AI engineering.7
What is an eval?
An eval is a test for AI output. Just as software engineers write unit tests to verify code behaviour, AI engineers write evals to verify model output. The difference is that AI output is probabilistic — the same input can produce different outputs — so evals must be designed for variability.8
graph LR I[Input] --> A[Agent] A --> O[Output] O --> E{Eval} E -->|Pass| S[Ship / Proceed] E -->|Fail| F[Diagnose & Fix] R[Reference / Rubric] --> E style E fill:#4a9ede,color:#fff style R fill:#5cb85c,color:#fff
The three grading methods
Anthropic’s guide to agent evaluations identifies three approaches, each with different trade-offs:4
| Method | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Code-based | String matching, regex, JSON validation, binary pass/fail | Fast, cheap, reproducible, objective | Only works for structured or verifiable outputs |
| Model-based | A second LLM scores the output against a rubric | Flexible, handles nuance and open-ended tasks | Non-deterministic, requires calibration |
| Human-based | A domain expert reviews and grades | Gold standard for quality | Expensive, slow, does not scale |
The best eval suites combine all three. Use code-based graders for everything you can (did the JSON parse? are all required fields present? does the word count fall within range?). Use model-based graders for semantic quality (does this summary accurately reflect the source?). Reserve human grading for calibrating the model-based graders.4
Think of it like...
Code-based evals are a spell checker — fast, mechanical, catches obvious errors. Model-based evals are a peer review — subjective but informed. Human evals are the editor-in-chief — the final authority, but you cannot afford to have them review every paragraph.
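Here is a minimal sketch of a code-based grader for a structured output. The required fields and limits are illustrative assumptions, not a standard schema.

```python
import json

REQUIRED_FIELDS = {"organisation", "competencies", "modules"}  # illustrative schema

def code_based_grade(raw_output: str) -> dict:
    # Binary, reproducible checks: did it parse, are the fields there, is the count sane?
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"json_parses": False, "passed": False}
    checks = {
        "json_parses": True,
        "required_fields_present": REQUIRED_FIELDS <= set(data),
        "competency_count_in_range": 1 <= len(data.get("competencies", [])) <= 10,
    }
    checks["passed"] = all(checks.values())
    return checks

print(code_based_grade('{"organisation": "X", "competencies": ["a", "b"], "modules": []}'))
# {'json_parses': True, 'required_fields_present': True,
#  'competency_count_in_range': True, 'passed': True}
```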
Writing good evals: the practical checklist
The most common mistake is writing evals that are too vague. “Is this output good?” is not an eval. “Does this output contain exactly three competency areas, each with a description of at least two sentences, that directly map to sections 4.1-4.3 of the source directive?” — that is an eval.7
Start with failures. Collect 20-50 examples of real outputs that went wrong. For each, write an eval that would have caught it. This gives you an eval suite grounded in actual failure modes, not hypothetical ones.4
Grade outputs, not paths. Check what the agent produced, not how it got there. An eval that verifies the agent called specific tools in a specific order is too rigid and will break when the agent finds a better path.4
Give judges an escape hatch. When using model-based grading, include an “Unknown / Cannot determine” option. This prevents the judge from guessing when the answer is genuinely ambiguous.4
Calibrate with two experts. Before deploying a model-based grader, have two domain experts independently grade the same set of outputs. If they disagree frequently, the eval criteria are too ambiguous — tighten them before automating.7
Use pass@k and pass^k metrics. Run the agent k times on the same input. pass@k (succeeds at least once in k attempts) measures capability. pass^k (succeeds every time) measures reliability. The gap between them tells you how much variance your system has.4
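A minimal sketch of how both metrics fall out of k repeated runs; run_once is a placeholder for executing the agent once and grading the result.

```python
def run_once(task) -> bool:
    """Placeholder: run the agent on `task` once and return whether the eval passed."""
    raise NotImplementedError

def pass_metrics(task, k: int = 5) -> dict:
    results = [run_once(task) for _ in range(k)]
    return {
        "pass@k": any(results),          # capability: succeeded at least once
        "pass^k": all(results),          # reliability: succeeded every single time
        "pass_rate": sum(results) / k,   # the gap between the two is your variance signal
    }
```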
Part 5 — Eval-driven development: evals before agents
There is a growing practice called eval-driven development (EDD) that inverts the typical build sequence. Instead of building the agent first and testing it later, you write the evals first and build the agent to pass them — the same philosophy as test-driven development, adapted for probabilistic systems.9
graph TD D[Define Success Criteria] --> W[Write Evals] W --> B[Build Agent] B --> R[Run Evals] R -->|Fail| I[Improve Agent or Prompt] I --> R R -->|Pass| S[Ship] S --> M[Monitor in Production] M -->|New Failure| W style W fill:#4a9ede,color:#fff style D fill:#5cb85c,color:#fff
Why evals first?
When you write evals before the agent, you are forced to define what “good” looks like before you start building. This prevents the most common failure mode in AI projects: building something that seems to work on three examples and then discovering it fails on the fourth.9
Red Hat describes an eight-stage maturity model for eval-driven development, from manual testing with predefined conversations to full CI/CD integration with continuous monitoring. The critical insight: teams that start with evals ship faster, because every improvement is measurable.10
Applying EDD to your workflow
Consider your diploma processing system. Before building (or refactoring) the competency extraction step, you would:
- Define success: “Given directive section 4, the output must list exactly the competency domains named in the text, with no additions, omissions, or paraphrasing of domain names.”
- Write the eval: A code-based check (correct count, exact domain names) plus a model-based check (does each description accurately reflect the source paragraph?).
- Build the agent step to pass these evals.
- Run on 20+ directives to verify consistency.
If you later change the prompt or the model, the evals catch regressions immediately. This is what makes refactoring safe in AI systems.7
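A sketch of the code-based half of that eval. The expected domain names below are invented for illustration; in practice they would be parsed from the directive itself.

```python
def grade_competency_extraction(extracted: list[dict], expected_domains: list[str]) -> dict:
    names = [item["domain"] for item in extracted]
    return {
        "correct_count": len(names) == len(expected_domains),
        "exact_names_no_paraphrase": sorted(names) == sorted(expected_domains),
        "no_duplicates": len(set(names)) == len(names),
    }

# Illustrative only -- in practice, expected_domains comes from directive section 4
expected = ["Regulatory frameworks", "Risk assessment", "Stakeholder communication"]
output = [{"domain": "Regulatory frameworks", "description": "..."},
          {"domain": "Risk assessment", "description": "..."},
          {"domain": "Stakeholder communication", "description": "..."}]
print(grade_competency_extraction(output, expected))  # every check True
```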
Rule of thumb
If you cannot write an eval for a step, you do not understand the step well enough to automate it. The act of writing the eval forces you to articulate what “right” looks like — and that articulation is often the hardest and most valuable part of the work.
Part 6 — The autonomy spectrum: when to let go
Not every loop should be autonomous. The decision depends on the cost of errors, the quality of your evals, and the complexity of the judgement required. Research on AI agent autonomy identifies five levels, calibrated by the human’s role:11
graph LR L1[Level 1<br/>Human as<br/>Operator] --> L2[Level 2<br/>Human as<br/>Collaborator] L2 --> L3[Level 3<br/>Human as<br/>Consultant] L3 --> L4[Level 4<br/>Human as<br/>Approver] L4 --> L5[Level 5<br/>Human as<br/>Observer] style L1 fill:#d9534f,color:#fff style L2 fill:#f0ad4e,color:#fff style L3 fill:#4a9ede,color:#fff style L4 fill:#5cb85c,color:#fff style L5 fill:#9b59b6,color:#fff
| Level | Human role | Agent role | When to use |
|---|---|---|---|
| 1. Operator | Drives every step | Provides on-demand support | Learning the task; no evals yet |
| 2. Collaborator | Plans and reviews together | Executes and proposes | Building evals; understanding failure modes |
| 3. Consultant | Reviews when asked | Leads planning and execution | Evals cover most cases; human handles edge cases |
| 4. Approver | Approves only at checkpoints | Runs autonomously between checkpoints | Strong evals; low-risk intermediate steps |
| 5. Observer | Monitors dashboards | Fully autonomous | Mature evals; reversible actions; proven track record |
The practical progression looks like this: you start at Level 1 (operating the workflow manually). As you identify patterns, you encode them as evals and move to Level 2 (collaborating with the agent). As your eval suite matures, you grant more autonomy — but only for steps where the evals are strong enough to catch failures.11
The human-on-the-loop pattern
The transition from human-in-the-loop (approving every decision) to human-on-the-loop (monitoring and intervening on exceptions) is where most productive agentic systems live. The agent runs autonomously, but the human watches a dashboard of eval results and steps in when something is flagged.12
For your diploma workflow, this might mean:
- Autonomous (Level 4-5): Extracting organisation metadata, formatting past exam questions, generating JSON structures — steps where the eval is binary (correct fields or not)
- Supervised (Level 2-3): Competency extraction, module syllabus generation, backpropagation — steps where semantic accuracy matters and context is course-specific
- Manual (Level 1): Final cross-validation of the complete vault, checking that the system as a whole is coherent
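One way to make this explicit is a per-step autonomy map that the orchestrator checks before each step runs. The step names and level assignments below are illustrative, not prescriptive.

```python
# Autonomy is granted per step, conditioned on which evals exist for that step.
AUTONOMY_MAP = {
    "extract_org_metadata":    {"level": 5, "evals": ["code"]},
    "format_exam_questions":   {"level": 4, "evals": ["code"]},
    "extract_competencies":    {"level": 3, "evals": ["code", "model"]},
    "generate_module_syllabi": {"level": 2, "evals": ["model"]},
    "final_cross_validation":  {"level": 1, "evals": ["human"]},
}

def needs_human_checkpoint(step: str) -> bool:
    # Below "Approver" (level 4), a human reviews the output before the workflow advances
    return AUTONOMY_MAP[step]["level"] < 4
```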
Progressive autonomy is not all-or-nothing
The mistake is thinking you must choose between “I do everything” and “the agent does everything.” The right design gives different levels of autonomy to different steps, based on the quality of evals available for each step. Autonomy is earned, step by step, eval by eval.
Part 7 — Designing your loop: the scaffolding decisions
Building a good agentic loop is an exercise in decomposition. You are not asking “can an agent do this entire workflow?” You are asking, for each step: “what would it take to let an agent do this unsupervised?”3
Decision framework
For each step in your workflow, answer these questions:
graph TD Q1{Can I write a<br/>code-based eval?} -->|Yes| A1[Automate with<br/>code-based check] Q1 -->|No| Q2{Can I write a<br/>model-based eval?} Q2 -->|Yes| Q3{Is the cost of<br/>a false pass low?} Q3 -->|Yes| A2[Automate with<br/>model-based eval] Q3 -->|No| A3[Human-on-the-loop<br/>with model eval as filter] Q2 -->|No| A4[Keep human<br/>in the loop] style Q1 fill:#4a9ede,color:#fff style Q2 fill:#4a9ede,color:#fff style Q3 fill:#4a9ede,color:#fff
Anchoring against drift
Every loop needs an anchor — a stable reference that prevents semantic drift across iterations. Without an anchor, each revision uses the previous revision as its reference, and small shifts accumulate.6
Effective anchoring strategies:
- Source documents as context: Always include the original directive (or relevant section) in every evaluation prompt, not just the agent’s previous output
- Frozen criteria: Define acceptance criteria once, store them in a file, and never let the agent modify them
- Episodic checkpoints: After every N iterations, compare the current output to the original input, not to the previous iteration
- context-cascading: Structure your context so that foundational rules appear at the top of every prompt and are never compressed away
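A minimal sketch of the first two strategies (source-as-context and frozen criteria) applied to an evaluation prompt; the wording and rubric are illustrative assumptions.

```python
FROZEN_CRITERIA = """\
- Domain names must match the directive exactly (no paraphrasing).
- Every claim must be traceable to a named directive section.
"""

def build_eval_prompt(directive_section: str, current_output: str) -> str:
    # The original source is included in full on every pass -- the anchor --
    # so each evaluation compares against the directive, not the previous revision.
    return (
        "You are grading an extraction against its source.\n\n"
        f"SOURCE DIRECTIVE (authoritative, never reinterpret):\n{directive_section}\n\n"
        f"ACCEPTANCE CRITERIA (frozen):\n{FROZEN_CRITERIA}\n"
        f"CURRENT OUTPUT:\n{current_output}\n\n"
        "Answer Pass, Fail, or Unknown, with a one-sentence justification."
    )
```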
Stop conditions
Every loop must have a hard stop. Without one, a loop that cannot satisfy its eval will iterate forever, consuming tokens and potentially degrading output. Anthropic’s guidance is explicit: always define maximum iterations and have the agent escalate to a human when the limit is reached.3
Typical stop conditions:
- Maximum 3 revision attempts per output
- If eval score decreases between iterations, stop and escalate
- If token budget exceeds threshold, checkpoint and pause
- If eval passes, advance immediately (do not over-refine)
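A sketch of those stop conditions wrapped around a revision loop. The revise and score functions are placeholders for your generator and eval, and the threshold is illustrative; a token-budget guard would follow the same shape.

```python
PASS_THRESHOLD = 0.8  # illustrative

def revise(task, previous_output):
    """Placeholder: generate (or regenerate) the output, given the prior attempt."""
    raise NotImplementedError

def score(task, output) -> float:
    """Placeholder: your eval, returning a score in [0, 1]."""
    raise NotImplementedError

def refine_with_stops(task, max_attempts=3):
    best, output = None, None
    for _ in range(max_attempts):
        output = revise(task, output)
        current = score(task, output)
        if current >= PASS_THRESHOLD:
            return output                  # pass: advance immediately, do not over-refine
        if best is not None and current < best:
            raise RuntimeError("Eval score decreased between iterations -- stop and escalate")
        best = current
    raise RuntimeError("Maximum revision attempts reached -- escalate to a human")
```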
The shift-worker pattern
For long-running, complex workflows (like your diploma vault), Anthropic recommends treating agents like shift workers. Each session starts by reading a progress file (what has been done, what is next), works on a single well-scoped task, commits its work with descriptive metadata, and updates the progress file for the next session.13
This pattern prevents the context degradation that occurs in very long sessions (65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning).14 It also makes the workflow resumable and auditable — every shift’s work is a discrete, reviewable unit.
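A minimal sketch of the shift-worker handoff, assuming a simple JSON progress file; the file name and schema are illustrative.

```python
import json
from pathlib import Path

PROGRESS_FILE = Path("progress.json")  # illustrative name and schema

def start_shift() -> dict:
    # Read what previous sessions have done; start fresh if this is the first shift
    if PROGRESS_FILE.exists():
        return json.loads(PROGRESS_FILE.read_text())
    return {"completed": [], "notes": []}

def end_shift(state: dict, task: str, summary: str) -> None:
    # Record this shift's single, well-scoped unit of work for the next session
    state["completed"].append({"task": task, "summary": summary})
    state["notes"].append(summary)
    PROGRESS_FILE.write_text(json.dumps(state, indent=2))
```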
The full map
graph TD subgraph Foundation CR[Compound Reliability<br/>Why drift happens] DT[Drift Types<br/>Semantic, coordination,<br/>behavioural] end subgraph Loop Design AL[Agentic Loops<br/>Plan-Execute-Evaluate-Revise] IL[Inner and Outer Loops<br/>Nesting for complexity] end subgraph Evaluation EV[Evals<br/>Code, model, human] EDD[Eval-Driven Development<br/>Evals before agents] end subgraph Autonomy PA[Progressive Autonomy<br/>5 levels] HOTL[Human-on-the-Loop<br/>Monitor, not approve] end CR --> AL DT --> EV AL --> IL EV --> EDD EDD --> PA IL --> PA PA --> HOTL style EV fill:#4a9ede,color:#fff style EDD fill:#5cb85c,color:#fff style AL fill:#f0ad4e,color:#fff
What you should understand now
After reading this path, you should be able to:
- Explain why multi-step AI systems degrade without evaluations, and calculate the compound effect
- Describe the four phases of an agentic loop and distinguish inner loops from outer loops
- Identify the three types of drift (semantic, coordination, behavioural) and name anchoring strategies for each
- Choose the right grading method (code-based, model-based, human) for a given step in your workflow
- Write an eval that catches real failure modes, not hypothetical ones
- Apply eval-driven development: define success criteria before building the agent
- Place each step of a workflow on the autonomy spectrum, granting independence only where evals are strong enough
Check your understanding
Test yourself
- Calculate the overall reliability of a 12-step pipeline where each step has 92% reliability. What does this tell you about the need for evaluation checkpoints?
- Distinguish between semantic drift and behavioural drift. Give a concrete example of each in a document-processing workflow.
- Design an eval suite for a step that extracts exam questions from a PDF. Which grading methods would you use and why?
- Explain why eval-driven development inverts the typical build sequence, and why this matters more for AI systems than for traditional software.
- Evaluate this claim: “A fully autonomous agent is always better than a human-in-the-loop system.” Under what conditions is this true, and when is it dangerous?
Where to go next
Path A — Build your first eval suite
If you want to start writing evals for your existing workflows, begin with Hamel Husain’s Your AI Product Needs Evals and Anthropic’s Demystifying Evals for AI Agents. Then explore evaluator-optimiser for the simplest loop pattern to implement.
Path B — Understand agent architecture more deeply
If you want to understand how production agent systems are structured, read How to Think About AI Systems and follow with Anthropic’s Building Effective Agents. Then explore orchestration and playbooks-as-programs.
Path C — Explore progressive autonomy in practice
If you want to design a system that moves from supervised to autonomous, study Anthropic’s Measuring AI Agent Autonomy and explore the human-in-the-loop and guardrails concept cards.
Sources
Further reading
Resources
- Building Effective Agents (Anthropic) — The foundational guide to agentic patterns from Anthropic’s research team
- Demystifying Evals for AI Agents (Anthropic) — Practical guide to grading methods, metrics, and eval design
- Your AI Product Needs Evals (Hamel Husain) — The most practical introduction to building eval infrastructure
- LLM-as-Judge Complete Guide (Hamel Husain) — Seven-step methodology for calibrating model-based evaluators
- Product Evals in Three Simple Steps (Eugene Yan) — Concise three-step process for shipping evals
- Eval-Driven Development (Red Hat) — Eight-stage maturity model for eval-driven agent development
- Measuring AI Agent Autonomy in Practice (Anthropic) — How real-world autonomy adoption patterns emerge from millions of sessions
- Effective Harnesses for Long-Running Agents (Anthropic) — The shift-worker pattern for complex, multi-session workflows
Footnotes
1. Artiquare. (2025). Why Multi-Agent AI Fails: The 0.95^10 Problem. Artiquare. Analysis of compound reliability decay in multi-step agent systems, applying Lusser's Law.
2. Towards Data Science. (2025). The Multi-Agent Trap. Towards Data Science. Analysis of 1,642 execution traces showing failure rates and coordination breakdowns.
3. Anthropic. (2025). Building Effective Agents. Anthropic Research. Foundational guide to agentic patterns, distinguishing workflows from agents.
4. Anthropic. (2026). Demystifying Evals for AI Agents. Anthropic Engineering. Comprehensive guide to grading methods, eval design, and metrics.
5. Harrison Chase. (2025). In the Loop. LangChain Blog. Series on agent loop patterns, inner/outer loops, and LangGraph architecture.
6. Li, Z. et al. (2026). Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions. arXiv. Research paper introducing the Agent Stability Index and three-type drift taxonomy.
7. Husain, H. (2025). Your AI Product Needs Evals. hamel.dev. Practical guide to building eval infrastructure as the foundation for AI product development.
8. Husain, H. (2025). LLM-as-Judge Complete Guide. hamel.dev. Seven-step methodology for building model-based evaluators calibrated against human experts.
9. Eval-Driven Development. (2026). Eval-Driven Development Manifesto. evaldriven.org. Ten core tenets of the eval-first development methodology.
10. Red Hat. (2026). Eval-Driven Development: Build & Evaluate AI Agents. Red Hat Developers. Eight-stage maturity model for eval-driven development.
11. Huang, Y. et al. (2025). Levels of Autonomy for AI Agents. arXiv. Academic framework defining five autonomy levels by user role.
12. ByteBridge. (2025). From Human-in-the-Loop to Human-on-the-Loop. Medium. Practical spectrum from HITL to HOTL with permission scopes and fallback mechanisms.
13. Anthropic. (2026). Effective Harnesses for Long-Running Agents. Anthropic Engineering. Patterns for long-running agent sessions, including the shift-worker model.
14. CIO. (2026). Agentic AI Systems Don't Fail Suddenly — They Drift Over Time. CIO. Enterprise perspective on context drift as the leading cause of agent failure.
