Evaluator-Optimiser

A pipeline pattern where one LLM generates output, another evaluates it against quality criteria, and the generator revises until the output passes or an iteration limit is reached.


What is it?

prompt-chaining processes tasks in a straight line: step 1 feeds step 2 feeds step 3, moving forward without looking back. This works when the first attempt at each step is good enough. But for quality-sensitive tasks — literary translation, legal drafting, code generation with test suites — the first attempt is rarely good enough. You need a way to improve the output through iteration, not just check it with a pass/fail gate.1

The evaluator-optimiser pattern adds this iterative refinement. Instead of a single forward pass, it creates a loop: one LLM call generates output, a second LLM call evaluates that output against explicit criteria and provides detailed feedback, and the generator uses that feedback to produce an improved version. The cycle repeats until the evaluator is satisfied or a maximum number of iterations is reached.1

Anthropic’s Building Effective Agents guide identifies this as a distinct workflow pattern, describing it as “particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value.” The two signs of good fit are: first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback autonomously.1

The parent concept, llm-pipelines, frames the evaluator-optimiser as one of several pipeline patterns. Where prompt-chaining is the simplest pattern (a straight line) and parallelisation handles independent subtasks concurrently, the evaluator-optimiser handles tasks where quality emerges through iterative refinement rather than from a single pass.

In plain terms

Think of writing an essay with a tutor looking over your shoulder. You write a draft, the tutor reads it, marks up what needs improvement (“this paragraph is unclear,” “this claim needs evidence”), and hands it back. You revise and resubmit. The tutor checks again. This continues until the tutor says the essay is ready — or until you run out of revision rounds.



How does it work?

1. The generator — producing the initial output

The generator is an LLM call whose job is to produce output for a given task. On the first iteration, it works from the original input. On subsequent iterations, it receives the original input plus the evaluator’s feedback from the previous round.2

The generator’s prompt should be focused on the task itself, not on self-evaluation. Trying to generate and evaluate simultaneously splits the model’s attention and degrades both capabilities. Separating the two roles into distinct LLM calls is the core design insight of this pattern.2
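
A minimal sketch of the generator role, assuming Python; `call_llm` is a stand-in for whatever model client you use, not a real API:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for your model client of choice (an assumption here)."""
    raise NotImplementedError

def generate(task: str, previous_draft: str | None = None,
             feedback: str | None = None) -> str:
    """Produce a draft. On revision rounds, include the prior draft and
    the evaluator's notes so the model revises rather than starts over."""
    prompt = f"Task:\n{task}\n"
    if previous_draft and feedback:
        prompt += (
            f"\nYour previous draft:\n{previous_draft}\n"
            f"\nReviewer feedback to address:\n{feedback}\n"
            "\nRevise the draft to address every point of feedback."
        )
    # The prompt asks only for generation: no self-evaluation.
    return call_llm(prompt)
```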

Think of it like...

A writer drafting a chapter. The writer’s job is to write the best version they can, focusing entirely on the content. They do not simultaneously try to be their own editor — that is someone else’s job. On revision rounds, the writer receives the editor’s notes and addresses each one specifically.


2. The evaluator — scoring and providing feedback

The evaluator is a separate LLM call that receives the generator’s output and assesses it against predefined criteria. The evaluator’s output is not just a pass/fail decision — it provides specific, actionable feedback that the generator can use to improve.1

The quality of the evaluator’s criteria directly determines the quality of the refinement loop. There are three common approaches to defining evaluation criteria:2

| Criteria type | When to use | Example |
| --- | --- | --- |
| Rubric | Multi-dimensional assessment with granular scores | Score clarity (1-5), accuracy (1-5), completeness (1-5); pass if all scores are 4 or above |
| Checklist | When specific requirements must all be met | Does the output include a summary? Are all claims cited? Is the word count within range? |
| Binary pass/fail | Simple quality gates | Does the code pass all unit tests? Does the translation preserve the original meaning? |

The evaluator should output both a decision (pass or fail) and detailed feedback explaining what needs to change. Vague feedback (“make it better”) produces vague revisions. Specific feedback (“paragraph 3 makes an unsupported claim about market size — add a citation or remove the claim”) produces targeted improvements.3
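
A matching sketch of the evaluator, using the rubric approach from the table above and reusing the hypothetical `call_llm` client from the generator sketch; the JSON response shape is an illustrative choice, not a fixed contract:

```python
import json

def evaluate(task: str, draft: str) -> dict:
    """Assess a draft against an explicit rubric; return a verdict plus
    specific, actionable feedback the generator can act on."""
    prompt = (
        "You are a reviewer. Score the draft against this rubric:\n"
        "- clarity (1-5)\n- accuracy (1-5)\n- completeness (1-5)\n"
        "Pass only if every score is 4 or above.\n\n"
        f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
        'Reply with JSON only: {"pass": true|false, '
        '"scores": {"clarity": n, "accuracy": n, "completeness": n}, '
        '"feedback": "specific, actionable notes"}'
    )
    return json.loads(call_llm(prompt))  # call_llm: see generator sketch
```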


3. The feedback loop — iterative refinement

The power of this pattern comes from the feedback loop itself. Each iteration narrows the gap between the current output and the quality target. The generator does not start from scratch on each iteration — it receives its previous output alongside the evaluator’s feedback and makes targeted revisions.2

This is structurally different from a validation gate in a prompt chain. A gate is binary: pass or fail. If the output fails, the step retries from scratch or halts. The evaluator-optimiser, by contrast, provides directional feedback — it tells the generator what is wrong and how to fix it, enabling incremental improvement rather than blind retry.1

Research on iterative refinement with LLMs shows that this approach reliably improves output quality across diverse tasks — code generation, translation, summarisation, and creative writing — because each iteration addresses specific deficiencies identified by the evaluator rather than hoping that a fresh attempt will randomly produce a better result.4
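
Wiring the two sketches together gives the loop itself; the iteration cap anticipates the circuit breakers covered in the next section:

```python
def refine(task: str, max_iterations: int = 3) -> tuple[str, bool]:
    """Generate, evaluate, and revise until the evaluator passes the
    draft or the iteration cap is reached."""
    draft = generate(task)
    for attempt in range(max_iterations):
        verdict = evaluate(task, draft)
        if verdict["pass"]:
            return draft, True
        if attempt == max_iterations - 1:
            break  # circuit breaker: stop revising
        # Directional feedback flows back to the generator: a targeted
        # revision, not a blind retry from scratch.
        draft = generate(task, previous_draft=draft,
                         feedback=verdict["feedback"])
    return draft, False
```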

Key distinction

A gate asks “is this good enough?” and answers yes or no. An evaluator asks “what specifically is wrong and how should it be fixed?” Gates catch problems. Evaluators diagnose and guide the fix. Start with gates (simpler, cheaper). Upgrade to an evaluator-optimiser loop when gate failures are frequent and the fixes require nuanced judgement.1


4. Circuit breakers — preventing infinite loops

Without a maximum iteration count, a strict evaluator could keep rejecting output indefinitely — the generator and evaluator locked in an endless cycle of revision.2 Circuit breakers solve this by imposing a hard cap on iterations.

When the iteration limit is reached without passing evaluation, the system has two options:

  1. Return the best attempt so far with a quality flag indicating it did not fully pass. This is appropriate when partial quality is better than no output.
  2. Escalate to human review. The system surfaces the latest draft alongside the evaluator’s feedback, letting a human make the final call.3
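
In code, the choice between the two options might look like the sketch below, layered on the `refine` loop from the previous section; which branch you take is a product decision, and the result shape is illustrative:

```python
def refine_or_escalate(task: str) -> dict:
    """Run the refinement loop, then apply one of the two exit options."""
    draft, passed = refine(task)
    if passed:
        return {"draft": draft, "status": "approved"}
    # Option 1: return the unapproved draft with a quality flag, e.g.
    #   {"draft": draft, "status": "did_not_pass_evaluation"}
    # Option 2 (shown here): surface it for human review instead.
    return {"draft": draft, "status": "needs_human_review"}
```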

In practice, most evaluator-optimiser loops converge within 2-3 iterations. If the loop regularly hits its cap, the evaluation criteria may be too strict, the generator’s prompt may need improvement, or the task may be too difficult for the model to handle without human guidance.2

Think of it like...

A student submitting a university assignment. The professor allows up to 3 resubmissions. Each resubmission includes the professor’s feedback from the previous version. If after 3 attempts the work still does not meet the standard, the student meets with the professor to discuss (human escalation). Without the resubmission cap, the student might endlessly revise a single assignment instead of moving forward.


5. Where human review intersects with automated evaluation

The evaluator-optimiser pattern does not replace human judgement — it automates the routine layers of quality assessment so that humans focus on what requires genuine expertise.3

A common production architecture layers automated and human evaluation:

| Layer | What it checks | How |
| --- | --- | --- |
| Automated evaluator (LLM) | Structural criteria: format, completeness, factual consistency | LLM call with criteria rubric |
| Automated evaluator (code) | Deterministic criteria: test cases pass, word count within range, no prohibited terms | Programmatic checks |
| Human reviewer | Subjective criteria: appropriateness, strategic alignment, nuance | Human-in-the-loop checkpoint |

The automated layers handle the bulk of iterations. The human layer handles the cases that require judgement beyond what the LLM can reliably provide. This human-in-the-loop integration means the evaluator-optimiser loop does not need to be perfect — it needs to be good enough to reduce the human reviewer’s workload to the cases that genuinely need human attention.3
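
The code layer is the cheapest to run, so it typically goes first. A sketch of what those programmatic checks might look like; the specific rules and thresholds are illustrative assumptions:

```python
import re

def deterministic_checks(draft: str) -> list[str]:
    """Programmatic evaluation layer: fast, deterministic, and run
    before any LLM evaluator. Returns a list of failures; an empty
    list means the draft passes this layer."""
    failures = []
    word_count = len(draft.split())
    if not 300 <= word_count <= 800:  # illustrative range
        failures.append(f"word count {word_count} outside 300-800 range")
    if "Summary:" not in draft:
        failures.append("missing required 'Summary:' section")
    if re.search(r"\b(guaranteed|risk-free)\b", draft, re.IGNORECASE):
        failures.append("contains a prohibited term")
    return failures
```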


Why do we use it?

Key reasons

1. Quality through iteration. For quality-sensitive tasks, iterative refinement consistently outperforms single-shot generation. The first draft captures the broad strokes; subsequent iterations address specific deficiencies identified by the evaluator.1

2. Explicit quality criteria. The pattern forces you to define what “good enough” means before generation starts. This clarity benefits the entire pipeline — the generator knows what to aim for, the evaluator knows what to check, and the human reviewer knows what has already been verified.2

3. Separation of concerns. Generation and evaluation are fundamentally different cognitive tasks. Separating them into distinct LLM calls allows each to focus fully on its role. An LLM trying to generate and self-evaluate simultaneously compromises on both.2

4. Auditability. Every iteration produces an inspectable artefact: the draft, the evaluation, and the feedback. If the final output has a problem, you can trace back through the iteration history to understand how it was (or was not) caught.3


When do we use it?

  • When quality criteria are clear and measurable — rubrics, checklists, test suites, style guides
  • When a first draft is reliably improvable with specific feedback (the LLM can act on critique)
  • When human feedback demonstrably improves output — if a human can articulate what is wrong and the result improves, an LLM evaluator can likely do the same1
  • When the task involves creative or nuanced generation — translation, writing, code — where the gap between first draft and polished output is significant
  • When compliance or accuracy standards require verification before output is used

Rule of thumb

If you find yourself repeatedly asking the model to “try again” or “make it better” and the output actually improves each time, you have identified a task that would benefit from an evaluator-optimiser loop. Automate the critique and the iteration.1


How can I think about it?

The editor and the writer

A publishing house does not ship the writer’s first draft. The process works like this:

  • The writer produces a manuscript (generator)
  • The editor reads it and marks it up with specific feedback: “chapter 3 drags, tighten the pacing,” “the protagonist’s motivation is unclear in scene 7,” “this dialogue feels stilted” (evaluator)
  • The writer revises based on the editor’s notes and resubmits (refinement loop)
  • The cycle repeats — typically 2-3 rounds — until the editor approves (pass criteria)
  • If the manuscript still is not ready after the agreed number of rounds, the editor and writer meet to discuss the path forward (circuit breaker / human escalation)

The writer focuses on writing. The editor focuses on quality. Neither tries to do the other’s job. The manuscript improves with each round because the feedback is specific and actionable.

The pottery wheel

A potter shaping a bowl on a wheel works in iterative refinement cycles:

  • Shape the clay with your hands (generate)
  • Step back and look at the form — is it symmetrical? Is the wall thickness even? Is the rim smooth? (evaluate)
  • Adjust based on what you see — thin out a thick spot, smooth a wobble, widen the opening (refine)
  • Repeat until the form is right or the clay has been worked too long and needs to rest (circuit breaker)

The potter does not try to create a perfect bowl in one motion. The quality emerges from the cycle of shaping, assessing, and adjusting. Each cycle addresses specific imperfections. And there is a natural limit — overworking the clay makes it worse, just as too many LLM iterations can degrade rather than improve output.


Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| parallelisation | Running independent subtasks concurrently instead of sequentially | complete |
| human-in-the-loop | Integrating human judgement at checkpoints within automated pipelines | stub |
| guardrails | Safety constraints that bound what an LLM can produce or act on | stub |

Some cards don't exist yet

A broken link is a placeholder for future learning, not an error.



Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    AS[Agentic Systems] --> LP[LLM Pipelines]
    LP --> PC[Prompt Chaining]
    LP --> PR[Prompt Routing]
    LP --> PAR[Parallelisation]
    LP --> EO[Evaluator-Optimiser]
    LP --> CC[Context Cascading]
    LP --> RAG[RAG]
    style EO fill:#4a9ede,color:#fff
```

Related concepts:

  • parallelisation — where the evaluator-optimiser iterates for quality, parallelisation runs tasks concurrently for speed; the two patterns are complementary
  • human-in-the-loop — human review is the escalation path when automated evaluation reaches its limits or when subjective judgement is required
  • guardrails — guardrails constrain what the generator can produce; the evaluator checks whether those constraints (and others) are met


Footnotes

  1. Schluntz, E. and Zhang, B. (2024). Building Effective Agents. Anthropic.

  2. Hopx.ai. (2025). Evaluator-Optimizer Loop: Continuous AI Agent Improvement. Hopx.ai.

  3. Carpintero, D. (2025). Design Patterns for Building Agentic Workflows. Hugging Face.

  4. Phoenix, J. (2026). The Evaluator-Optimizer Loop: Evolutionary Search for Anything You Can Judge. Understanding Data.