Harness Engineering

The discipline of building everything around an AI model that makes it reliable — the tools, permissions, tests, feedback loops, guardrails, and verification systems that turn a capable but unpredictable model into a trustworthy agent.


What is it?

A language model on its own is like a brilliant new hire on their first day: full of capability, no idea where the files are, no understanding of your testing process, and no one checking their work. You would not give a new hire unsupervised access to production on day one. You would surround them with onboarding documents, code review, automated tests, and a colleague who can answer questions. That surrounding system — applied to an AI model — is a harness.1

In February 2026, Mitchell Hashimoto (co-founder of HashiCorp) published an account of his AI adoption journey and gave the practice its name. He described a six-stage progression, with stage five labelled “Engineer the Harness.” His core equation was simple:2

Agent = Model + Harness

The model provides the intelligence. The harness makes that intelligence useful, reliable, and safe. It is everything in an AI system except the model itself: the tools it can call, the documents it reads before acting, the tests that verify its output, the guardrails that constrain its actions, and the feedback loops that enable self-correction.2

Birgitta Böckeler, Distinguished Engineer at Thoughtworks, published the most comprehensive treatment of the discipline in April 2026, formalising the concepts of guides and sensors as the two fundamental categories of harness controls.1 The article, hosted on Martin Fowler’s site, became the de facto reference for harness engineering.

In plain terms

A harness is the system you build around an AI model so it can work reliably without constant supervision. If context-engineering is about what the model knows, harness engineering is about what the model can do, how its work is checked, and what happens when something goes wrong.


How does it work?

Guides — steering before action

Guides are feedforward controls: they shape the model’s behaviour before it generates output. They answer the question “what does the agent need to know and follow before it starts working?”1

| Guide type | What it does | Example |
| --- | --- | --- |
| Documentation | Loads architecture, conventions, and domain knowledge | AGENTS.md, coding standards, API specs |
| Bootstrap scripts | Ensures the agent starts in a known-good state | Running tests to verify the environment, checking out the right branch |
| Constraints | Sets explicit boundaries on what the agent can and cannot do | “Only modify files in /src. Never delete tests.” |
| Task specifications | Defines what “done” looks like before work begins | Acceptance criteria, output format, success metrics |
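
To make this concrete, here is one way a harness might represent its guides as configuration. This is a minimal sketch: the `HarnessGuides` shape and every field name are illustrative assumptions, not taken from any specific framework.

```typescript
// A hypothetical shape for a harness's guides. All names here are
// illustrative assumptions, not an actual framework's API.
interface HarnessGuides {
  documentation: string[]; // files loaded into context before work begins
  bootstrapCommands: string[]; // must succeed before the agent starts
  constraints: string[]; // explicit boundaries, injected into every prompt
  taskSpec: {
    description: string;
    acceptanceCriteria: string[]; // what "done" looks like
  };
}

const guides: HarnessGuides = {
  documentation: ["AGENTS.md", "docs/coding-standards.md"],
  bootstrapCommands: ["git checkout main", "npm test"],
  constraints: ["Only modify files in /src.", "Never delete tests."],
  taskSpec: {
    description: "Add input validation to the signup form",
    acceptanceCriteria: [
      "All existing tests pass",
      "New validation logic has its own unit tests",
    ],
  },
};
```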

Hashimoto’s insight was that the same model behaves radically differently depending on what guides surround it. A model with a well-written AGENTS.md file, clear coding conventions, and a bootstrap script that runs the test suite before starting work produces dramatically better results than the same model dropped into a project with no context.2

Think of it like...

A GPS navigation system. The GPS (model) knows how to find routes. But the guides — the destination you set, the preference for avoiding tolls, the restriction against U-turns — determine which route it actually takes. Without guides, the GPS might find a route. With guides, it finds your route.

Sensors — checking after action

Sensors are feedback controls: they observe the model’s output and determine whether it meets quality standards. They answer the question “did the agent do what it was supposed to do?”1

Böckeler identifies two categories:

Computational sensors — fast, deterministic, cheap:

  • Linters and formatters (does the code follow style rules?)
  • Type checkers (does the code compile?)
  • Test suites (do the tests pass?)
  • Schema validators (does the JSON match the expected shape?)
  • Regex checks (does the output contain required fields?)

Inferential sensors — slower, richer judgement, higher cost:

  • AI-powered code review (is the logic sound?)
  • Semantic similarity checks (does this summary capture the source material?)
  • Model-as-judge evaluations (does this output meet the rubric?)
  • Anomaly detection (is this output unusually different from previous outputs?)

The most effective harnesses use computational sensors first (they are fast and cheap to run) and reserve inferential sensors for checks that require subjective judgement.1

```mermaid
graph LR
    O[Agent Output] --> CS[Computational Sensors]
    CS -->|Pass| IS[Inferential Sensors]
    CS -->|Fail| FB1[Fast Feedback]
    IS -->|Pass| D[Deliver]
    IS -->|Fail| FB2[Rich Feedback]
    FB1 --> R[Revise]
    FB2 --> R

    style CS fill:#5cb85c,color:#fff
    style IS fill:#e8b84b,color:#fff
```
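
In code, that ordering might look something like the sketch below. The `SensorResult` type and the two sensor lists are assumptions for illustration; a real harness would wire in actual linters, test runners, and model-as-judge calls.

```typescript
// Illustrative sensor pipeline: run cheap deterministic checks first,
// and only pay for model-based judgement if they all pass.
type SensorResult = { pass: boolean; feedback: string };

async function checkOutput(
  output: string,
  computational: Array<(o: string) => SensorResult>, // linters, tests, validators
  inferential: Array<(o: string) => Promise<SensorResult>>, // model-as-judge etc.
): Promise<SensorResult> {
  for (const sensor of computational) {
    const result = sensor(output);
    if (!result.pass) return result; // fast feedback; skip expensive checks
  }
  for (const sensor of inferential) {
    const result = await sensor(output);
    if (!result.pass) return result; // rich feedback
  }
  return { pass: true, feedback: "All sensors passed" };
}
```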

Think of it like...

A car factory’s quality control. Computational sensors are the automated checks on the assembly line — does the door align to within 0.5mm? Does the paint match the colour code? Inferential sensors are the human test drivers who evaluate how the car feels to drive. Both are essential, but you run the cheap automated checks before the expensive human ones.

Custom feedback loops

The most powerful technique in Böckeler’s framework is custom linter messages. Instead of generic error outputs, the harness provides linter messages that include instructions for how to fix the issue. The model reads the feedback, understands what went wrong, and self-corrects without human intervention.1

For example, a standard linter might say: Error: missing return type on function validateEmail

A harness-aware linter adds: Error: missing return type on function validateEmail. Add an explicit TypeScript return type annotation. Check the project's type conventions in src/types/README.md.

The second message gives the model enough context to fix the issue autonomously. This is how feedback loops become self-correcting loops.
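
A sketch of how a harness might do this enrichment, assuming a hypothetical rule-ID-to-hint table (the rule names, hints, and file path are illustrative, not from a real linter config):

```typescript
// Hypothetical enrichment layer over a linter's raw output. The rule IDs,
// hints, and paths are illustrative, not from a real linter config.
const fixHints: Record<string, string> = {
  "missing-return-type":
    "Add an explicit TypeScript return type annotation. " +
    "Check the project's type conventions in src/types/README.md.",
  "unused-import": "Remove the import or use the symbol it brings in.",
};

function enrichLinterError(error: { ruleId: string; message: string }): string {
  const hint = fixHints[error.ruleId];
  return hint ? `${error.message} ${hint}` : error.message;
}
```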

Evaluations — the foundation

Evaluations (evals) are the single most important component of a harness. Without them, guides and sensors have nothing to measure against. An eval is a systematic test that measures whether the system produces correct, useful, safe output.3

The evaluator-optimiser pattern is the simplest implementation: one LLM generates output, another evaluates it against explicit criteria, and the generator revises based on feedback until the eval passes or an iteration limit is reached.
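
A minimal sketch of that loop, with `generate` and `evaluate` standing in for two LLM calls (both are placeholder names, not a real API):

```typescript
// Minimal evaluator-optimiser loop: generate, evaluate, revise until the
// eval passes or the iteration budget runs out.
type Verdict = { pass: boolean; feedback: string };

async function evaluatorOptimiser(
  task: string,
  generate: (task: string, feedback?: string) => Promise<string>, // generator LLM
  evaluate: (output: string) => Promise<Verdict>, // evaluator LLM
  maxIterations = 3,
): Promise<string> {
  let feedback: string | undefined;
  let output = "";
  for (let i = 0; i < maxIterations; i++) {
    output = await generate(task, feedback); // revise using prior feedback
    const verdict = await evaluate(output);
    if (verdict.pass) return output;
    feedback = verdict.feedback;
  }
  return output; // iteration limit reached; caller decides what to do next
}
```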

In production harnesses, evals operate at multiple levels:

| Level | What it tests | Frequency |
| --- | --- | --- |
| Step-level | Does this individual output meet criteria? | Every generation |
| Task-level | Does the completed task satisfy the specification? | After each task |
| Session-level | Is the agent’s behaviour consistent and non-degrading? | End of session |
| Regression | Did a system change break something that worked? | After every harness modification |
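
As one example of the regression level, a harness might keep a suite of recorded tasks with deterministic pass criteria and re-run them after every change. A sketch, with all names assumed for illustration:

```typescript
// Hypothetical regression suite: recorded tasks with known-good pass
// criteria, re-run after every harness modification.
interface RegressionCase {
  task: string;
  check: (output: string) => boolean; // deterministic pass/fail criterion
}

async function runRegression(
  cases: RegressionCase[],
  runAgent: (task: string) => Promise<string>,
): Promise<{ passed: number; failed: string[] }> {
  const failed: string[] = [];
  for (const c of cases) {
    const output = await runAgent(c.task);
    if (!c.check(output)) failed.push(c.task);
  }
  return { passed: cases.length - failed.length, failed };
}
```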

The hierarchy of importance

If you can only build one thing, build evals. Evals tell you whether your guides are working. Evals tell you whether your sensors are catching real problems. Evals tell you whether a change to the model, the prompt, or the context made things better or worse. Everything else in the harness depends on the ability to measure.3

The proof: harness beats model

The most compelling evidence for harness engineering comes from LangChain’s DeepAgent project. Their coding agent moved from outside the Top 30 to the Top 5 on Terminal Bench 2.0 — an improvement of 13.7 points (from 52.8 to 66.5) — by changing only the harness while keeping the same model (GPT-5.2-Codex). The improvements included self-verification loops, loop detection, context management, and time budgeting.4
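
To give a flavour of one such mechanism, a simplified loop detector might flag an agent that repeats the same sequence of actions. This is an illustration of the idea only, not LangChain’s actual implementation:

```typescript
// Illustrative loop detector: fires when the last N actions exactly repeat
// the N actions before them. Simplified for exposition.
function detectLoop(actionHistory: string[], windowSize = 3): boolean {
  if (actionHistory.length < windowSize * 2) return false;
  const recent = actionHistory.slice(-windowSize).join("|");
  const previous = actionHistory.slice(-windowSize * 2, -windowSize).join("|");
  return recent === previous;
}
```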

The implication is direct: a mediocre model in an excellent harness outperforms an excellent model with no harness. Ethan Mollick observes that “the same model can behave very differently depending on what harness it’s operating in.”5


Why do we use it?

Key reasons

1. Models are capable but unreliable. LLMs can generate excellent output — but also confidently wrong output. A harness catches errors before they reach users or propagate through multi-step workflows.1

2. The harness is where reliability lives. You cannot make AI reliable by finding a better model. You make it reliable by building a system that tests, constrains, and improves the model’s output. The model is the engine; the harness is everything else.2

3. It enables progressive autonomy. Without a harness, you must supervise every step. With a strong harness — particularly strong evals — you can grant the model increasing independence, reserving human-in-the-loop checkpoints for genuinely ambiguous situations.1

4. It makes improvements measurable. When you change a prompt, a context structure, or a model version, the harness’s evals tell you whether the change helped or hurt. Without this, optimisation is guesswork.3


When do we use it?

  • When deploying AI that runs without constant human supervision (agents, automated workflows)
  • When AI output goes to users or downstream systems where errors have consequences
  • When working with multi-step pipelines where errors in early steps compound
  • When iterating on prompts or context and needing to measure whether changes actually improve output
  • When building long-running agents that must maintain quality across extended sessions
  • When the cost of failure is high enough to justify verification infrastructure

Rule of thumb

If you would code-review a human’s work before shipping it, you should harness-check an AI’s work before trusting it. The harness is the code review process for machines.


How can I think about it?

The cockpit analogy

A pilot (model) has the skill to fly. But the cockpit (harness) provides everything that makes flying safe and reliable:

  • Checklists (guides) — the pre-flight checks that ensure the plane is in a known-good state before takeoff
  • Instruments (computational sensors) — altimeter, airspeed indicator, fuel gauge. Fast, deterministic, always running.
  • Radar and weather reports (inferential sensors) — richer, more interpretive information about conditions ahead
  • Autopilot constraints (guardrails) — limits on what the autopilot can do without human approval (no landing below visibility minimums)
  • Black box (logging) — records everything for post-incident analysis
  • Air traffic control (human-in-the-loop) — a human who intervenes for high-stakes decisions

A pilot without a cockpit can still fly. But you would not board that plane.

The theatre analogy

A harness is like a theatre production around a lead actor:

  • The script and director’s notes (guides) — what the actor should say and the intent behind each scene
  • The rehearsal process (evals) — running the scene repeatedly, checking against quality criteria, refining
  • The stage manager (sensors) — monitors timing, cues, lighting, and flags when something is off
  • The safety net below the trapeze (guardrails) — limits the consequences of a mistake
  • The understudy (fallback) — a backup that can step in if the lead fails

The actor’s talent matters, but a brilliant actor with no script, no rehearsal, and no stage manager produces chaos, not art.


Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| evaluator-optimiser | The generate-evaluate-revise loop pattern | complete |
| guardrails | Constraints that limit what an agent can accept, produce, and do | complete |
| human-in-the-loop | Checkpoints where a human reviews and approves before the system continues | complete |
| context-engineering | The practice of curating what the model sees — the information layer the harness manages | complete |
| computational-sensors | Fast, deterministic checks: linters, tests, validators | stub |
| inferential-sensors | AI-powered checks: model-as-judge, semantic analysis | stub |
| bootstrap-scripts | Automated setup that ensures the agent starts in a known-good state | stub |

Some of these cards don't exist yet

They’ll be created as the knowledge system grows. A broken link is a placeholder for future learning, not an error.



Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    AS[Agentic Systems] --> HE[Harness Engineering]
    AS --> CE[Context Engineering]
    AS --> LP[LLM Pipelines]
    AS --> GR[Guardrails]
    HE --> CS[Computational Sensors]
    HE --> IS[Inferential Sensors]
    HE --> BS[Bootstrap Scripts]
    HE -.->|measures with| EO[Evaluator-Optimiser]
    HE -.->|manages| CE
    style HE fill:#4a9ede,color:#fff
```

Related concepts:

  • context-engineering — the information layer that the harness manages; context is what the model sees, while the harness governs how it is curated and validated
  • evaluator-optimiser — the simplest harness loop pattern: generate, evaluate, revise
  • guardrails — the constraint layer within the harness that limits what the agent can do
  • human-in-the-loop — the escalation mechanism when the harness detects situations it cannot resolve autonomously


Footnotes

  1. Böckeler, B. (2026). Harness Engineering for Coding Agent Users. Martin Fowler / Thoughtworks. The most comprehensive treatment of guides, sensors, and harness categories.

  2. Hashimoto, M. (2026). My AI Adoption Journey. mitchellh.com. The blog post that named harness engineering and proposed the Agent = Model + Harness equation.

  3. Anthropic. (2026). Demystifying Evals for AI Agents. Anthropic Engineering. Comprehensive guide to grading methods, metrics, and eval design.

  4. LangChain. (2026). Improving Deep Agents with Harness Engineering. LangChain Blog. Demonstrates a jump from outside Top 30 to Top 5 on Terminal Bench 2.0 by improving only the harness.

  5. Mollick, E. (2026). A Guide to Which AI to Use in the Agentic Era. One Useful Thing. Frames the Model/App/Harness distinction for a general audience.