Harness Engineering
The discipline of building everything around an AI model that makes it reliable — the tools, permissions, tests, feedback loops, guardrails, and verification systems that turn a capable but unpredictable model into a trustworthy agent.
What is it?
A language model on its own is like a brilliant new hire on their first day: full of capability, no idea where the files are, no understanding of your testing process, and no one checking their work. You would not give a new hire unsupervised access to production on day one. You would surround them with onboarding documents, code review, automated tests, and a colleague who can answer questions. That surrounding system — applied to an AI model — is a harness.1
In February 2026, Mitchell Hashimoto (co-founder of HashiCorp) published an account of his AI adoption journey and gave the practice its name. He described a six-stage progression, with stage five labelled “Engineer the Harness.” His core equation was simple:2
Agent = Model + Harness
The model provides the intelligence. The harness makes that intelligence useful, reliable, and safe. It is everything in an AI system except the model itself: the tools it can call, the documents it reads before acting, the tests that verify its output, the guardrails that constrain its actions, and the feedback loops that enable self-correction.2
Birgitta Bockeler, Distinguished Engineer at Thoughtworks, published the most comprehensive treatment of the discipline in April 2026, formalising the concepts of guides and sensors as the two fundamental categories of harness controls.1 The article, hosted on Martin Fowler’s site, became the de facto reference for harness engineering.
In plain terms
A harness is the system you build around an AI model so it can work reliably without constant supervision. If context-engineering is about what the model knows, harness engineering is about what the model can do, how its work is checked, and what happens when something goes wrong.
At a glance
The harness around a model (click to expand)
```mermaid
graph TD
  subgraph Harness
    G[Guides<br/>Feedforward] --> M[Model]
    M --> S[Sensors<br/>Feedback]
    S -->|correction signal| M
    G -.->|docs, specs, rules| G
    S -.->|tests, linters, evals| S
  end
  style M fill:#4a9ede,color:#fff
  style G fill:#5cb85c,color:#fff
  style S fill:#e8b84b,color:#fff
```

Key: Guides steer the model before it acts (feedforward). Sensors observe after it acts (feedback). Together they create a closed loop where the model can self-correct. The model sits at the centre; the harness surrounds it.
How does it work?
Guides — steering before action
Guides are feedforward controls: they shape the model’s behaviour before it generates output. They answer the question “what does the agent need to know and follow before it starts working?”1
| Guide type | What it does | Example |
|---|---|---|
| Documentation | Loads architecture, conventions, and domain knowledge | AGENTS.md, coding standards, API specs |
| Bootstrap scripts | Ensures the agent starts in a known-good state | Running tests to verify the environment, checking out the right branch |
| Constraints | Sets explicit boundaries on what the agent can and cannot do | “Only modify files in /src. Never delete tests.” |
| Task specifications | Defines what “done” looks like before work begins | Acceptance criteria, output format, success metrics |
Hashimoto’s insight was that the same model behaves radically differently depending on what guides surround it. A model with a well-written AGENTS.md file, clear coding conventions, and a bootstrap script that runs the test suite before starting work produces dramatically better results than the same model dropped into a project with no context.2
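As an illustration, several of the guide types above often live together in a single AGENTS.md file. The file name follows the convention mentioned in the table; the contents below are an invented sketch, not a prescribed format:

```markdown
# AGENTS.md (illustrative sketch)

## Conventions
- TypeScript strict mode; every exported function has an explicit return type.

## Bootstrap
- Run `npm test` before starting work to confirm a known-good state.

## Constraints
- Only modify files in /src. Never delete tests.

## Definition of done
- All tests pass and the linter reports no new warnings.
```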
Think of it like...
A GPS navigation system. The GPS (model) knows how to find routes. But the guides — the destination you set, the preference for avoiding tolls, the restriction against U-turns — determine which route it actually takes. Without guides, the GPS might find a route. With guides, it finds your route.
Sensors — checking after action
Sensors are feedback controls: they observe the model’s output and determine whether it meets quality standards. They answer the question “did the agent do what it was supposed to do?”1
Bockeler identifies two categories:
Computational sensors — fast, deterministic, cheap:
- Linters and formatters (does the code follow style rules?)
- Type checkers (does the code compile?)
- Test suites (do the tests pass?)
- Schema validators (does the JSON match the expected shape?)
- Regex checks (does the output contain required fields?)
Inferential sensors — slower, richer judgement, higher cost:
- AI-powered code review (is the logic sound?)
- Semantic similarity checks (does this summary capture the source material?)
- Model-as-judge evaluations (does this output meet the rubric?)
- Anomaly detection (is this output unusually different from previous outputs?)
The most effective harnesses use computational sensors first (they are fast and free) and reserve inferential sensors for checks that require subjective judgement.1
```mermaid
graph LR
  O[Agent Output] --> CS[Computational Sensors]
  CS -->|Pass| IS[Inferential Sensors]
  CS -->|Fail| FB1[Fast Feedback]
  IS -->|Pass| D[Deliver]
  IS -->|Fail| FB2[Rich Feedback]
  FB1 --> R[Revise]
  FB2 --> R
  style CS fill:#5cb85c,color:#fff
  style IS fill:#e8b84b,color:#fff
```
Think of it like...
A car factory’s quality control. Computational sensors are the automated checks on the assembly line — does the door align to within 0.5mm? Does the paint match the colour code? Inferential sensors are the human test drivers who evaluate how the car feels to drive. Both are essential, but you run the cheap automated checks before the expensive human ones.
Custom feedback loops
The most powerful technique in Bockeler’s framework is custom linter messages. Instead of generic error outputs, the harness provides linter messages that include instructions for how to fix the issue. The model reads the feedback, understands what went wrong, and self-corrects without human intervention.1
For example, a standard linter might say:
```
Error: missing return type on function validateEmail
```
A harness-aware linter adds:
```
Error: missing return type on function validateEmail. Add an explicit TypeScript return type annotation. Check the project's type conventions in src/types/README.md.
```
The second message gives the model enough context to fix the issue autonomously. This is how feedback loops become self-correcting loops.
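One way to implement this is a thin wrapper that appends fix instructions to raw linter output from a rule table. The rule name and hint text below mirror the example above but are otherwise invented; no real linter is assumed.

```python
# Illustrative rule table: maps a linter rule ID to a harness-authored fix hint.
FIX_HINTS = {
    "missing-return-type": (
        "Add an explicit TypeScript return type annotation. "
        "Check the project's type conventions in src/types/README.md."
    ),
}


def harness_message(rule: str, raw_message: str) -> str:
    """Augment a raw linter message with fix instructions, if we have any."""
    hint = FIX_HINTS.get(rule)
    return f"{raw_message} {hint}" if hint else raw_message


print(harness_message(
    "missing-return-type",
    "Error: missing return type on function validateEmail.",
))
```

The wrapper degrades gracefully: rules without a registered hint pass through unchanged, so adopting it never makes feedback worse.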
Evaluations — the foundation
Evaluations (evals) are the single most important component of a harness. Without them, guides and sensors have nothing to measure against. An eval is a systematic test that measures whether the system produces correct, useful, safe output.3
The evaluator-optimiser pattern is the simplest implementation: one LLM generates output, another evaluates it against explicit criteria, and the generator revises based on feedback until the eval passes or an iteration limit is reached.
In production harnesses, evals operate at multiple levels:
| Level | What it tests | Frequency |
|---|---|---|
| Step-level | Does this individual output meet criteria? | Every generation |
| Task-level | Does the completed task satisfy the specification? | After each task |
| Session-level | Is the agent’s behaviour consistent and non-degrading? | End of session |
| Regression | Did a system change break something that worked? | After every harness modification |
The hierarchy of importance
If you can only build one thing, build evals. Evals tell you whether your guides are working. Evals tell you whether your sensors are catching real problems. Evals tell you whether a change to the model, the prompt, or the context made things better or worse. Everything else in the harness depends on the ability to measure.3
The proof: harness beats model
The most compelling evidence for harness engineering comes from LangChain’s DeepAgent project. Their coding agent moved from outside the Top 30 to the Top 5 on Terminal Bench 2.0 — an improvement of 13.7 points (from 52.8 to 66.5) — by changing only the harness while keeping the same model (GPT-5.2-Codex). The improvements included self-verification loops, loop detection, context management, and time budgeting.4
The implication is direct: a mediocre model in an excellent harness outperforms an excellent model with no harness. Ethan Mollick observes that “the same model can behave very differently depending on what harness it’s operating in.”5
Why do we use it?
Key reasons
1. Models are capable but unreliable. LLMs can generate excellent output — but also confidently wrong output. A harness catches errors before they reach users or propagate through multi-step workflows.1
2. The harness is where reliability lives. You cannot make AI reliable by finding a better model. You make it reliable by building a system that tests, constrains, and improves the model’s output. The model is the engine; the harness is everything else.2
3. It enables progressive autonomy. Without a harness, you must supervise every step. With a strong harness — particularly strong evals — you can grant the model increasing independence, reserving human-in-the-loop checkpoints for genuinely ambiguous situations.1
4. It makes improvements measurable. When you change a prompt, a context structure, or a model version, the harness’s evals tell you whether the change helped or hurt. Without this, optimisation is guesswork.3
When do we use it?
- When deploying AI that runs without constant human supervision (agents, automated workflows)
- When AI output goes to users or downstream systems where errors have consequences
- When working with multi-step pipelines where errors in early steps compound
- When iterating on prompts or context and needing to measure whether changes actually improve output
- When building long-running agents that must maintain quality across extended sessions
- When the cost of failure is high enough to justify verification infrastructure
Rule of thumb
If you would code-review a human’s work before shipping it, you should harness-check an AI’s work before trusting it. The harness is the code review process for machines.
How can I think about it?
The cockpit analogy
A pilot (model) has the skill to fly. But the cockpit (harness) provides everything that makes flying safe and reliable:
- Checklists (guides) — the pre-flight checks that ensure the plane is in a known-good state before takeoff
- Instruments (computational sensors) — altimeter, airspeed indicator, fuel gauge. Fast, deterministic, always running.
- Radar and weather reports (inferential sensors) — richer, more interpretive information about conditions ahead
- Autopilot constraints (guardrails) — limits on what the autopilot can do without human approval (no landing below visibility minimums)
- Black box (logging) — records everything for post-incident analysis
- Air traffic control (human-in-the-loop) — a human who intervenes for high-stakes decisions
A pilot without a cockpit can still fly. But you would not board that plane.
The theatre analogy
A harness is like a theatre production around a lead actor:
- The script and director’s notes (guides) — what the actor should say and the intent behind each scene
- The rehearsal process (evals) — running the scene repeatedly, checking against quality criteria, refining
- The stage manager (sensors) — monitors timing, cues, lighting, and flags when something is off
- The safety net below the trapeze (guardrails) — limits the consequences of a mistake
- The understudy (fallback) — a backup that can step in if the lead fails
The actor’s talent matters, but a brilliant actor with no script, no rehearsal, and no stage manager produces chaos, not art.
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| evaluator-optimiser | The generate-evaluate-revise loop pattern | complete |
| guardrails | Constraints that limit what an agent can accept, produce, and do | complete |
| human-in-the-loop | Checkpoints where a human reviews and approves before the system continues | complete |
| context-engineering | The practice of curating what the model sees — the information layer the harness manages | complete |
| computational-sensors | Fast, deterministic checks: linters, tests, validators | stub |
| inferential-sensors | AI-powered checks: model-as-judge, semantic analysis | stub |
| bootstrap-scripts | Automated setup that ensures the agent starts in a known-good state | stub |
Some of these cards don’t exist yet
They’ll be created as the knowledge system grows. A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain the equation “Agent = Model + Harness.” Why does the harness matter more than the model for reliability?
- Distinguish between guides and sensors. Give two examples of each for an AI agent that writes marketing copy.
- Name the two categories of sensors (computational and inferential) and explain when you would use each.
- Interpret this scenario: LangChain’s DeepAgent improved 13.7 points on a coding benchmark by changing only the harness, not the model. What does this tell you about where to invest effort when building AI systems?
- Design a minimal harness for an AI agent that processes customer support tickets. Include at least two guides, two sensors (one computational, one inferential), and one guardrail.
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
  AS[Agentic Systems] --> HE[Harness Engineering]
  AS --> CE[Context Engineering]
  AS --> LP[LLM Pipelines]
  AS --> GR[Guardrails]
  HE --> CS[Computational Sensors]
  HE --> IS[Inferential Sensors]
  HE --> BS[Bootstrap Scripts]
  HE -.->|measures with| EO[Evaluator-Optimiser]
  HE -.->|manages| CE
  style HE fill:#4a9ede,color:#fff
```

Related concepts:
- context-engineering — the information layer that the harness manages; context is what the model sees, the harness governs how it is curated and validated
- evaluator-optimiser — the simplest harness loop pattern: generate, evaluate, revise
- guardrails — the constraint layer within the harness that limits what the agent can do
- human-in-the-loop — the escalation mechanism when the harness detects situations it cannot resolve autonomously
Sources
Further reading
Resources
- Harness Engineering for Coding Agent Users (Bockeler / Thoughtworks) — The most comprehensive technical treatment of guides, sensors, and harness design
- My AI Adoption Journey (Hashimoto) — The six-stage journey that named the discipline, from solo prompting to team-scale harness engineering
- The Anatomy of an Agent Harness (LangChain) — How LangChain structures harnesses for production agents
- Better Harness: A Recipe for Harness Hill-Climbing with Evals (LangChain) — Practical guide to iteratively improving a harness using evaluation results
- Harness Engineering: The Missing Layer Behind AI Agents (Bouchard) — Accessible overview that contextualises harness engineering within the prompt-to-harness progression
- Effective Harnesses for Long-Running Agents (Anthropic) — Patterns for long-running agent sessions, including the shift-worker model
Footnotes
1. Bockeler, B. (2026). Harness Engineering for Coding Agent Users. Martin Fowler / Thoughtworks. The most comprehensive treatment of guides, sensors, and harness categories.
2. Hashimoto, M. (2026). My AI Adoption Journey. mitchellh.com. The blog post that named harness engineering and proposed the Agent = Model + Harness equation.
3. Anthropic. (2026). Demystifying Evals for AI Agents. Anthropic Engineering. Comprehensive guide to grading methods, metrics, and eval design.
4. LangChain. (2026). Improving Deep Agents with Harness Engineering. LangChain Blog. Demonstrates a jump from outside the Top 30 to the Top 5 on Terminal Bench 2.0 by improving only the harness.
5. Mollick, E. (2026). A Guide to Which AI to Use in the Agentic Era. One Useful Thing. Frames the Model/App/Harness distinction for a general audience.
