Harness Engineering
The discipline of building everything around an AI model that makes it reliable — the tools, permissions, tests, feedback loops, guardrails, and verification systems that turn a capable but unpredictable model into a trustworthy agent.
What is it?
A language model on its own is like a brilliant new hire on their first day: full of capability, no idea where the files are, no understanding of your testing process, and no one checking their work. You would not give a new hire unsupervised access to production on day one. You would surround them with onboarding documents, code review, automated tests, and a colleague who can answer questions. That surrounding system — applied to an AI model — is a harness.1
In February 2026, Mitchell Hashimoto (co-founder of HashiCorp) published an account of his AI adoption journey and gave the practice its name. He described a six-stage progression, with stage five labelled “Engineer the Harness.” His core equation was simple:2
Agent = Model + Harness
The model provides the intelligence. The harness makes that intelligence useful, reliable, and safe. It is everything in an AI system except the model itself: the tools it can call, the documents it reads before acting, the tests that verify its output, the guardrails that constrain its actions, and the feedback loops that enable self-correction.2
Birgitta Bockeler, Distinguished Engineer at Thoughtworks, published the most comprehensive treatment of the discipline in April 2026, formalising the concepts of guides and sensors as the two fundamental categories of harness controls.1 The article, hosted on Martin Fowler’s site, became the de facto reference for harness engineering.
In plain terms
A harness is the system you build around an AI model so it can work reliably without constant supervision. If context-engineering is about what the model knows, harness engineering is about what the model can do, how its work is checked, and what happens when something goes wrong.
At a glance
The harness around a model (click to expand)
```mermaid
graph TD
  subgraph Harness
    G[Guides<br/>Feedforward] --> M[Model]
    M --> S[Sensors<br/>Feedback]
    S -->|correction signal| M
    G -.->|docs, specs, rules| G
    S -.->|tests, linters, evals| S
  end
  style M fill:#4a9ede,color:#fff
  style G fill:#5cb85c,color:#fff
  style S fill:#e8b84b,color:#fff
```

Key: Guides steer the model before it acts (feedforward). Sensors observe after it acts (feedback). Together they create a closed loop where the model can self-correct. The model sits at the centre; the harness surrounds it.
How does it work?
Guides — steering before action
Guides are feedforward controls: they shape the model’s behaviour before it generates output. They answer the question “what does the agent need to know and follow before it starts working?”1
| Guide type | What it does | Example |
|---|---|---|
| Documentation | Loads architecture, conventions, and domain knowledge | AGENTS.md, coding standards, API specs |
| Bootstrap scripts | Ensures the agent starts in a known-good state | Running tests to verify the environment, checking out the right branch |
| Constraints | Sets explicit boundaries on what the agent can and cannot do | “Only modify files in /src. Never delete tests.” |
| Task specifications | Defines what “done” looks like before work begins | Acceptance criteria, output format, success metrics |
Hashimoto’s insight was that the same model behaves radically differently depending on what guides surround it. A model with a well-written AGENTS.md file, clear coding conventions, and a bootstrap script that runs the test suite before starting work produces dramatically better results than the same model dropped into a project with no context.2
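As an illustration, several of the guide types above often live together in a single AGENTS.md file. The file name follows the convention mentioned in the table; the contents below are an invented sketch, not a prescribed format:

```markdown
# AGENTS.md (illustrative sketch)

## Conventions
- TypeScript strict mode; every exported function has an explicit return type.

## Bootstrap
- Run `npm test` before starting work to confirm a known-good state.

## Constraints
- Only modify files in /src. Never delete tests.

## Definition of done
- All tests pass and the linter reports no new warnings.
```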
Think of it like...
A GPS navigation system. The GPS (model) knows how to find routes. But the guides — the destination you set, the preference for avoiding tolls, the restriction against U-turns — determine which route it actually takes. Without guides, the GPS might find a route. With guides, it finds your route.
Sensors — checking after action
Sensors are feedback controls: they observe the model’s output and determine whether it meets quality standards. They answer the question “did the agent do what it was supposed to do?”1
Bockeler identifies two categories:
Computational sensors — fast, deterministic, cheap:
- Linters and formatters (does the code follow style rules?)
- Type checkers (does the code compile?)
- Test suites (do the tests pass?)
- Schema validators (does the JSON match the expected shape?)
- Regex checks (does the output contain required fields?)
Inferential sensors — slower, richer judgement, higher cost:
- AI-powered code review (is the logic sound?)
- Semantic similarity checks (does this summary capture the source material?)
- Model-as-judge evaluations (does this output meet the rubric?)
- Anomaly detection (is this output unusually different from previous outputs?)
The most effective harnesses use computational sensors first (they are fast and free) and reserve inferential sensors for checks that require subjective judgement.1
```mermaid
graph LR
  O[Agent Output] --> CS[Computational Sensors]
  CS -->|Pass| IS[Inferential Sensors]
  CS -->|Fail| FB1[Fast Feedback]
  IS -->|Pass| D[Deliver]
  IS -->|Fail| FB2[Rich Feedback]
  FB1 --> R[Revise]
  FB2 --> R
  style CS fill:#5cb85c,color:#fff
  style IS fill:#e8b84b,color:#fff
```
Think of it like...
A car factory’s quality control. Computational sensors are the automated checks on the assembly line — does the door align to within 0.5mm? Does the paint match the colour code? Inferential sensors are the human test drivers who evaluate how the car feels to drive. Both are essential, but you run the cheap automated checks before the expensive human ones.
Custom feedback loops
The most powerful technique in Bockeler’s framework is custom linter messages. Instead of generic error outputs, the harness provides linter messages that include instructions for how to fix the issue. The model reads the feedback, understands what went wrong, and self-corrects without human intervention.1
For example, a standard linter might say:
```
Error: missing return type on function validateEmail
```
A harness-aware linter adds:
```
Error: missing return type on function validateEmail. Add an explicit TypeScript return type annotation. Check the project's type conventions in src/types/README.md.
```
The second message gives the model enough context to fix the issue autonomously. This is how feedback loops become self-correcting loops.
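One way to implement this is a thin wrapper that appends fix instructions to raw linter output from a rule table. The rule name and hint text below mirror the example above but are otherwise invented; no real linter is assumed.

```python
# Illustrative rule table: maps a linter rule ID to a harness-authored fix hint.
FIX_HINTS = {
    "missing-return-type": (
        "Add an explicit TypeScript return type annotation. "
        "Check the project's type conventions in src/types/README.md."
    ),
}


def harness_message(rule: str, raw_message: str) -> str:
    """Augment a raw linter message with fix instructions, if we have any."""
    hint = FIX_HINTS.get(rule)
    return f"{raw_message} {hint}" if hint else raw_message


print(harness_message(
    "missing-return-type",
    "Error: missing return type on function validateEmail.",
))
```

The wrapper degrades gracefully: rules without a registered hint pass through unchanged, so adopting it never makes feedback worse.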
Evaluations — the foundation
Evaluations (evals) are the single most important component of a harness. Without them, guides and sensors have nothing to measure against. An eval is a systematic test that measures whether the system produces correct, useful, safe output.3
The evaluator-optimiser pattern is the simplest implementation: one LLM generates output, another evaluates it against explicit criteria, and the generator revises based on feedback until the eval passes or an iteration limit is reached.
In production harnesses, evals operate at multiple levels:
| Level | What it tests | Frequency |
|---|---|---|
| Step-level | Does this individual output meet criteria? | Every generation |
| Task-level | Does the completed task satisfy the specification? | After each task |
| Session-level | Is the agent’s behaviour consistent and non-degrading? | End of session |
| Regression | Did a system change break something that worked? | After every harness modification |
The hierarchy of importance
If you can only build one thing, build evals. Evals tell you whether your guides are working. Evals tell you whether your sensors are catching real problems. Evals tell you whether a change to the model, the prompt, or the context made things better or worse. Everything else in the harness depends on the ability to measure.3
The proof: harness beats model
The most compelling evidence for harness engineering comes from LangChain’s DeepAgent project. Their coding agent moved from outside the Top 30 to the Top 5 on Terminal Bench 2.0 — an improvement of 13.7 points (from 52.8 to 66.5) — by changing only the harness while keeping the same model (GPT-5.2-Codex). The improvements included self-verification loops, loop detection, context management, and time budgeting.4
The implication is direct: a mediocre model in an excellent harness outperforms an excellent model with no harness. Ethan Mollick observes that “the same model can behave very differently depending on what harness it’s operating in.”5
Why do we use it?
Key reasons
1. Models are capable but unreliable. LLMs can generate excellent output — but also confidently wrong output. A harness catches errors before they reach users or propagate through multi-step workflows.1
2. The harness is where reliability lives. You cannot make AI reliable by finding a better model. You make it reliable by building a system that tests, constrains, and improves the model’s output. The model is the engine; the harness is everything else.2
3. It enables progressive autonomy. Without a harness, you must supervise every step. With a strong harness — particularly strong evals — you can grant the model increasing independence, reserving human-in-the-loop checkpoints for genuinely ambiguous situations.1
4. It makes improvements measurable. When you change a prompt, a context structure, or a model version, the harness’s evals tell you whether the change helped or hurt. Without this, optimisation is guesswork.3
When do we use it?
- When deploying AI that runs without constant human supervision (agents, automated workflows)
- When AI output goes to users or downstream systems where errors have consequences
- When working with multi-step pipelines where errors in early steps compound
- When iterating on prompts or context and needing to measure whether changes actually improve output
- When building long-running agents that must maintain quality across extended sessions
- When the cost of failure is high enough to justify verification infrastructure
Rule of thumb
If you would code-review a human’s work before shipping it, you should harness-check an AI’s work before trusting it. The harness is the code review process for machines.
How can I think about it?
The cockpit analogy
A pilot (model) has the skill to fly. But the cockpit (harness) provides everything that makes flying safe and reliable:
- Checklists (guides) — the pre-flight checks that ensure the plane is in a known-good state before takeoff
- Instruments (computational sensors) — altimeter, airspeed indicator, fuel gauge. Fast, deterministic, always running.
- Radar and weather reports (inferential sensors) — richer, more interpretive information about conditions ahead
- Autopilot constraints (guardrails) — limits on what the autopilot can do without human approval (no landing below visibility minimums)
- Black box (logging) — records everything for post-incident analysis
- Air traffic control (human-in-the-loop) — a human who intervenes for high-stakes decisions
A pilot without a cockpit can still fly. But you would not board that plane.
The theatre analogy
A harness is like a theatre production around a lead actor:
- The script and director’s notes (guides) — what the actor should say and the intent behind each scene
- The rehearsal process (evals) — running the scene repeatedly, checking against quality criteria, refining
- The stage manager (sensors) — monitors timing, cues, lighting, and flags when something is off
- The safety net below the trapeze (guardrails) — limits the consequences of a mistake
- The understudy (fallback) — a backup that can step in if the lead fails
The actor’s talent matters, but a brilliant actor with no script, no rehearsal, and no stage manager produces chaos, not art.
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| evaluator-optimiser | The generate-evaluate-revise loop pattern | complete |
| guardrails | Constraints that limit what an agent can accept, produce, and do | complete |
| human-in-the-loop | Checkpoints where a human reviews and approves before the system continues | complete |
| context-engineering | The practice of curating what the model sees — the information layer the harness manages | complete |
| computational-sensors | Fast, deterministic checks: linters, tests, validators | stub |
| inferential-sensors | AI-powered checks: model-as-judge, semantic analysis | stub |
| bootstrap-scripts | Automated setup that ensures the agent starts in a known-good state | stub |
Some of these cards don’t exist yet
They’ll be created as the knowledge system grows. A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain the equation “Agent = Model + Harness.” Why does the harness matter more than the model for reliability?
- Distinguish between guides and sensors. Give two examples of each for an AI agent that writes marketing copy.
- Name the two categories of sensors (computational and inferential) and explain when you would use each.
- Interpret this scenario: LangChain’s DeepAgent improved 13.7 points on a coding benchmark by changing only the harness, not the model. What does this tell you about where to invest effort when building AI systems?
- Design a minimal harness for an AI agent that processes customer support tickets. Include at least two guides, two sensors (one computational, one inferential), and one guardrail.
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
  AS[Agentic Systems] --> HE[Harness Engineering]
  AS --> CE[Context Engineering]
  AS --> LP[LLM Pipelines]
  AS --> GR[Guardrails]
  HE --> CS[Computational Sensors]
  HE --> IS[Inferential Sensors]
  HE --> BS[Bootstrap Scripts]
  HE -.->|measures with| EO[Evaluator-Optimiser]
  HE -.->|manages| CE
  style HE fill:#4a9ede,color:#fff
```

Related concepts:
- context-engineering — the information layer that the harness manages; context is what the model sees, the harness governs how it is curated and validated
- evaluator-optimiser — the simplest harness loop pattern: generate, evaluate, revise
- guardrails — the constraint layer within the harness that limits what the agent can do
- human-in-the-loop — the escalation mechanism when the harness detects situations it cannot resolve autonomously
Sources
Further reading
Resources
- Harness Engineering for Coding Agent Users (Bockeler / Thoughtworks) — The most comprehensive technical treatment of guides, sensors, and harness design
- My AI Adoption Journey (Hashimoto) — The six-stage journey that named the discipline, from solo prompting to team-scale harness engineering
- The Anatomy of an Agent Harness (LangChain) — How LangChain structures harnesses for production agents
- Better Harness: A Recipe for Harness Hill-Climbing with Evals (LangChain) — Practical guide to iteratively improving a harness using evaluation results
- Harness Engineering: The Missing Layer Behind AI Agents (Bouchard) — Accessible overview that contextualises harness engineering within the prompt-to-harness progression
- Effective Harnesses for Long-Running Agents (Anthropic) — Patterns for long-running agent sessions, including the shift-worker model
Footnotes
1. Bockeler, B. (2026). Harness Engineering for Coding Agent Users. Martin Fowler / Thoughtworks. The most comprehensive treatment of guides, sensors, and harness categories.
2. Hashimoto, M. (2026). My AI Adoption Journey. mitchellh.com. The blog post that named harness engineering and proposed the Agent = Model + Harness equation.
3. Anthropic. (2026). Demystifying Evals for AI Agents. Anthropic Engineering. Comprehensive guide to grading methods, metrics, and eval design.
4. LangChain. (2026). Improving Deep Agents with Harness Engineering. LangChain Blog. Demonstrates a jump from outside the Top 30 to the Top 5 on Terminal Bench 2.0 by improving only the harness.
5. Mollick, E. (2026). A Guide to Which AI to Use in the Agentic Era. One Useful Thing. Frames the Model/App/Harness distinction for a general audience.
