How to Talk to AI: From Prompts to Context to Harnesses
Most people think the skill of working with AI is knowing what to type. It is not. The skill is knowing what the model needs to see, how to verify what it produces, and how to build a system that stays reliable when you are not watching.
Who this is for
You have used ChatGPT, Claude, or Copilot. You have typed prompts and received outputs. Sometimes the output was brilliant. Sometimes it was confidently wrong. You want to understand what makes the difference — and how to make the good results consistent.
This path is for you if:
- You use AI tools regularly but feel like results are hit-or-miss
- You want to move from casual prompting to building reliable AI-assisted workflows
- You want to understand the full stack of LLM management, not just “prompt tricks”
What this article is NOT
This is not a list of prompt templates. This is a thinking framework for understanding the three layers of LLM management — and knowing which layer to invest in for your situation.
Part 1 — Intent, not incantation
The most common misconception about working with AI: that there are magic words. That if you phrase your prompt just right — add “think step by step,” say “you are an expert,” use the right XML tags — the model will produce perfect output.
This is backwards. The single most important factor in LLM output quality is not phrasing. It is intent clarity — how precisely you have communicated what you want, why you want it, and what “good” looks like.1
```mermaid
graph LR
    A[Vague intent] -->|any prompt format| B[Unpredictable output]
    C[Clear intent] -->|any prompt format| D[Useful output]
    style B fill:#e74c3c,color:#fff
    style D fill:#5cb85c,color:#fff
```
Anthropic’s prompting guide puts it directly: “Think of Claude as a brilliant but new employee who lacks context about your norms, preferences, and workflows. The more precisely you communicate what you want, the better the output.”1
The test is simple: show your prompt to a colleague who knows nothing about your project. If they would be confused about what you want, the model will be too. Format tricks cannot compensate for unclear intent.
This does not mean format is irrelevant. Structure helps. XML tags reduce misinterpretation. Examples anchor expectations. But these are amplifiers of clear intent, not substitutes for it.
Why this matters for you
Stop searching for the perfect prompt template. Start by writing down — in plain language — exactly what you want, why you want it, what constraints apply, and what the output should look like. That document is the prompt. Everything else is formatting.
Part 2 — Prompt engineering: the instruction layer
prompt-engineering is the practice of crafting the instruction you give to a language model. It is the first layer of LLM management, and despite the hype, what actually matters is surprisingly simple.1
What works
Anthropic, OpenAI, and Google converge on the same fundamentals:12
| Principle | What it means | Why it works |
|---|---|---|
| Be specific | State exactly what you want, not roughly | The model has no way to infer unstated requirements |
| Explain why | Give the motivation behind constraints | “Never use ellipses — this will be read by text-to-speech” is clearer than “never use ellipses” |
| Show examples | 3-5 diverse input/output pairs | Examples anchor format and quality expectations better than any description |
| Assign a role | “You are a senior data analyst” | Focuses tone, vocabulary, and depth |
| Structure the output | Specify format, length, sections | Removes ambiguity about what “a good response” looks like |
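To make the table concrete, here is a minimal Python sketch of a prompt that applies all five principles at once. The task, the data, and the `build_prompt` helper are invented for illustration — this is not any particular library's API.

```python
# A minimal sketch of the five principles combined into one prompt.
# The task and data are illustrative.

ROLE = "You are a senior data analyst."  # assign a role

TASK = (
    "Summarise the quarterly sales figures below in exactly three bullet "
    "points. Never use ellipses: the summary will be read aloud by a "
    "text-to-speech system."  # be specific, and explain why
)

EXAMPLES = """\
Example input: "Q1 revenue $1.2M, up 8% QoQ, churn 3%"
Example output:
- Revenue reached $1.2M in Q1.
- Growth was 8% quarter over quarter.
- Churn held at 3%.
"""  # show examples: they anchor format and quality expectations

OUTPUT_SPEC = "Output: exactly three bullets, each under 20 words."  # structure the output

def build_prompt(data: str) -> str:
    # Static content first, dynamic data and the task last.
    return "\n\n".join([ROLE, EXAMPLES, OUTPUT_SPEC, f"Data:\n{data}", TASK])

print(build_prompt("Q2 revenue $1.4M, up 17% QoQ, churn 2.5%"))
```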
What does not work
- Magic phrases. “Think step by step” helps reasoning models slightly but is not a universal fix. On reasoning-native models like o3 and DeepSeek-R1, chain-of-thought instructions actually degrade output because those models already reason step by step internally.3
- Over-engineering format. Spending hours on XML tag structure when the underlying intent is unclear produces beautifully formatted wrong answers.
- Portable prompts. The same prompt behaves differently across models. Claude responds well to XML tags. GPT-5 responds well to structured persona adoption. Treat prompts as model-specific.3
Prompt chaining
Complex tasks often cannot be handled well in a single prompt. prompt-chaining breaks a task into sequential steps where each output feeds the next.4
```mermaid
graph LR
    A[Step 1: Research] -->|output| B[Step 2: Outline]
    B -->|output| C[Step 3: Draft]
    C -->|output| D[Step 4: Review]
    D -->|feedback| C
    style A fill:#e8b84b,color:#fff
    style D fill:#4a9ede,color:#fff
```
Three patterns:
| Pattern | How it works | When to use |
|---|---|---|
| Linear | Each step feeds the next | Tasks with clear sequential dependencies |
| Branching | Output triggers different paths | Tasks where the next step depends on the result |
| Self-correction loop | Generate, review against criteria, refine | Any task where quality matters |
The self-correction loop is now the most common production pattern: draft something, have the model evaluate it against explicit criteria, then revise based on the evaluation. This catches errors that a single pass misses.
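A minimal sketch of that loop, assuming a hypothetical `call_model` helper standing in for whatever LLM client you use:

```python
# Self-correction loop sketch: draft, evaluate against explicit criteria,
# revise. `call_model` is a placeholder -- swap in your real LLM client.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

CRITERIA = "Under 100 words, factually grounded in the source, no jargon."

def self_correcting_summary(text: str, max_rounds: int = 3) -> str:
    draft = call_model(f"Summarise:\n{text}")
    for _ in range(max_rounds):
        review = call_model(
            f"Evaluate this summary against the criteria.\n"
            f"Criteria: {CRITERIA}\nSummary: {draft}\n"
            f"Reply PASS if all criteria are met, otherwise list the problems."
        )
        if review.strip().startswith("PASS"):
            break
        # Feed the critique back in: this catches errors a single pass misses.
        draft = call_model(
            f"Revise the summary to fix these problems.\n"
            f"Problems: {review}\nSummary: {draft}"
        )
    return draft
```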
Chunking
Chunking is different from chaining. Where chaining breaks tasks into steps, chunking breaks inputs into manageable pieces. A 50-page document exceeds what a model can process well in one pass — splitting it into sections and processing each individually produces better results than dumping everything in at once.5
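A minimal chunking sketch, again assuming a hypothetical `call_model` helper; the paragraph-based splitter and the character budget are illustrative choices, not fixed rules:

```python
# Chunking sketch: split a long document on paragraph boundaries, process
# each piece, then combine. `call_model` is a placeholder.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def chunk_paragraphs(text: str, max_chars: int = 8000) -> list[str]:
    """Greedily pack whole paragraphs into chunks below the size limit."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def summarise_document(text: str) -> str:
    # Process each chunk individually, then merge the partial results.
    partials = [call_model(f"Summarise this section:\n{chunk}")
                for chunk in chunk_paragraphs(text)]
    return call_model("Combine these section summaries:\n" + "\n".join(partials))
```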
The prompt engineering principle
Prompt engineering is necessary but not sufficient. It governs what you ask the model. But the quality of the answer depends at least as much on what the model knows when it answers. That is context engineering.
Part 3 — Context engineering: the information layer
In June 2025, Tobi Lütke, CEO of Shopify, proposed a reframe: “I really like the term ‘context engineering’ over ‘prompt engineering.’ It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”6
Andrej Karpathy agreed: “People associate prompts with short task descriptions. In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step.”7
context-engineering is the practice of curating and structuring all the information a model can see when it generates a response — not just your instruction, but the system prompt, the retrieved documents, the conversation history, the tool definitions, and the memory from prior sessions.5
Why context beats prompting
The same prompt produces radically different outputs depending on what context surrounds it. Ask “write a function to validate email” with no context and you get a generic regex. Ask it with your codebase’s existing validation patterns, your error handling conventions, and your test framework loaded into context, and you get code that fits your project.
```mermaid
graph TD
    subgraph Context Window
        SP[System Prompt] --> TI[Task Instruction]
        RD[Retrieved Documents] --> TI
        CH[Conversation History] --> TI
        TD[Tool Definitions] --> TI
        MM[Memory and State] --> TI
    end
    TI --> O[Model Output]
    style TI fill:#4a9ede,color:#fff
```
What goes in, and in what order
Anthropic’s research shows that placement matters:5
- Put reference material first, query last. Queries placed at the end of context improve response quality by up to 30%.
- Static content before dynamic content. This also enables prompt caching, reducing costs by up to 90%.
- Less is more. Every token in context competes for the model’s attention. Irrelevant information degrades output — a phenomenon researchers call context rot.8
Stanford’s “Lost in the Middle” research confirms the risk: LLMs exhibit a U-shaped attention curve, with highest accuracy when relevant information is at the beginning or end of context, and 30%+ degradation when critical information sits in the middle.9
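Put together, the placement rules might look like the sketch below. The tag names and the ten-turn history cutoff are illustrative, not a prescribed format:

```python
# Context assembly sketch following the placement rules above:
# static reference material first (cache-friendly), dynamic history in
# the middle, the query last.

def assemble_context(system_prompt: str,
                     documents: list[str],
                     history: list[str],
                     query: str) -> str:
    parts = [
        system_prompt,            # static content first: enables prompt caching
        "<documents>",
        *documents,               # reference material before the query
        "</documents>",
        "<history>",
        *history[-10:],           # recent turns only: less is more
        "</history>",
        f"Question: {query}",     # query last: end-of-context placement improves quality
    ]
    return "\n".join(parts)
```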
Structuring context for quality
The key insight from Anthropic’s context engineering guide: treat context as a precious, finite resource and find “the smallest set of high-signal tokens that maximise the likelihood of your desired outcome.”5
Practical strategies:
| Strategy | What it does |
|---|---|
| Compaction | Summarise conversation history while preserving key decisions |
| Structured note-taking | Agents write notes persisted outside the context, retrievable later |
| The AGENTS.md pattern | Preload architectural context; use tools for just-in-time retrieval |
| Sub-agent isolation | Specialised agents handle focused tasks with clean contexts, returning distilled summaries |
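As an example of the first strategy, here is a minimal compaction sketch, assuming a hypothetical `call_model` helper; the five-turn cutoff and character budget are arbitrary illustrative values:

```python
# Compaction sketch: when history grows past a budget, replace older turns
# with a model-written summary that preserves key decisions.
# `call_model` is a placeholder.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def compact_history(history: list[str], budget_chars: int = 20000) -> list[str]:
    if len(history) <= 5 or sum(len(turn) for turn in history) <= budget_chars:
        return history
    old, recent = history[:-5], history[-5:]  # keep the latest turns verbatim
    summary = call_model(
        "Summarise this conversation, preserving every decision, open "
        "question, and constraint:\n" + "\n".join(old)
    )
    return [f"[Summary of earlier conversation]\n{summary}", *recent]
```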
Avoiding slop
“Slop” — generic, verbose, low-signal output — is almost always a context problem, not a model problem. The model produces slop when it lacks the specific context needed to produce something precise. The fix is not a better prompt. The fix is better context: real examples, specific constraints, domain knowledge, and a clear definition of what “good” looks like.
context-cascading formalises this: instructions are organised in layers from broad to specific — global rules, then domain context, then task instructions, then output format — loaded in sequence so each layer narrows the scope of the next.
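A minimal sketch of such a cascade; the four layers and their contents are invented for illustration:

```python
# Context cascading sketch: layers loaded broad to specific, each
# narrowing the scope of the next. Layer contents are illustrative.

GLOBAL_RULES = "Write British English. Never invent figures."              # layer 1: global rules
DOMAIN = "Domain: B2B SaaS billing. Revenue is reported net of refunds."   # layer 2: domain context
TASK = "Task: explain the Q3 churn spike to a non-technical audience."     # layer 3: task instruction
OUTPUT = "Output: one paragraph, under 120 words."                         # layer 4: output format

def cascade(*layers: str) -> str:
    # Broad rules first, specific instructions last.
    return "\n\n".join(layers)

prompt = cascade(GLOBAL_RULES, DOMAIN, TASK, OUTPUT)
```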
The context engineering principle
The quality of an LLM’s output is bounded by the quality of its context. A mediocre prompt with excellent context beats an excellent prompt with poor context — every time.
Part 4 — Harness engineering: the system layer
In February 2026, Mitchell Hashimoto (co-founder of HashiCorp) published an account of his AI adoption journey and gave a name to a practice that production teams had been developing independently: harness engineering.10
The core equation: Agent = Model + Harness.
The harness is everything in an AI system except the model itself: the tools, the permissions, the state management, the testing, the logging, the guardrails, the feedback loops, and the verification systems that make the model’s output reliable.11
```mermaid
graph TD
    subgraph Harness
        G[Guides] --> M[Model]
        M --> S[Sensors]
        S -->|feedback| M
        G -.->|AGENTS.md, specs, rules| G
        S -.->|tests, linters, evals| S
    end
    style M fill:#4a9ede,color:#fff
    style G fill:#5cb85c,color:#fff
    style S fill:#e8b84b,color:#fff
```
Why the harness matters more than the model
LangChain’s DeepAgent moved from outside the Top 30 to the Top 5 on Terminal Bench 2.0 — by changing only the harness while keeping the same model (GPT-5.2-Codex). Key improvements: self-verification loops, loop detection, context management, and time budgeting.12
The insight: a mediocre model in an excellent harness outperforms an excellent model with no harness. This is why Ethan Mollick observes that “the same model can behave very differently depending on what harness it’s operating in.”13
Guides and sensors
Birgitta Böckeler, Distinguished Engineer at Thoughtworks, published the most comprehensive treatment of harness engineering, proposing two categories of controls:11
Guides (feedforward controls) steer the agent before it acts:
- Documentation — AGENTS.md files, architecture specs, coding standards that the agent reads before starting work
- Bootstrap scripts — automated setup that ensures the agent starts in a known-good state
- Constraints — explicit boundaries on what the agent can and cannot modify
Sensors (feedback controls) observe after the agent acts and enable self-correction:
- Computational sensors — tests, linters, type checkers. Fast, deterministic, and cheap.
- Inferential sensors — AI-powered code review, semantic analysis. Slower but capable of richer judgement.
- Custom feedback — linter messages that include instructions for self-correction, so the agent can fix its own mistakes.
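As a concrete example of a computational sensor with custom feedback, here is a minimal sketch that runs pytest after the agent edits code and turns failures into a self-correction instruction. The feedback wording is illustrative:

```python
# Computational sensor sketch: run the test suite, capture the result,
# and wrap failures in an instruction the agent can act on.

import subprocess

def run_tests() -> tuple[bool, str]:
    """Fast, deterministic, cheap: exit code plus captured output."""
    result = subprocess.run(
        ["pytest", "-q", "--maxfail=5"],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def sensor_feedback() -> str | None:
    ok, output = run_tests()
    if ok:
        return None  # nothing to correct
    # Include instructions for self-correction, not just the raw failure.
    return (
        "The test suite failed. Fix the failures before doing anything else. "
        f"Test output:\n{output}"
    )
```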
Think of it like...
A harness is to an AI agent what a cockpit is to a pilot. The pilot (model) has the skill to fly, but the cockpit provides the instruments (sensors), the checklists (guides), the autopilot constraints (guardrails), and the black box (logging) that make flying safe and reliable. A pilot without a cockpit can still fly — but you would not board that plane.
Evaluations: the foundation of harness engineering
evaluations (evals) are systematic tests that measure whether an LLM system produces correct, useful, safe output. They are the single most important component of a harness because without them, you cannot know whether anything else is working.14
| Eval type | What it tests | Example |
|---|---|---|
| Correctness | Does the output match expected answers? | Compare against gold-standard test cases |
| Consistency | Does the same input produce similar outputs? | Run the same prompt 10 times and measure variance |
| Safety | Does the output respect boundaries? | Test for hallucination, data leakage, harmful content |
| Regression | Did a change break something that worked before? | Run the full eval suite after every system change |
Anthropic’s guidance on long-running agents emphasises that effective harnesses “build verification into the workflow itself” rather than checking outputs only at the end.15
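A minimal correctness-and-regression eval might look like the sketch below, assuming a hypothetical `call_model` helper. The gold cases and the substring grading rule are deliberately simple illustrations; real evals usually grade more carefully:

```python
# Eval sketch: gold-standard cases run after every system change.
# `call_model` is a placeholder.

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

GOLD_CASES = [
    {"input": "What is our return window?", "must_contain": "30 days"},
    {"input": "Do you ship to Canada?",     "must_contain": "yes"},
]

def run_evals() -> float:
    passed = 0
    for case in GOLD_CASES:
        output = call_model(case["input"]).lower()
        if case["must_contain"] in output:
            passed += 1
        else:
            print(f"FAIL: {case['input']!r} -> {output[:80]!r}")
    score = passed / len(GOLD_CASES)
    print(f"Eval score: {score:.0%}")
    return score  # gate deployments on this score in CI
```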
Guardrails
guardrails are constraints that prevent the agent from taking actions outside its intended scope. They operate at multiple levels:
- Input guardrails — validate and sanitise what goes into the model
- Output guardrails — check what comes out before it reaches the user or executes an action
- Action guardrails — restrict what tools the agent can use, what files it can modify, what APIs it can call
- Escalation guardrails — human-in-the-loop checkpoints for high-stakes or ambiguous situations
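A minimal sketch combining an output guardrail with an escalation guardrail, for a customer-support bot; the policy list and the escalation hook are invented for illustration:

```python
# Guardrail sketch: check the model's answer before it reaches the user,
# and escalate to a human when the check fails.

KNOWN_POLICIES = ["30-day return window", "free exchanges", "store credit"]

def output_guardrail(answer: str) -> str:
    # Output guardrail: any claim about returns must quote a real policy,
    # blocking fabricated ones before they reach the user.
    if "return" in answer.lower() and not any(
        policy in answer.lower() for policy in KNOWN_POLICIES
    ):
        return escalate_to_human(answer)
    return answer

def escalate_to_human(answer: str) -> str:
    # Escalation guardrail: hold the reply for review instead of sending it.
    print(f"[escalation] held for review: {answer[:80]!r}")
    return "Let me check that with a colleague and get back to you."
```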
The harness engineering principle
You do not make AI reliable by finding a better model. You make it reliable by building a system that catches errors, enforces boundaries, and improves with every failure. The model is the engine. The harness is everything else.
Part 5 — The three layers in practice
Prompt engineering, context engineering, and harness engineering are not competing approaches. They are concentric layers, each encompassing the one before:16
```mermaid
graph TD
    subgraph Harness Engineering
        subgraph Context Engineering
            subgraph Prompt Engineering
                PE[What you ask]
            end
            CE[What the model sees]
        end
        HE[How the system works]
    end
    style PE fill:#e8b84b,color:#fff
    style CE fill:#4a9ede,color:#fff
    style HE fill:#5cb85c,color:#fff
```
| Layer | Core question | What you build | When to invest |
|---|---|---|---|
| Prompt | How do I ask? | Instructions, examples, role definitions | Always — this is the foundation |
| Context | What does the model know? | Context structures, retrieval, memory, cascading | When prompts alone produce inconsistent results |
| Harness | How does the system work? | Evals, guardrails, sensors, guides, observability | When you ship to production or run agents autonomously |
The progression maps to maturity:
- Exploring — you prompt directly in a chat interface. Prompt engineering is enough.
- Building — you integrate AI into a workflow. Context engineering becomes essential.
- Shipping — you deploy AI that runs without you watching. Harness engineering is mandatory.
In plain terms
Prompt engineering is writing a good email. Context engineering is making sure the recipient has all the background documents they need. Harness engineering is building the entire office system — email, filing, review process, quality checks — that ensures the work gets done correctly even when you are not in the room.
Part 6 — The spectrum: chatting to shipping
Not every use of AI requires all three layers. The investment should match the stakes:
```mermaid
graph LR
    A[Chat] -->|add structure| B[Workflow]
    B -->|add verification| C[Product]
    C -->|add autonomy| D[Agent System]
    style A fill:#e8b84b,color:#fff
    style B fill:#4a9ede,color:#fff
    style C fill:#5cb85c,color:#fff
    style D fill:#9b59b6,color:#fff
```
| Level | Example | Layers needed |
|---|---|---|
| Chat | Asking Claude to explain a concept | Prompt only |
| Workflow | Using AI to draft, then reviewing yourself | Prompt + basic context |
| Product | AI feature in an app that users interact with | Prompt + context + evals |
| Agent system | AI that acts autonomously across sessions | All three layers, fully |
The key question
Ask yourself: “If this AI produces wrong output and I don’t notice for an hour, what happens?” If the answer is “nothing serious,” prompt engineering is enough. If the answer makes you nervous, you need a harness.
Part 7 — The map so far
```mermaid
graph TD
    LLM[LLM Engineering] --> PE[Prompt Engineering]
    PE --> IC[Intent Clarity]
    PE --> PC[Prompt Chaining]
    PE --> CH[Chunking]
    PE --> EX[Examples and Role]
    LLM --> CE[Context Engineering]
    CE --> CC[Context Cascading]
    CE --> GR[Granularity]
    CE --> RT[Retrieval and RAG]
    CE --> MM[Memory and State]
    LLM --> HE[Harness Engineering]
    HE --> EV[Evaluations]
    HE --> GU[Guardrails]
    HE --> GD[Guides - Feedforward]
    HE --> SN[Sensors - Feedback]
    HE --> OB[Observability]
    style LLM fill:#4a9ede,color:#fff
```
What you now understand
Mental models you have gained
- Intent over incantation — clarity of what you want matters more than how you phrase it
- Prompt engineering — the instruction layer: be specific, explain why, show examples, chain complex tasks
- Context engineering — the information layer: curate what the model sees; less is more; structure and placement matter
- Harness engineering — the system layer: Agent = Model + Harness; guides steer before, sensors correct after
- The three layers are concentric — each encompasses the one before; invest in the layer that matches your stakes
- Evaluations are non-negotiable — without evals, you cannot know if anything else works
- The harness beats the model — a good system around a mediocre model outperforms a great model with no system
Check your understanding
Test yourself before moving on:
- Explain the difference between prompt engineering, context engineering, and harness engineering. What question does each layer answer?
- Describe three strategies for improving context quality and explain why context placement matters.
- Distinguish between guides and sensors in a harness. Give a concrete example of each for an AI coding assistant.
- Interpret this scenario: an AI-powered customer support bot gives correct answers 80% of the time but occasionally fabricates return policies. The team’s instinct is to rewrite the prompt. Using what you have learned, identify which layer is most likely the problem and propose a solution.
- Design a simple harness for an AI tool that summarises meeting notes. Identify at least two guides, two sensors, and one escalation guardrail.
Where to go next
I want to understand how LLMs actually work inside
This path covers how to use LLMs. If you want to understand the machinery — tokenization, embeddings, next-token prediction, parameters, training, and the transformer architecture — read inside-the-machine.
Best for: People who want accurate mental models of the technology they are working with.
I want to understand the AI architecture underneath
This path covers how to work with LLMs. If you want to understand how agentic systems are structured — routing, orchestration, knowledge, and pipelines — read agentic-design.
Best for: People building AI systems, not just using AI tools.
I want to understand the software fundamentals
LLM engineering sits on top of software architecture. If you have not read it yet, from-zero-to-building covers the base layer: frontend, backend, APIs, databases, and the document chain from intent to code.
Best for: People who want the full stack picture.
I want to build something now
Take these principles and apply them. Start a project through the learning pipeline, define your intent, and the system will match relevant concepts and generate a learning path tailored to your project.
Best for: People who learn by building.
Sources
Further reading
Resources
- Effective Context Engineering for AI Agents (Anthropic) — The definitive guide to context curation, compaction, memory, and sub-agent architectures
- Harness Engineering for Coding Agent Users (Böckeler / Thoughtworks) — The most comprehensive technical treatment of guides, sensors, and harness design
- Effective Harnesses for Long-Running Agents (Anthropic) — How to build verification into autonomous agent workflows
- Prompting Best Practices (Anthropic) — The official guide to prompting Claude effectively
- Context Engineering (Simon Willison) — Why “context engineering” is a better frame than “prompt engineering” and what the shift means in practice
- A Guide to Which AI to Use in the Agentic Era (Mollick) — Ethan Mollick’s accessible framework for understanding models, apps, and harnesses
Footnotes
1. Anthropic. (2026). Prompting Best Practices. Anthropic. The official prompting guide covering clarity, motivation, examples, and structure.
2. OpenAI. (2026). Prompt Engineering Guide. OpenAI. Official best practices for GPT models.
3. IBM. (2026). The 2026 Guide to Prompt Engineering. IBM. Covers model-specific prompting and the death of portable prompts.
4. Maxim AI. (2026). Prompt Chaining for AI Engineers. Maxim AI. Practical guide to linear, branching, and self-correction chain patterns.
5. Anthropic. (2025). Effective Context Engineering for AI Agents. Anthropic. The canonical reference on context curation, compaction, and sub-agent architectures.
6. Lütke, T. (2025). Tweet on context engineering. June 19, 2025. The tweet that popularised “context engineering” as a term.
7. Karpathy, A. (2025). Tweet on context engineering. June 2025. Endorsed the term and distinguished it from casual prompting.
8. Chroma Research. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma. Empirical evidence that every frontier model degrades as context length increases.
9. Liu, N.F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 2024 / Stanford. Demonstrates the U-shaped attention curve in long-context LLMs.
10. Hashimoto, M. (2026). My AI Adoption Journey. mitchellh.com. The blog post that named harness engineering as a discipline.
11. Böckeler, B. (2026). Harness Engineering for Coding Agent Users. Martin Fowler / Thoughtworks. The most comprehensive technical treatment of guides, sensors, and harness categories.
12. LangChain. (2026). Improving Deep Agents with Harness Engineering. LangChain Blog. Demonstrates a jump from outside the Top 30 to the Top 5 on Terminal Bench by improving only the harness.
13. Mollick, E. (2026). A Guide to Which AI to Use in the Agentic Era. One Useful Thing. Frames the Model/App/Harness distinction for a general audience.
14. arXiv. (2026). LLM Readiness Harness: Evaluation, Observability, and CI Gates. arXiv. Technical framework for evaluation harnesses in production.
15. Anthropic. (2025). Effective Harnesses for Long-Running Agents. Anthropic. Guidance on building verification into agent workflows.
16. Bouchard, L-F. (2026). Harness Engineering: The Missing Layer Behind AI Agents. louisbouchard.ai. Contextualises harness engineering within the prompt-to-harness progression.