How to Talk to AI: From Prompts to Context to Harnesses

Most people think the skill of working with AI is knowing what to type. It is not. The skill is knowing what the model needs to see, how to verify what it produces, and how to build a system that stays reliable when you are not watching.


Who this is for

You have used ChatGPT, Claude, or Copilot. You have typed prompts and received outputs. Sometimes the output was brilliant. Sometimes it was confidently wrong. You want to understand why results differ — and how to make the good ones consistent.

This path is for you if:

  • You use AI tools regularly but feel like results are hit-or-miss
  • You want to move from casual prompting to building reliable AI-assisted workflows
  • You want to understand the full stack of LLM management, not just “prompt tricks”

What this article is NOT

This is not a list of prompt templates. This is a thinking framework for understanding the three layers of LLM management — and knowing which layer to invest in for your situation.


Part 1 — Intent, not incantation

The most common misconception about working with AI: that there are magic words. That if you phrase your prompt just right — add “think step by step,” say “you are an expert,” use the right XML tags — the model will produce perfect output.

This is backwards. The single most important factor in LLM output quality is not phrasing. It is intent clarity — how precisely you have communicated what you want, why you want it, and what “good” looks like.1

graph LR
    A[Vague intent] -->|any prompt format| B[Unpredictable output]
    C[Clear intent] -->|any prompt format| D[Useful output]

    style B fill:#e74c3c,color:#fff
    style D fill:#5cb85c,color:#fff

Anthropic’s prompting guide puts it directly: “Think of Claude as a brilliant but new employee who lacks context about your norms, preferences, and workflows. The more precisely you communicate what you want, the better the output.”1

The test is simple: show your prompt to a colleague who knows nothing about your project. If they would be confused about what you want, the model will be too. Format tricks cannot compensate for unclear intent.

This does not mean format is irrelevant. Structure helps. XML tags reduce misinterpretation. Examples anchor expectations. But these are amplifiers of clear intent, not substitutes for it.

Why this matters for you

Stop searching for the perfect prompt template. Start by writing down — in plain language — exactly what you want, why you want it, what constraints apply, and what the output should look like. That document is the prompt. Everything else is formatting.


Part 2 — Prompt engineering: the instruction layer

prompt-engineering is the practice of crafting the instruction you give to a language model. It is the first layer of LLM management, and despite the hype, what actually matters is surprisingly simple.1

What works

Anthropic, OpenAI, and Google converge on the same fundamentals:1 2

| Principle | What it means | Why it works |
| --- | --- | --- |
| Be specific | State exactly what you want, not roughly | The model has no way to infer unstated requirements |
| Explain why | Give the motivation behind constraints | “Never use ellipses — this will be read by text-to-speech” is clearer than “never use ellipses” |
| Show examples | 3-5 diverse input/output pairs | Examples anchor format and quality expectations better than any description |
| Assign a role | “You are a senior data analyst” | Focuses tone, vocabulary, and depth |
| Structure the output | Specify format, length, sections | Removes ambiguity about what “a good response” looks like |
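
Applied together, these principles are mostly just careful string assembly. A minimal sketch, where the role, task, examples, and format strings are all hypothetical illustrations:

```python
def build_prompt(role, task, why, examples, output_format):
    """Compose role, task, motivation, examples, and format into one instruction."""
    example_text = "\n".join(
        f"Input: {inp}\nOutput: {out}" for inp, out in examples
    )
    return (
        f"You are {role}.\n\n"
        f"Task: {task}\n"
        f"Why: {why}\n\n"
        f"Examples:\n{example_text}\n\n"
        f"Output format: {output_format}"
    )

# Hypothetical usage: every field is an illustration, not a template to copy.
prompt = build_prompt(
    role="a senior data analyst",
    task="Summarise the quarterly sales figures in three bullet points.",
    why="The summary is read aloud in a stand-up, so keep sentences short.",
    examples=[("Q1: revenue up 4%", "- Revenue grew 4% quarter on quarter")],
    output_format="Exactly three bullet points, each under 20 words.",
)
```

The point is not the helper function; it is that every principle in the table becomes a concrete, inspectable piece of text.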

What does not work

  • Magic phrases. “Think step by step” can nudge some models slightly but is not a universal fix. On reasoning-native models like o3 and DeepSeek-R1, explicit chain-of-thought instructions actually degrade output because those models already reason internally.3
  • Over-engineering format. Spending hours on XML tag structure when the underlying intent is unclear produces beautifully formatted wrong answers.
  • Portable prompts. The same prompt behaves differently across models. Claude responds well to XML tags. GPT-5 responds well to structured persona adoption. Treat prompts as model-specific.3

Prompt chaining

Complex tasks often cannot be handled well in a single prompt. prompt-chaining breaks a task into sequential steps where each output feeds the next.4

graph LR
    A[Step 1: Research] -->|output| B[Step 2: Outline]
    B -->|output| C[Step 3: Draft]
    C -->|output| D[Step 4: Review]
    D -->|feedback| C

    style A fill:#e8b84b,color:#fff
    style D fill:#4a9ede,color:#fff

Three patterns:

| Pattern | How it works | When to use |
| --- | --- | --- |
| Linear | Each step feeds the next | Tasks with clear sequential dependencies |
| Branching | Output triggers different paths | Tasks where the next step depends on the result |
| Self-correction loop | Generate, review against criteria, refine | Any task where quality matters |

The self-correction loop is now the most common production pattern: draft something, have the model evaluate it against explicit criteria, then revise based on the evaluation. This catches errors that a single pass misses.

Chunking

Chunking is different from chaining. Where chaining breaks tasks into steps, chunking breaks inputs into manageable pieces. A 50-page document exceeds what a model can process well in one pass — splitting it into sections and processing each individually produces better results than dumping everything in at once.5
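
A paragraph-based chunker is a reasonable starting point. This sketch packs paragraphs greedily under an arbitrary character budget; real systems often chunk by tokens instead:

```python
def chunk_by_paragraph(text, max_chars=2000):
    """Greedily pack paragraphs into chunks that stay under the size budget."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Hypothetical 10-paragraph document, each paragraph ~400 characters.
doc = "\n\n".join(f"Paragraph {i} " + "x" * 400 for i in range(10))
pieces = chunk_by_paragraph(doc)
```

Each piece can then be processed by its own prompt, with the results merged in a final pass.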

The prompt engineering principle

Prompt engineering is necessary but not sufficient. It governs what you ask the model. But the quality of the answer depends at least as much on what the model knows when it answers. That is context engineering.


Part 3 — Context engineering: the information layer

In June 2025, Tobi Lütke, CEO of Shopify, proposed a reframe: “I really like the term ‘context engineering’ over ‘prompt engineering.’ It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”6

Andrej Karpathy agreed: “People associate prompts with short task descriptions. In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step.”7

context-engineering is the practice of curating and structuring all the information a model can see when it generates a response — not just your instruction, but the system prompt, the retrieved documents, the conversation history, the tool definitions, and the memory from prior sessions.5

Why context beats prompting

The same prompt produces radically different outputs depending on what context surrounds it. Ask “write a function to validate email” with no context and you get a generic regex. Ask it with your codebase’s existing validation patterns, your error handling conventions, and your test framework loaded into context, and you get code that fits your project.

graph TD
    subgraph Context Window
        SP[System Prompt] --> TI[Task Instruction]
        RD[Retrieved Documents] --> TI
        CH[Conversation History] --> TI
        TD[Tool Definitions] --> TI
        MM[Memory and State] --> TI
    end
    TI --> O[Model Output]

    style TI fill:#4a9ede,color:#fff

What goes in, and in what order

Anthropic’s research shows that placement matters:5

  1. Put reference material first, query last. Queries placed at the end of context improve response quality by up to 30%.
  2. Static content before dynamic content. This also enables prompt caching, reducing costs by up to 90%.
  3. Less is more. Every token in context competes for the model’s attention. Irrelevant information degrades output — a phenomenon researchers call context rot.8

Stanford’s “Lost in the Middle” research confirms the risk: LLMs exhibit a U-shaped attention curve, with highest accuracy when relevant information is at the beginning or end of context, and 30%+ degradation when critical information sits in the middle.9
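
These ordering rules translate directly into how the final context string is assembled. A sketch, with illustrative section labels:

```python
def assemble_context(system, documents, history, query):
    """Order context: static material first, dynamic material next, query last."""
    return "\n\n".join([
        system,                                               # static: cache-friendly prefix
        "Reference material:\n" + "\n---\n".join(documents),  # static
        "Conversation so far:\n" + history,                   # dynamic
        "Current question:\n" + query,                        # the query goes last
    ])

# Hypothetical contents, for illustration only.
ctx = assemble_context(
    system="You answer using only the reference material.",
    documents=["Doc A: refunds take 5 days.", "Doc B: refunds need a receipt."],
    history="User asked about the returns policy earlier.",
    query="How long do refunds take?",
)
```

Because the static prefix never changes between requests, it is exactly the part a provider's prompt cache can reuse.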

Structuring context for quality

The key insight from Anthropic’s context engineering guide: treat context as a precious, finite resource and find “the smallest set of high-signal tokens that maximise the likelihood of your desired outcome.”5

Practical strategies:

| Strategy | What it does |
| --- | --- |
| Compaction | Summarise conversation history while preserving key decisions |
| Structured note-taking | Agents write notes persisted outside the context, retrievable later |
| The AGENTS.md pattern | Preload architectural context; use tools for just-in-time retrieval |
| Sub-agent isolation | Specialised agents handle focused tasks with clean contexts, returning distilled summaries |
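
Compaction, the first strategy above, can be sketched crudely: summarise older turns but keep decision lines verbatim. A real system would have the model write the summary; here a placeholder label stands in, and the `DECISION:` marker is a hypothetical convention:

```python
def compact(history, keep_last=2):
    """Replace all but the last `keep_last` turns with a summary line,
    keeping any turn marked as a decision verbatim."""
    old, recent = history[:-keep_last], history[-keep_last:]
    decisions = [turn for turn in old if turn.startswith("DECISION:")]
    summary = f"[Summary of {len(old)} earlier turns]"
    return [summary] + decisions + recent

# Hypothetical conversation history.
history = [
    "User: which database should we use?",
    "DECISION: use Postgres for the main store",
    "User: how do indexes work?",
    "Assistant: B-tree indexes are the default...",
    "User: now draft the migration plan",
]
compacted = compact(history)
```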

Avoiding slop

“Slop” — generic, verbose, low-signal output — is almost always a context problem, not a model problem. The model produces slop when it lacks the specific context needed to produce something precise. The fix is not a better prompt. The fix is better context: real examples, specific constraints, domain knowledge, and a clear definition of what “good” looks like.

context-cascading formalises this: instructions are organised in layers from broad to specific — global rules, then domain context, then task instructions, then output format — loaded in sequence so each layer narrows the scope of the next.
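
A minimal sketch of cascading, with hypothetical layer names and contents, each loaded broad-to-specific:

```python
# Layers ordered from broad scope to narrow scope; all contents are illustrative.
LAYERS = [
    ("global rules", "Write in British English. Never invent sources."),
    ("domain context", "We build payment infrastructure; PCI-DSS applies."),
    ("task instruction", "Draft a post-mortem for yesterday's checkout outage."),
    ("output format", "Markdown, with sections: Summary, Timeline, Actions."),
]

def cascade(layers):
    """Concatenate layers in order so each narrows the scope of the next."""
    return "\n\n".join(f"[{name}]\n{content}" for name, content in layers)

prompt = cascade(LAYERS)
```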

The context engineering principle

The quality of an LLM’s output is bounded by the quality of its context. A mediocre prompt with excellent context beats an excellent prompt with poor context — every time.


Part 4 — Harness engineering: the system layer

In February 2026, Mitchell Hashimoto (co-founder of HashiCorp) published an account of his AI adoption journey and gave a name to a practice that production teams had been developing independently: harness engineering.10

The core equation: Agent = Model + Harness.

The harness is everything in an AI system except the model itself: the tools, the permissions, the state management, the testing, the logging, the guardrails, the feedback loops, and the verification systems that make the model’s output reliable.11

graph TD
    subgraph Harness
        G[Guides] --> M[Model]
        M --> S[Sensors]
        S -->|feedback| M
        G -.->|AGENTS.md, specs, rules| G
        S -.->|tests, linters, evals| S
    end

    style M fill:#4a9ede,color:#fff
    style G fill:#5cb85c,color:#fff
    style S fill:#e8b84b,color:#fff

Why the harness matters more than the model

LangChain’s DeepAgent moved from outside the Top 30 to the Top 5 on Terminal Bench 2.0 — by changing only the harness while keeping the same model (GPT-5.2-Codex). Key improvements: self-verification loops, loop detection, context management, and time budgeting.12

The insight: a mediocre model in an excellent harness outperforms an excellent model with no harness. This is why Ethan Mollick observes that “the same model can behave very differently depending on what harness it’s operating in.”13

Guides and sensors

Birgitta Böckeler, Distinguished Engineer at Thoughtworks, published the most comprehensive treatment of harness engineering, proposing two categories of controls:11

Guides (feedforward controls) steer the agent before it acts:

  • Documentation — AGENTS.md files, architecture specs, coding standards that the agent reads before starting work
  • Bootstrap scripts — automated setup that ensures the agent starts in a known-good state
  • Constraints — explicit boundaries on what the agent can and cannot modify

Sensors (feedback controls) observe after the agent acts and enable self-correction:

  • Computational sensors — tests, linters, type checkers. Fast, deterministic, and cheap.
  • Inferential sensors — AI-powered code review, semantic analysis. Slower but capable of richer judgement.
  • Custom feedback — linter messages that include instructions for self-correction, so the agent can fix its own mistakes.
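
A computational sensor with custom feedback can be very small. This sketch uses Python's own parser as the deterministic check and wraps failures as self-correction instructions; the check and the message format are illustrative, not a prescribed design:

```python
import ast

def sensor_syntax_check(code):
    """Run a deterministic check on generated code.
    Return a self-correction instruction, or None if the check passes."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as err:
        # The message doubles as an instruction the agent can act on.
        return (f"Syntax error at line {err.lineno}: {err.msg}. "
                "Fix this line and resubmit the full file.")

feedback = sensor_syntax_check("def broken(:\n    pass")
```

When the sensor returns feedback, the harness feeds it back to the agent as the next prompt, closing the loop.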

Think of it like...

A harness is to an AI agent what a cockpit is to a pilot. The pilot (model) has the skill to fly, but the cockpit provides the instruments (sensors), the checklists (guides), the autopilot constraints (guardrails), and the black box (logging) that make flying safe and reliable. A pilot without a cockpit can still fly — but you would not board that plane.

Evaluations: the foundation of harness engineering

evaluations (evals) are systematic tests that measure whether an LLM system produces correct, useful, safe output. They are the single most important component of a harness because without them, you cannot know whether anything else is working.14

| Eval type | What it tests | Example |
| --- | --- | --- |
| Correctness | Does the output match expected answers? | Compare against gold-standard test cases |
| Consistency | Does the same input produce similar outputs? | Run the same prompt 10 times and measure variance |
| Safety | Does the output respect boundaries? | Test for hallucination, data leakage, harmful content |
| Regression | Did a change break something that worked before? | Run the full eval suite after every system change |

Anthropic’s guidance on long-running agents emphasises that effective harnesses “build verification into the workflow itself” rather than checking outputs only at the end.15
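
The consistency eval from the table above can be sketched in a few lines. `call_model` is a hypothetical stand-in that simulates wording variance in place of a real API call:

```python
import random

def call_model(prompt):
    """Hypothetical stand-in: a real implementation would call an LLM API."""
    return random.choice(["Paris", "Paris.", "The capital is Paris."])

def consistency_eval(prompt, runs=10):
    """Fraction of runs that agree with the most common output."""
    outputs = [call_model(prompt) for _ in range(runs)]
    top = max(set(outputs), key=outputs.count)
    return outputs.count(top) / runs

random.seed(0)  # deterministic for the example
score = consistency_eval("What is the capital of France?")
```

In production, "agreement" is usually semantic rather than exact-string, but the shape of the eval is the same: repeat, compare, score.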

Guardrails

guardrails are constraints that prevent the agent from taking actions outside its intended scope. They operate at multiple levels:

  • Input guardrails — validate and sanitise what goes into the model
  • Output guardrails — check what comes out before it reaches the user or executes an action
  • Action guardrails — restrict what tools the agent can use, what files it can modify, what APIs it can call
  • Escalation guardrails — human-in-the-loop checkpoints for high-stakes or ambiguous situations
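
An output guardrail is often just a deterministic check sitting between the model and the user. A sketch with a hypothetical blocked-pattern list:

```python
import re

# Hypothetical blocked patterns; a real list would be domain-specific.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{16}\b"),               # looks like a card number
    re.compile(r"(?i)api[_-]?key\s*[:=]"),   # looks like a leaked credential
]

def output_guardrail(response):
    """Check a model response before it reaches the user.
    Return (allowed, response-or-escalation-message)."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(response):
            return False, "Blocked: response matched a sensitive-data pattern."
    return True, response

ok, text = output_guardrail("Your order ships tomorrow.")
```

Blocked responses would typically route to an escalation guardrail rather than silently disappearing.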

The harness engineering principle

You do not make AI reliable by finding a better model. You make it reliable by building a system that catches errors, enforces boundaries, and improves with every failure. The model is the engine. The harness is everything else.


Part 5 — The three layers in practice

Prompt engineering, context engineering, and harness engineering are not competing approaches. They are concentric layers, each encompassing the one before:16

graph TD
    subgraph Harness Engineering
        subgraph Context Engineering
            subgraph Prompt Engineering
                PE[What you ask]
            end
            CE[What the model sees]
        end
        HE[How the system works]
    end

    style PE fill:#e8b84b,color:#fff
    style CE fill:#4a9ede,color:#fff
    style HE fill:#5cb85c,color:#fff

| Layer | Core question | What you build | When to invest |
| --- | --- | --- | --- |
| Prompt | How do I ask? | Instructions, examples, role definitions | Always — this is the foundation |
| Context | What does the model know? | Context structures, retrieval, memory, cascading | When prompts alone produce inconsistent results |
| Harness | How does the system work? | Evals, guardrails, sensors, guides, observability | When you ship to production or run agents autonomously |

The progression maps to maturity:

  • Exploring — you prompt directly in a chat interface. Prompt engineering is enough.
  • Building — you integrate AI into a workflow. Context engineering becomes essential.
  • Shipping — you deploy AI that runs without you watching. Harness engineering is mandatory.

In plain terms

Prompt engineering is writing a good email. Context engineering is making sure the recipient has all the background documents they need. Harness engineering is building the entire office system — email, filing, review process, quality checks — that ensures the work gets done correctly even when you are not in the room.


Part 6 — The spectrum: chatting to shipping

Not every use of AI requires all three layers. The investment should match the stakes:

graph LR
    A[Chat] -->|add structure| B[Workflow]
    B -->|add verification| C[Product]
    C -->|add autonomy| D[Agent System]

    style A fill:#e8b84b,color:#fff
    style B fill:#4a9ede,color:#fff
    style C fill:#5cb85c,color:#fff
    style D fill:#9b59b6,color:#fff

| Level | Example | Layers needed |
| --- | --- | --- |
| Chat | Asking Claude to explain a concept | Prompt only |
| Workflow | Using AI to draft, then reviewing yourself | Prompt + basic context |
| Product | AI feature in an app that users interact with | Prompt + context + evals |
| Agent system | AI that acts autonomously across sessions | All three layers, fully |

The key question

Ask yourself: “If this AI produces wrong output and I don’t notice for an hour, what happens?” If the answer is “nothing serious,” prompt engineering is enough. If the answer makes you nervous, you need a harness.


Part 7 — The map so far

graph TD
    LLM[LLM Engineering] --> PE[Prompt Engineering]
    PE --> IC[Intent Clarity]
    PE --> PC[Prompt Chaining]
    PE --> CH[Chunking]
    PE --> EX[Examples and Role]

    LLM --> CE[Context Engineering]
    CE --> CC[Context Cascading]
    CE --> GR[Granularity]
    CE --> RT[Retrieval and RAG]
    CE --> MM[Memory and State]

    LLM --> HE[Harness Engineering]
    HE --> EV[Evaluations]
    HE --> GU[Guardrails]
    HE --> GD[Guides - Feedforward]
    HE --> SN[Sensors - Feedback]
    HE --> OB[Observability]

    style LLM fill:#4a9ede,color:#fff

What you now understand

Mental models you have gained

  • Intent over incantation — clarity of what you want matters more than how you phrase it
  • Prompt engineering — the instruction layer: be specific, explain why, show examples, chain complex tasks
  • Context engineering — the information layer: curate what the model sees; less is more; structure and placement matter
  • Harness engineering — the system layer: Agent = Model + Harness; guides steer before, sensors correct after
  • The three layers are concentric — each encompasses the one before; invest in the layer that matches your stakes
  • Evaluations are non-negotiable — without evals, you cannot know if anything else works
  • The harness beats the model — a good system around a mediocre model outperforms a great model with no system


Where to go next

I want to understand how LLMs actually work inside

This path covers how to use LLMs. If you want to understand the machinery — tokenization, embeddings, next-token prediction, parameters, training, and the transformer architecture — read inside-the-machine.

Best for: People who want accurate mental models of the technology they are working with.

I want to understand the AI architecture underneath

This path covers how to work with LLMs. If you want to understand how agentic systems are structured — routing, orchestration, knowledge, and pipelines — read agentic-design.

Best for: People building AI systems, not just using AI tools.

I want to understand the software fundamentals

LLM engineering sits on top of software architecture. If you have not read it yet, from-zero-to-building covers the base layer: frontend, backend, APIs, databases, and the document chain from intent to code.

Best for: People who want the full stack picture.

I want to build something now

Take these principles and apply them. Start a project through the learning pipeline, define your intent, and the system will match relevant concepts and generate a learning path tailored to your project.

Best for: People who learn by building.


Sources



Footnotes

  1. Anthropic. (2026). Prompting Best Practices. Anthropic. The official prompting guide covering clarity, motivation, examples, and structure.

  2. OpenAI. (2026). Prompt Engineering Guide. OpenAI. Official best practices for GPT models.

  3. IBM. (2026). The 2026 Guide to Prompt Engineering. IBM. Covers model-specific prompting and the death of portable prompts.

  4. Maxim AI. (2026). Prompt Chaining for AI Engineers. Maxim AI. Practical guide to linear, branching, and self-correction chain patterns.

  5. Anthropic. (2025). Effective Context Engineering for AI Agents. Anthropic. The canonical reference on context curation, compaction, and sub-agent architectures.

  6. Lütke, T. (2025). Tweet on context engineering. June 19, 2025. The tweet that popularised “context engineering” as a term.

  7. Karpathy, A. (2025). Tweet on context engineering. June 2025. Endorsed the term and distinguished it from casual prompting.

  8. Chroma Research. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma. Empirical evidence that every frontier model degrades as context length increases.

  9. Liu, N.F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 2024 / Stanford. Demonstrates the U-shaped attention curve in long-context LLMs.

  10. Hashimoto, M. (2026). My AI Adoption Journey. mitchellh.com. The blog post that named harness engineering as a discipline.

  11. Böckeler, B. (2026). Harness Engineering for Coding Agent Users. Martin Fowler / Thoughtworks. The most comprehensive technical treatment of guides, sensors, and harness categories.

  12. LangChain. (2026). Improving Deep Agents with Harness Engineering. LangChain Blog. Demonstrates a jump from outside Top 30 to Top 5 on Terminal Bench by improving only the harness.

  13. Mollick, E. (2026). A Guide to Which AI to Use in the Agentic Era. One Useful Thing. Frames the Model/App/Harness distinction for a general audience.

  14. arXiv. (2026). LLM Readiness Harness: Evaluation, Observability, and CI Gates. arXiv. Technical framework for evaluation harnesses in production.

  15. Anthropic. (2025). Effective Harnesses for Long-Running Agents. Anthropic. Guidance on building verification into agent workflows.

  16. Bouchard, L-F. (2026). Harness Engineering: The Missing Layer Behind AI Agents. louisbouchard.ai. Contextualises harness engineering within the prompt-to-harness progression.