Context Engineering

The practice of curating, structuring, and managing all the information a language model can see when it generates a response — not just the prompt, but everything around it.


What is it?

Most people think working with AI is about writing good prompts. It is not. The prompt — the instruction you type — is a fraction of what the model actually sees. The model also sees a system prompt, retrieved documents, conversation history, tool definitions, memory from prior sessions, and metadata. All of it. At once. In a single context window. The quality of that entire context determines the quality of the output far more than the wording of any single instruction.1

In June 2025, Tobi Lütke, CEO of Shopify, proposed a reframe that caught fire across the industry: “I really like the term ‘context engineering’ over ‘prompt engineering.’ It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.”2 Andrej Karpathy, former head of AI at Tesla, agreed: “People associate prompts with short task descriptions. In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step.”3

The distinction matters because it shifts the focus from what you say to what the model knows. A mediocre prompt surrounded by excellent context produces better results than a brilliant prompt surrounded by noise. Prompt engineering asks “how should I phrase this?” Context engineering asks “what does the model need to see before I ask anything at all?”4

Philipp Schmid, a technical lead at Hugging Face, put it concisely: think of an LLM as a CPU and its context window as RAM. Your job is that of the operating system: loading that working memory with just the right code and data for the task.5

In plain terms

Context engineering is the skill of preparing everything an AI model will read before it answers you. It is the difference between asking a colleague a question cold and asking them the same question after handing them the relevant documents, the project history, and a clear brief. The question is the same. The answer is vastly different.


How does it work?

The context window — finite and competitive

Every language model has a context window — the maximum amount of text it can process at once. Claude’s window is up to 200,000 tokens. GPT-4o supports 128,000. Gemini goes to 2 million. These numbers sound enormous, but they are smaller than they appear, because every token competes for the model’s attention.1
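For a concrete sense of scale, you can measure how much of a window a piece of text consumes with a tokenizer. This is a minimal sketch assuming OpenAI’s open-source tiktoken library; the encoding choice is illustrative, since tokenizers differ by model:

```python
# A minimal sketch of token budgeting, assuming the tiktoken library
# (pip install tiktoken). Encodings differ by model; cl100k_base is
# chosen purely for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tokens(text: str) -> int:
    """How many tokens of the context window a string consumes."""
    return len(enc.encode(text))

document = "All of it. At once. In a single context window. " * 2_000
window = 128_000  # e.g. a 128K-token model

used = tokens(document)
print(f"{used} tokens used, {window - used} remaining")
```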

Research from Chroma demonstrates that every frontier model exhibits context rot — performance degrades as the context window fills up. Accuracy drops start around 32,000 tokens for most models. The model does not “forget” information in the window; it spreads its attention thinner across more tokens, making it more likely to miss or misweight critical information.6

Stanford’s “Lost in the Middle” research shows the pattern is not uniform: LLMs attend most to information at the beginning and end of the context, with a significant accuracy dip for information placed in the middle — a U-shaped attention curve.7

Think of it like...

A desk. You can pile as many documents on it as you like, but if the desk is covered in irrelevant papers, you will struggle to find the one fact you need. Context engineering is keeping the desk clear except for exactly the documents this task requires — and placing the most important ones where your eyes naturally land.

What goes in — the five sources of context

Anthropic’s engineering guide identifies five categories of information that fill a context window:1

| Source | What it is | Example |
| --- | --- | --- |
| System prompt | Standing instructions loaded every session | “You are a senior data analyst. Always cite sources.” |
| Retrieved documents | Information pulled from external sources at runtime | Database records, API responses, RAG results |
| Conversation history | Prior messages in the current session | User’s earlier questions and the model’s responses |
| Tool definitions | Descriptions of tools the model can call | File read, web search, database query schemas |
| Memory and state | Persistent information from prior sessions | User preferences, project decisions, past outputs |

The challenge is that all five compete for the same finite window. Loading a 50-page document as context leaves less room for conversation history. Adding 20 tool definitions reduces space for retrieved data. Context engineering is the discipline of making these trade-offs deliberately rather than by accident.1
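To make the trade-off concrete, here is a minimal sketch of a context assembler that fills the window from the five sources in a fixed priority order and drops whatever no longer fits. Everything in it (the Block type, the priority order, the whitespace token approximation) is a hypothetical illustration, not an API from Anthropic’s guide:

```python
# A hypothetical context assembler. Token counts are approximated by a
# whitespace word count to keep the sketch dependency-free; a real
# system would use the model's own tokenizer.
from dataclasses import dataclass

@dataclass
class Block:
    source: str  # one of: "system", "tools", "memory", "retrieved", "history"
    text: str

PRIORITY = {"system": 0, "tools": 1, "memory": 2, "retrieved": 3, "history": 4}

def approx_tokens(text: str) -> int:
    return len(text.split())

def assemble(blocks: list[Block], budget: int) -> str:
    """Fill the window in priority order; skip anything that no longer fits."""
    used, kept = 0, []
    for block in sorted(blocks, key=lambda b: PRIORITY[b.source]):
        cost = approx_tokens(block.text)
        if used + cost > budget:
            continue  # the trade-off is made deliberately, not by accident
        kept.append(block)
        used += cost
    return "\n\n".join(b.text for b in kept)
```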

How to manage it — four core strategies

Anthropic’s guide documents four production-tested strategies for keeping context healthy:1

1. Compaction

As conversations grow, older messages accumulate and consume the window. Compaction summarises the conversation while preserving key decisions, creating a compressed representation that retains meaning with fewer tokens. The model “reads its own notes” instead of re-processing every prior message.
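A minimal sketch of what a compaction step can look like, assuming the summarisation itself is delegated back to the model; the summarise placeholder below is hypothetical, not a real API:

```python
# A hypothetical compaction step. `summarise` stands in for a call back
# to the model ("read these messages, keep the key decisions"); the
# one-liner below is a placeholder, not a real summariser.
def summarise(messages: list[str]) -> str:
    return "Summary of earlier turns: " + " / ".join(m[:40] for m in messages)

def compact(history: list[str], keep_recent: int = 10) -> list[str]:
    """Replace older messages with one summary; keep recent turns verbatim."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarise(older)] + recent
```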

2. Structured note-taking

Rather than keeping everything in the conversation, the agent writes persistent notes to files outside the context window. These notes can be retrieved later when relevant, but they do not consume context when they are not needed. This is how agent-memory works in practice — the model externalises information to a persistent store and retrieves it on demand.
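A minimal sketch of a note store, assuming notes live as files on disk; the directory layout, topic names, and helper functions are illustrative only:

```python
# A hypothetical note store: the agent writes durable notes to disk and
# pulls one back only when a task needs it, so notes consume no context
# tokens in between. Paths and topic names are illustrative.
from pathlib import Path

NOTES = Path("agent_notes")
NOTES.mkdir(exist_ok=True)

def write_note(topic: str, content: str) -> None:
    (NOTES / f"{topic}.md").write_text(content)

def read_note(topic: str) -> str | None:
    path = NOTES / f"{topic}.md"
    return path.read_text() if path.exists() else None

write_note("project-decisions", "2025-06-19: chose PostgreSQL over SQLite.")
# ...many turns later, loaded only when the topic becomes relevant:
snippet = read_note("project-decisions")
```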

3. Sub-agent isolation

Complex tasks are decomposed into sub-tasks, each handled by a specialised agent with a clean context window. The sub-agent processes its task in isolation and returns only a distilled summary to the parent agent. This prevents context pollution — the phenomenon where irrelevant information from one task degrades performance on another.1

```mermaid
graph TD
    O[Orchestrator Agent] --> S1[Sub-Agent: Research]
    O --> S2[Sub-Agent: Analysis]
    O --> S3[Sub-Agent: Writing]
    S1 -->|summary only| O
    S2 -->|summary only| O
    S3 -->|summary only| O

    style O fill:#4a9ede,color:#fff
```
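A minimal sketch of the pattern above, where run_llm is a hypothetical stand-in for a model call; the point is that each sub-agent starts from a clean context and hands back only a distilled summary:

```python
# A hypothetical orchestrator. `run_llm` stands in for a model call;
# each sub-agent starts from a fresh context containing only its own
# task and returns a distilled summary, never its full transcript.
def run_llm(context: str) -> str:
    return f"<model output for: {context[:40]}...>"  # placeholder

def sub_agent(task: str) -> str:
    clean_context = f"Task: {task}"  # no pollution from sibling tasks
    full_result = run_llm(clean_context)
    return run_llm(f"Summarise the result in three bullets: {full_result}")

def orchestrate(tasks: list[str]) -> str:
    summaries = [sub_agent(t) for t in tasks]  # each runs in isolation
    return run_llm("Combine these summaries:\n" + "\n".join(summaries))

print(orchestrate(["research sources", "analyse data", "draft report"]))
```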

4. Just-in-time loading

Instead of pre-loading all potentially relevant information, the agent maintains lightweight references (file paths, URLs, query identifiers) and loads data into context only when a specific task requires it. This keeps the context lean until the moment the information is actually needed.1
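A minimal sketch, assuming references are simple file paths held in a dict; the agent carries the cheap references and resolves one into full content only at the step that needs it:

```python
# A hypothetical just-in-time loader. The agent carries only lightweight
# references (a few tokens each); a file's full content enters context
# only at the moment a step needs it. Paths are illustrative.
from pathlib import Path

references = {
    "quarterly-report": Path("data/q2_report.txt"),
    "db-schema": Path("db/schema.sql"),
}

def load(ref: str) -> str:
    """Resolve a reference into full content only when a task requires it."""
    path = references[ref]
    return path.read_text() if path.exists() else f"<missing: {path}>"

# Until this line runs, the report costs nothing but its dict entry:
report_text = load("quarterly-report")
```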

The fundamental principle

Anthropic’s guide frames it as an optimisation problem: find “the smallest set of high-signal tokens that maximise the likelihood of your desired outcome.” Every irrelevant token in context actively degrades performance. Less is more — but the right less.1
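One way to read that optimisation is as a greedy value-per-token selection. This sketch assumes relevance scores are supplied from elsewhere (for example, an embedding similarity search); nothing in it comes from the guide itself:

```python
# A sketch of the principle as a greedy value-per-token selection.
# Relevance scores are assumed inputs (e.g. from an embedding search);
# token costs are approximated by word count for simplicity.
def select(snippets: list[tuple[str, float]], budget: int) -> list[str]:
    """snippets: (text, relevance) pairs. Keep the densest signal first."""
    ranked = sorted(
        snippets,
        key=lambda s: s[1] / max(len(s[0].split()), 1),  # signal per token
        reverse=True,
    )
    chosen, used = [], 0
    for text, _score in ranked:
        cost = len(text.split())
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen
```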

How to structure it — placement and ordering

What goes in matters. Where it goes matters just as much.1

| Principle | Why it works |
| --- | --- |
| Reference material first, query last | Exploits the model’s recency bias; queries placed at the end improve response quality |
| Static content before dynamic content | Enables prompt caching (up to 90% cost reduction) and gives stable context higher positional priority |
| Most important at the edges | Mitigates the “Lost in the Middle” effect: critical information performs best at the beginning or end of context |
| Layer from general to specific | context-cascading: global rules, then domain context, then task instructions. Each layer narrows the next |
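Putting those placement rules together, a minimal (hypothetical) prompt builder might order its sections like this:

```python
# A hypothetical prompt builder that applies the placement rules above:
# static, cacheable content first; reference material next; the query
# last, where recency bias gives it the most attention.
def build_prompt(system: str, documents: list[str], history: str, query: str) -> str:
    return "\n\n".join([
        system,                  # static first: stable prefix enables caching
        "\n\n".join(documents),  # reference material before the question
        history,                 # dynamic content, changes every turn
        f"Question: {query}",    # query last: the high-attention position
    ])
```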

Context engineering vs prompt engineering

The two are not opposites — they are layers. Prompt engineering is about crafting the instruction. Context engineering is about curating everything around it. In practice, most failures attributed to “bad prompts” are actually context failures: the model had the right instruction but the wrong (or insufficient) surrounding information.4

|  | Prompt engineering | Context engineering |
| --- | --- | --- |
| Focus | What you say to the model | What the model knows when you say it |
| Scope | The instruction or query | System prompt + documents + history + tools + memory |
| Typical fix | Rewrite the prompt | Restructure, filter, or augment the context |
| Analogy | Writing a better question on an exam | Studying the right material before the exam |

Think of it like...

Prompt engineering is learning how to ask good questions in a meeting. Context engineering is preparing the pre-read documents, setting the agenda, and inviting the right people to the room. You need both, but the pre-work determines whether the meeting is productive regardless of how well you phrase your questions.


Why do we use it?

Key reasons

1. The same prompt produces wildly different results depending on context. Ask “write a validation function” with no context and you get generic code. Ask it with your codebase’s patterns, conventions, and test framework loaded, and you get code that fits. The prompt is identical. The context changes everything.4

2. Context rot is real and measurable. Every frontier model degrades as context fills up. Context engineering is the discipline of keeping the window lean, relevant, and well-structured so the model can reason effectively.6

3. It unlocks autonomous agents. An agent that manages its own context — compacting, externalising notes, loading information just in time — can operate over long sessions without degrading. An agent with no context strategy runs out of window or drowns in noise.1

4. It is the highest-leverage skill for AI practitioners. Karpathy’s framing is widely cited because it is true: in production AI systems, the context around the prompt matters more than the prompt itself.3


When do we use it?

  • When building any AI-assisted workflow that involves more than a single chat exchange
  • When an AI system needs access to external knowledge (documents, databases, codebases)
  • When running long conversations where earlier context risks being lost
  • When deploying autonomous agents that must manage their own information across steps
  • When AI outputs are inconsistent despite identical prompts — the problem is almost always context, not the prompt
  • When costs are high — better context curation reduces token usage and enables caching

Rule of thumb

If you have tried improving the prompt three times and the output is still wrong, stop rewriting the prompt. Look at what else is in the context window. Nine times out of ten, that is where the problem lives.


How can I think about it?

The briefing room analogy

Imagine you are a general about to make a critical decision. You walk into a briefing room. Context engineering determines what is on the walls and tables when you arrive:

  • System prompt = the standing orders posted on the wall. Always there, always applicable.
  • Retrieved documents = the intelligence reports an aide has selected for this specific decision.
  • Conversation history = the notes from prior meetings, summarised (compacted) to the key decisions.
  • Tool definitions = the communication equipment available — radio, satellite, drone feeds.
  • Memory = the classified files in the secure cabinet, pulled out only when relevant.

A well-prepared briefing room makes the decision obvious. A poorly prepared one — cluttered with irrelevant reports, missing key intelligence — leads to bad decisions regardless of how smart the general is.

The chef analogy

A chef (the model) receives an order (the prompt). But the quality of the dish depends far more on what is in the kitchen:

  • Mise en place (context curation) — the right ingredients, pre-measured, at the right temperature, within arm’s reach.
  • Recipe card (system prompt) — the standing technique for this type of dish.
  • Pantry (retrieval) — the full inventory, but you only pull what this dish needs.
  • Fridge (memory) — ingredients prepared yesterday, stored for today’s service.

A mediocre chef with perfect mise en place produces a better dish than a great chef rummaging through a disorganised kitchen. Context engineering is the mise en place of AI.


Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| context-cascading | Layering context from general to specific for progressive understanding | complete |
| rag | Retrieving external documents to inject into context at runtime | complete |
| agent-memory | Persistent storage outside the context window, retrieved on demand | complete |
| context-rot | How model performance degrades as context fills up | stub |
| context-compaction | Summarising conversation history to free context space | stub |
| just-in-time-loading | Loading data into context only when a task requires it | stub |
| harness-engineering | The system layer that manages context alongside evals and guardrails | complete |

Some of these cards don't exist yet

They’ll be created as the knowledge system grows. A broken link is a placeholder for future learning, not an error.


Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    AS[Agentic Systems] --> CE[Context Engineering]
    AS --> HE[Harness Engineering]
    AS --> LP[LLM Pipelines]
    CE --> CC[Context Cascading]
    CE --> RAG[RAG]
    CE --> AM[Agent Memory]
    CE -.->|managed by| HE
    style CE fill:#4a9ede,color:#fff
```

Related concepts:

  • context-cascading — the specific pattern of layering context from general to specific, implementing context engineering’s “right information, right order” principle
  • rag — retrieval-augmented generation loads external knowledge into context at runtime, a core context engineering strategy
  • agent-memory — externalising information outside the context window so it can be retrieved when needed
  • harness-engineering — the system layer that manages context engineering alongside evaluations and guardrails

Sources


Footnotes

  1. Anthropic. (2025). Effective Context Engineering for AI Agents. Anthropic Engineering. The canonical guide to context curation, compaction, sub-agent isolation, and just-in-time loading.

  2. Lütke, T. (2025). Tweet on context engineering. June 19, 2025. The tweet that popularised “context engineering” as a term.

  3. Karpathy, A. (2025). Tweet on context engineering. June 2025. Endorsed the term and distinguished it from casual prompting.

  4. Willison, S. (2025). Context Engineering. simonwillison.net. Why “context engineering” is a better frame than “prompt engineering” and what the shift means in practice.

  5. Schmid, P. (2025). The New Skill in AI is Not Prompting, It’s Context Engineering. philschmid.de. Practical introduction with the CPU/RAM analogy and context window management strategies.

  6. Chroma Research. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma. Empirical evidence that every frontier model degrades as context length increases.

  7. Liu, N.F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 2024 / Stanford. Demonstrates the U-shaped attention curve in long-context LLMs.