Agent Memory

How AI agents remember information across conversation turns and sessions — the architectural systems that give stateless language models the ability to retain and recall context.


What is it?

Large language models are stateless by default.1 Every time you start a new conversation, the model has no memory of anything you discussed before. It does not know your name, your preferences, or what it told you yesterday. Each interaction begins from a blank slate.

This is a fundamental problem for agentic systems. An agent that forgets everything between sessions cannot build on previous work, learn from past mistakes, or maintain a coherent relationship with its user. Memory is what transforms a stateless text generator into something that feels like a persistent collaborator.2

Agent memory is not a single mechanism — it is an architectural concern that spans at least three distinct types: short-term memory (the current conversation), working memory (a scratchpad for active reasoning), and long-term memory (persistent knowledge that survives across sessions).3 Each type solves a different problem, and most production agents need all three.

The fundamental constraint underlying all agent memory is the context window — the maximum amount of text a language model can process in a single call.4 Everything the model “knows” during a given interaction must fit inside this window. Memory engineering is, at its core, the art of deciding what goes into that limited space and what stays out.
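This budgeting can be sketched as a simple check. A toy sketch: the four-characters-per-token estimate and the 8,000-token default are illustrative assumptions, not any particular model's figures.

```python
# Toy sketch of a context-window budget. The token estimate and the
# window size are illustrative assumptions, not real model parameters.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly four characters per token."""
    return max(1, len(text) // 4)

def fits_in_window(sections: list[str], window_limit: int = 8000) -> bool:
    """Check whether candidate context sections fit inside the window."""
    return sum(estimate_tokens(s) for s in sections) <= window_limit
```

Everything a memory system wants to surface (history, scratchpad, retrieved facts) competes for the same budget this check enforces.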

In plain terms

A language model without memory is like a brilliant expert with amnesia. Every time you walk into their office, they have no idea who you are or what you discussed last time. Agent memory is the system of notes, files, and records that lets this expert pick up where they left off — the conversation summary pinned to their desk, the working notes on their whiteboard, and the client file in their cabinet.



How does it work?

1. Short-term memory: conversation history

The most basic form of memory. The system keeps a record of previous messages in the current conversation and includes them in each new request to the model.1 This is how a chatbot “remembers” what you said three messages ago.

The limitation is the context window. As conversations grow longer, the history eventually exceeds what the model can process. At that point, the system must decide what to keep and what to discard — typically using strategies like truncation (drop the oldest messages), summarisation (compress earlier messages into a summary), or sliding windows (keep only the most recent N turns).3
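These trimming strategies can be sketched in a few lines of Python. A toy sketch: messages are assumed to be `(role, text)` tuples and tokens are estimated at roughly four characters each; a production system would use the model's actual tokenizer.

```python
# Toy sketches of two short-term memory strategies: a sliding window
# and budget-based truncation. Messages are (role, text) tuples.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly four characters per token."""
    return max(1, len(text) // 4)

def sliding_window(history, max_turns):
    """Keep only the most recent N messages."""
    return history[-max_turns:]

def truncate_to_budget(history, budget):
    """Drop the oldest messages until the remainder fits the token budget."""
    kept = list(history)
    while kept and sum(estimate_tokens(t) for _, t in kept) > budget:
        kept.pop(0)  # discard the oldest message first
    return kept
```

Both strategies answer the whiteboard question below in the bluntest possible way: the oldest notes are always the ones erased.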

Think of it like...

A whiteboard in a meeting room. Everyone can see what has been written so far. But the whiteboard has a fixed size — once it is full, someone has to erase the oldest notes to make room for new ones. The question is always: what is safe to erase?

2. Working memory: the scratchpad

Working memory is a dedicated space where the agent stores intermediate results, plans, and reasoning notes during a complex task.2 Unlike conversation history (which records what was said), working memory records what the agent is thinking and doing.

For example, a research agent working through a multi-step analysis might maintain a scratchpad with: the current plan, which steps are complete, key findings so far, and open questions. This scratchpad is injected into the context window alongside the conversation history, giving the model access to its own prior reasoning.
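One way to picture such a scratchpad is as a small structured object that the agent updates as it works and renders into its prompt. A hypothetical sketch: the field names and the `render()` format are illustrative, not a standard API.

```python
# Hypothetical working-memory scratchpad for a research agent. The
# render() output would be injected into the context window alongside
# the conversation history.

from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    plan: list[str] = field(default_factory=list)
    completed: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def complete_step(self, step: str) -> None:
        """Move a step from the plan to the completed list."""
        if step in self.plan:
            self.plan.remove(step)
        self.completed.append(step)

    def render(self) -> str:
        """Format the scratchpad as text for the context window."""
        sections = [
            ("Plan", self.plan),
            ("Completed", self.completed),
            ("Findings", self.findings),
            ("Open questions", self.open_questions),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)
```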

Think of it like...

A detective’s case board — the wall covered in photos, notes, and red string connecting clues. It is not a record of conversations; it is a record of active thinking. The detective refers to it constantly while working, and it helps maintain coherence across a long investigation.

3. Long-term memory: persistent knowledge

Long-term memory stores information that persists across sessions — user preferences, past project context, accumulated knowledge, learned facts.4 This is the most architecturally complex type because it requires a storage system external to the model itself.

Common implementations include:

| Approach | How it works | Best for |
| --- | --- | --- |
| Vector databases | Store text as embeddings, retrieve by semantic similarity | Finding relevant past context from large collections |
| Key-value stores | Store structured facts (user preferences, settings) | Quick lookup of specific known information |
| Knowledge graphs | Store entities and relationships | Complex queries about how things connect |
| File-based memory | Store summaries and notes as files the agent can read | Simple persistence in file-based workflows |

The retrieval challenge is central: the agent must decide which long-term memories are relevant to the current task and pull only those into the context window.3 Retrieving too much wastes precious context space. Retrieving too little means the agent lacks critical information.
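Relevance filtering is commonly implemented with embedding similarity. The sketch below assumes memories have already been embedded as plain vectors; a real system would call an embedding model and query a vector database rather than scoring every memory in a loop.

```python
# Toy sketch of relevance-based retrieval from long-term memory.
# Memories are dicts with pre-computed embedding vectors (here, toy
# 3-dimensional lists rather than real embeddings).

import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memories, top_k=2):
    """Return the top-k memories most similar to the query vector."""
    scored = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]),
                    reverse=True)
    return [m["text"] for m in scored[:top_k]]
```

The `top_k` parameter is exactly the too-much/too-little dial described above: raise it and retrieval crowds the context window; lower it and critical memories may be left behind.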

Concept to explore

See rag for the architectural pattern of retrieving external knowledge and injecting it into the context window at inference time.

4. The compression trade-off

Every memory system faces the same fundamental trade-off: remembering everything vs keeping context focused.2 A model that receives its entire history — every conversation, every fact, every intermediate thought — will drown in irrelevant information and perform poorly. A model that receives too little will miss critical context.

Effective memory engineering uses several compression strategies:

  • Summarisation: Replace a 50-message conversation with a 5-sentence summary
  • Relevance filtering: Only retrieve memories that match the current query
  • Hierarchical memory: Store detailed records but surface only summaries, expanding to detail on demand
  • Forgetting: Deliberately discard information that is outdated or unlikely to be needed again
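The summarisation strategy above can be sketched as a simple threshold rule. Here `summarise()` is a stand-in for a real LLM summarisation call; the threshold and message format are illustrative.

```python
# Toy sketch of history compression: once the conversation grows past a
# threshold, older messages are collapsed into a single summary entry.

def summarise(messages):
    """Placeholder for an LLM call that compresses messages into a summary."""
    return f"[Summary of {len(messages)} earlier messages]"

def compress_history(history, keep_recent=3):
    """Replace all but the most recent messages with a single summary."""
    if len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarise(older)] + recent
```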

Key distinction

Context engineering is the broader practice of designing what goes into a model’s context window. Memory engineering is the subset focused specifically on persisting and retrieving information across turns and sessions.5 Both are about the same finite resource — the context window — but memory engineering adds the dimension of time.


Why do we use it?

Key reasons

1. Continuity. Without memory, every interaction is isolated. Memory lets an agent build on prior work, maintain relationships with users, and accumulate knowledge over time — turning disconnected exchanges into a coherent, evolving collaboration.2

2. Efficiency. An agent that remembers a user’s preferences, past decisions, and project context does not need to ask the same questions repeatedly. Memory eliminates redundant work and reduces the cognitive load on the user.4

3. Quality. Decisions informed by historical context are better decisions. An agent that remembers what approaches failed last time, what the user explicitly rejected, or what constraints were discovered mid-project produces higher-quality output.3

4. Trust. Users trust systems that remember them. An agent that forgets your name between sessions feels broken. One that recalls your preferences and builds on prior conversations feels like a genuine collaborator.


When do we use it?

  • When an agent needs to maintain context across multiple sessions (not just within a single conversation)
  • When tasks are long-running and span days or weeks of iterative work
  • When the agent serves repeat users who expect personalisation and continuity
  • When the agent must learn from past interactions — avoiding repeated mistakes or building on accumulated knowledge
  • When the context window is too small to hold all relevant information at once

Rule of thumb

If the agent only needs to handle a single question-and-answer exchange, memory is unnecessary. If the agent needs to remember anything from a previous turn, session, or user interaction, memory is an architectural requirement — not a nice-to-have feature.


How can I think about it?

The three notebooks

Imagine a consultant who carries three notebooks:

  • The conversation notebook (short-term memory): Notes from the current meeting. Written in real time, reviewed during the discussion, and eventually archived or discarded. If the meeting runs very long, the consultant flips back through the notebook but can only hold so many pages in their active attention.
  • The project notebook (working memory): A dedicated scratchpad for the active project. Contains the current plan, open questions, decisions made, and next steps. Carried to every meeting about this project. Thrown away when the project ends.
  • The client file (long-term memory): A folder in the filing cabinet with everything the consultant knows about this client — past projects, preferences, org chart, lessons learned. Pulled out before each meeting, but only the relevant sections are reviewed.

The consultant’s effectiveness depends on all three working together. Miss the conversation notebook and you lose the thread. Miss the project notebook and you lose the plan. Miss the client file and you start from zero every engagement.

The restaurant regular

Think about what happens when a regular customer walks into a restaurant:

  • Short-term memory: The waiter remembers what you ordered tonight and that you asked for no onions (current conversation).
  • Working memory: The kitchen’s order board tracks which courses are prepared, which are pending, and any special modifications (active task state).
  • Long-term memory: The reservation system notes that you are a regular, you prefer the corner table, you are allergic to shellfish, and you always order the house red (persistent profile).

A restaurant that only has short-term memory treats every visit as a first visit. A restaurant with all three memory types makes you feel known and valued. The same principle applies to AI agents — memory is what transforms a transactional tool into a trusted assistant.


Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| context-cascading | How to structure and layer context for AI systems | complete |
| rag | Retrieving external knowledge and injecting it at inference time | stub |
| knowledge-graphs | Storing entities and relationships for structured retrieval | stub |

Some cards don't exist yet

A broken link is a placeholder for future learning, not an error.


Where this concept fits

Position in the knowledge graph

graph TD
    AIML[AI and Machine Learning] --> AS[Agentic Systems]
    AS --> AMEM[Agent Memory]
    AS --> LLM[LLM Pipelines]
    AS --> ORCH[Orchestration]
    style AMEM fill:#4a9ede,color:#fff

Related concepts:

  • context-cascading — a pattern for structuring what information is loaded into the context window and in what order
  • rag — retrieval-augmented generation is a primary implementation pattern for long-term agent memory
  • knowledge-graphs — a structured storage approach for long-term memory that preserves relationships between facts

Footnotes

  1. Mem0. (2026). State of AI Agent Memory 2026. Mem0.

  2. Seah, N. (2026). Memory for AI Agents: A New Paradigm of Context Engineering. The New Stack.

  3. Alake, R. (2025). Architecting Agent Memory: Principles, Patterns, and Best Practices. AI Engineer Conference / MongoDB.

  4. MongoDB. (2025). Don’t Just Build Agents, Build Memory-Augmented AI Agents. MongoDB.

  5. Bazeley, M. (2026). Why Multi-Agent Systems Need Memory Engineering. O’Reilly Radar.