Agent Memory Systems: Giving Your AI Persistent Context
Every AI agent has a memory problem. Each call to the model is stateless. The model does not remember the last request. It does not remember the user’s name, preferences, or prior decisions. Without external memory, your agent starts from zero on every turn.
This guide covers four memory strategies — from the simplest to the most powerful — and when to use each one. All examples use TypeScript and the Anthropic SDK.
Why Agents Need Memory
Consider a code review agent. On the first run, the user explains their team's conventions: no `any` types, always handle errors explicitly, prefer functional style. The agent produces a good review.
On the second run, the user submits another file. The agent has forgotten everything. It misses the same convention violations it flagged yesterday.
This is the core problem: agents are stateless, but useful agents need continuity.
Memory gives agents three capabilities:
- Recall — retrieving facts from earlier in a conversation or from a prior session
- Personalization — adapting to user preferences over time
- Coordination — in multi-agent systems, sharing state between agents
The Four Memory Types
Before writing code, it helps to name the strategies clearly. Agent memory falls into four patterns:
| Type | Where stored | Scope | Best for |
|---|---|---|---|
| Buffer | In-context (messages array) | Current session | Short conversations |
| Summary | In-context (compressed) | Current session | Long conversations |
| Semantic | External (vector DB) | Cross-session | Knowledge recall |
| Episodic | External (key-value store) | Cross-session | User facts, preferences |
Each strategy has a different trade-off between simplicity, cost, and capability. Choose the minimum that solves your problem.
Strategy 1: Conversation Buffer
The simplest strategy is to pass the full conversation history to every model call. The model sees every prior message and responds with full context.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface Message {
  role: "user" | "assistant";
  content: string;
}

// The buffer holds the complete conversation history
const buffer: Message[] = [];

async function chat(userMessage: string): Promise<string> {
  // Append the new user message to the buffer
  buffer.push({ role: "user", content: userMessage });

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    // Pass the full buffer on every call
    messages: buffer,
  });

  const assistantMessage =
    response.content[0].type === "text" ? response.content[0].text : "";

  // Append the assistant reply to the buffer
  buffer.push({ role: "assistant", content: assistantMessage });

  return assistantMessage;
}

// Example usage
await chat("My name is Alex and I prefer Python over TypeScript.");
await chat("What language should I use for my next project?");
// The model remembers the preference stated in the first message
```

When to use it: Conversations under 20 turns. Simple chatbots. Prototypes.
The limit: Context windows are finite. A long session eventually exceeds the token limit, and the oldest messages must be dropped. The model loses early context silently, which causes inconsistent behavior.
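Rather than letting truncation happen silently, you can trim the buffer deliberately before each call. Here is a minimal sketch; the `estimateTokens` heuristic (roughly four characters per token for English text) and the `trimBuffer` helper are illustrative, not part of the SDK — for exact counts, use a real tokenizer or the API's token counting endpoint.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic: ~4 characters per token for English text.
// Replace with a real tokenizer for exact budgeting.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Drop the oldest messages until the buffer fits the budget,
// so context loss is explicit instead of silent.
function trimBuffer(buffer: Message[], maxTokens: number): Message[] {
  const trimmed = [...buffer];
  let total = trimmed.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  while (trimmed.length > 1 && total > maxTokens) {
    const removed = trimmed.shift()!;
    total -= estimateTokens(removed.content);
  }
  return trimmed;
}
```

Because the trim is explicit, you can log which messages were dropped, or hand them to the summary layer described next instead of discarding them.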
Strategy 2: Rolling Summary
When conversations run long, summarize old turns instead of dropping them. Maintain a compressed summary of what happened before a sliding window of recent messages.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface ConversationState {
  summary: string; // Compressed history of older turns
  recentMessages: Array<{ role: "user" | "assistant"; content: string }>;
  maxRecentTurns: number; // Keep the last N turns verbatim
}

const state: ConversationState = {
  summary: "",
  recentMessages: [],
  maxRecentTurns: 6, // 3 user + 3 assistant turns
};

// Compress older turns into the running summary
async function compressSummary(
  currentSummary: string,
  messagesToCompress: Array<{ role: string; content: string }>
): Promise<string> {
  const formatted = messagesToCompress
    .map((m) => `${m.role.toUpperCase()}: ${m.content}`)
    .join("\n");

  const response = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: `Update the conversation summary by adding the new exchanges below.
Return only the updated summary. Keep it under 200 words.

Current summary:
${currentSummary || "(none)"}

New exchanges:
${formatted}`,
      },
    ],
  });

  return response.content[0].type === "text"
    ? response.content[0].text
    : currentSummary;
}

async function chat(userMessage: string): Promise<string> {
  // When the buffer exceeds the limit, compress the oldest half
  if (state.recentMessages.length >= state.maxRecentTurns * 2) {
    const toCompress = state.recentMessages.splice(0, state.maxRecentTurns);
    state.summary = await compressSummary(state.summary, toCompress);
  }

  state.recentMessages.push({ role: "user", content: userMessage });

  // Build the messages array: summary preamble + recent verbatim turns
  const messages = [
    ...(state.summary
      ? [
          {
            role: "user" as const,
            content: `Context from earlier in this conversation:\n${state.summary}`,
          },
          {
            role: "assistant" as const,
            content: "Understood. I will use that context.",
          },
        ]
      : []),
    ...state.recentMessages,
  ];

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages,
  });

  const assistantMessage =
    response.content[0].type === "text" ? response.content[0].text : "";

  state.recentMessages.push({ role: "assistant", content: assistantMessage });
  return assistantMessage;
}
```

The summary call uses a fast, cheap model (Haiku). The main conversation uses the capable model (Sonnet). This split keeps costs low while preserving quality.
When to use it: Sessions over 20 turns. Support agents. Long-running task agents.
The limit: Summaries lose detail. Specific facts stated early in a conversation may be compressed into vague statements. For precise recall of specific facts, use episodic memory.
Strategy 3: Episodic Memory
Episodic memory stores specific facts about a user or session in a key-value store. The agent extracts facts during a conversation and retrieves them in future sessions.
```typescript
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs/promises";

const client = new Anthropic();

// In production, replace this file store with Redis or a database
const MEMORY_PATH = "./memory.json";

interface EpisodicStore {
  [userId: string]: Record<string, string>;
}

async function loadMemory(): Promise<EpisodicStore> {
  try {
    const data = await fs.readFile(MEMORY_PATH, "utf-8");
    return JSON.parse(data);
  } catch {
    return {};
  }
}

async function saveMemory(store: EpisodicStore): Promise<void> {
  await fs.writeFile(MEMORY_PATH, JSON.stringify(store, null, 2));
}

// Extract structured facts from the latest exchange
async function extractFacts(
  userMessage: string,
  assistantReply: string
): Promise<Record<string, string>> {
  const response = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Extract any personal facts, preferences, or important decisions from this exchange.
Return a JSON object of key-value pairs. Return {} if there is nothing worth remembering.

User: ${userMessage}
Assistant: ${assistantReply}`,
      },
    ],
  });

  try {
    const text =
      response.content[0].type === "text" ? response.content[0].text : "{}";
    // Extract JSON from the response
    const match = text.match(/\{[\s\S]*\}/);
    return match ? JSON.parse(match[0]) : {};
  } catch {
    return {};
  }
}

// Format stored facts as a system prompt injection
function formatMemory(facts: Record<string, string>): string {
  const entries = Object.entries(facts);
  if (entries.length === 0) return "";
  return (
    "What you know about this user:\n" +
    entries.map(([k, v]) => `- ${k}: ${v}`).join("\n")
  );
}

async function chat(userId: string, userMessage: string): Promise<string> {
  const store = await loadMemory();
  const userFacts = store[userId] ?? {};

  const systemPrompt = formatMemory(userFacts);

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemPrompt || undefined,
    messages: [{ role: "user", content: userMessage }],
  });

  const assistantMessage =
    response.content[0].type === "text" ? response.content[0].text : "";

  // Extract facts and persist them for future sessions
  const newFacts = await extractFacts(userMessage, assistantMessage);
  if (Object.keys(newFacts).length > 0) {
    store[userId] = { ...userFacts, ...newFacts };
    await saveMemory(store);
  }

  return assistantMessage;
}

// Session 1
await chat("user-123", "I prefer concise answers. No bullet points unless necessary.");
// Session 2 (a different process, days later)
await chat("user-123", "Explain how TCP handshakes work.");
// The agent recalls the formatting preference from session 1
```

When to use it: User-facing agents that run across sessions. Personalization. Preference tracking.
The limit: Facts accumulate over time. Stale facts can produce incorrect behavior (a user changes their preferred language; the old preference is still in the store). Add a mechanism to update or expire facts.
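One way to expire facts is to timestamp each entry on write and prune anything older than a TTL on read. This is a sketch under assumed names — `TimedFact`, `rememberFact`, and `pruneStale` are illustrative, not part of any library:

```typescript
interface TimedFact {
  value: string;
  updatedAt: number; // epoch milliseconds, refreshed on every write
}

type TimedStore = Record<string, TimedFact>;

// Writing a fact refreshes its timestamp, so preferences the
// user keeps restating never expire while abandoned ones age out.
function rememberFact(
  store: TimedStore,
  key: string,
  value: string,
  now: number = Date.now()
): void {
  store[key] = { value, updatedAt: now };
}

// Drop facts older than ttlMs before injecting them into the prompt
function pruneStale(
  store: TimedStore,
  ttlMs: number,
  now: number = Date.now()
): Record<string, string> {
  const fresh: Record<string, string> = {};
  for (const [key, fact] of Object.entries(store)) {
    if (now - fact.updatedAt <= ttlMs) {
      fresh[key] = fact.value;
    } else {
      delete store[key]; // expire in place
    }
  }
  return fresh;
}
```

A TTL is the bluntest instrument; a gentler variant keeps expired facts but marks them as "possibly outdated" in the prompt so the model can ask the user to confirm.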
Strategy 4: Semantic Memory (Vector Search)
Episodic memory stores discrete facts. Semantic memory stores documents, code, or conversation chunks indexed by meaning. When the agent needs information, it searches the index using a query.
This is the foundation of Retrieval-Augmented Generation (RAG).
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface Document {
  id: string;
  content: string;
  embedding?: number[];
}

// Generate an embedding vector for a text string
async function embed(text: string): Promise<number[]> {
  // Claude does not expose an embeddings endpoint directly.
  // Use a dedicated embedding model. This example uses a stub for illustration.
  // In production: use text-embedding-3-small (OpenAI), embed-english-v3 (Cohere),
  // or a self-hosted model like nomic-embed-text.
  throw new Error("Replace this stub with a real embedding call");
}

// Calculate cosine similarity between two vectors
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (magA * magB);
}

class SemanticMemory {
  private documents: Document[] = [];

  // Index a document for later retrieval
  async add(id: string, content: string): Promise<void> {
    const embedding = await embed(content);
    this.documents.push({ id, content, embedding });
  }

  // Retrieve the top-k most relevant documents for a query
  async search(query: string, topK = 3): Promise<Document[]> {
    const queryEmbedding = await embed(query);
    return this.documents
      .filter((doc) => doc.embedding)
      .map((doc) => ({
        doc,
        score: cosineSimilarity(queryEmbedding, doc.embedding!),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map(({ doc }) => doc);
  }
}

const memory = new SemanticMemory();

async function answerWithContext(question: string): Promise<string> {
  // Retrieve relevant documents
  const relevant = await memory.search(question);
  const context = relevant.map((d) => d.content).join("\n\n---\n\n");

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: context
      ? `Answer the question using the following context:\n\n${context}`
      : undefined,
    messages: [{ role: "user", content: question }],
  });

  return response.content[0].type === "text" ? response.content[0].text : "";
}

// Example: index a knowledge base and query it
await memory.add("doc-1", "The team uses React 18 with the app router pattern.");
await memory.add("doc-2", "All API routes must validate input with Zod schemas.");
await memory.add("doc-3", "Database queries use Drizzle ORM. Avoid raw SQL.");

const answer = await answerWithContext("How should I write a new API endpoint?");
// The agent retrieves doc-2 and doc-3 and generates a grounded answer
```

For production use, replace the in-memory store with a purpose-built vector database. pgvector (a PostgreSQL extension) is the simplest option if you already run Postgres. Chroma and Qdrant are good standalone options.
When to use it: Documentation assistants. Code search agents. Knowledge base Q&A. Anything requiring recall across a large corpus.
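One practical detail the example above glosses over: long documents are usually split into chunks before indexing, so each embedding covers one focused span of text. A minimal paragraph-based splitter might look like this — the `chunkByParagraph` name and the size threshold are illustrative choices, not a standard API:

```typescript
// Split text on blank lines, then pack paragraphs into chunks of at
// most maxChars, so each embedding represents a coherent topic.
function chunkByParagraph(text: string, maxChars = 1000): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // +2 accounts for the blank line re-inserted between paragraphs
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk then becomes one document in the index (`memory.add(\`doc-\${i}\`, chunk)`). Chunk size is a tuning knob: smaller chunks retrieve more precisely, larger chunks preserve more surrounding context.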
Connecting Memory to MCP
If you followed the MCP server guide, you can expose memory as MCP tools. This makes your memory layer accessible to any MCP-compatible client, including Claude Code.
```typescript
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "memory-server", version: "1.0.0" },
  { capabilities: { tools: {}, resources: {} } }
);

// Tools: store and recall facts.
// storeEpisodicFact and loadEpisodicFacts wrap the episodic store
// from Strategy 3.
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "remember") {
    const { key, value, userId } = request.params.arguments as {
      key: string;
      value: string;
      userId: string;
    };
    await storeEpisodicFact(userId, key, value);
    return { content: [{ type: "text", text: `Stored: ${key} = ${value}` }] };
  }

  if (request.params.name === "recall") {
    const { userId } = request.params.arguments as { userId: string };
    const facts = await loadEpisodicFacts(userId);
    return {
      content: [{ type: "text", text: JSON.stringify(facts, null, 2) }],
    };
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);
```

Any agent that speaks MCP can now call `remember` and `recall` as standard tools. Memory becomes a shared service rather than a per-agent implementation.
Choosing the Right Strategy
Use this decision table as a starting point:
| Situation | Strategy |
|---|---|
| Single session, under 20 turns | Buffer |
| Single session, long-running | Summary |
| Multi-session, user preferences | Episodic |
| Large knowledge base to query | Semantic |
| All of the above | Combine: summary for session + episodic/semantic across sessions |
Start with the buffer. Add a summary layer when conversations grow long. Add episodic memory when users return across sessions. Add semantic memory when you have a corpus of documents to retrieve from.
Do not add complexity before you need it. A conversation buffer is the right answer for most prototypes. Over-engineering memory early adds cost, latency, and maintenance surface area without proportional benefit.
Conclusion
Memory is not one thing — it is a stack of strategies that address different problems at different scopes. The buffer handles the current session. The summary extends that session. Episodic memory bridges sessions. Semantic memory grounds answers in a knowledge base.
Pick the layer that solves the problem you have today. Wrap it behind a clean interface so you can swap implementations later. And if your agents communicate via MCP, expose memory as a shared tool — it keeps each agent simple while giving the system as a whole a persistent brain.
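That "clean interface" can be very small. One possible shape — the `MemoryStore` name and its methods are illustrative, not from any library — is an interface that an in-memory, Redis, or pgvector backend can each implement:

```typescript
// A minimal contract each memory backend can satisfy
interface MemoryStore {
  set(key: string, value: string): Promise<void>;
  get(key: string): Promise<string | undefined>;
  // Naive keyword match here; a semantic backend would use embeddings
  search(query: string, topK?: number): Promise<string[]>;
}

class InMemoryStore implements MemoryStore {
  private entries = new Map<string, string>();

  async set(key: string, value: string): Promise<void> {
    this.entries.set(key, value);
  }

  async get(key: string): Promise<string | undefined> {
    return this.entries.get(key);
  }

  async search(query: string, topK = 3): Promise<string[]> {
    const q = query.toLowerCase();
    return [...this.entries.values()]
      .filter((v) => v.toLowerCase().includes(q))
      .slice(0, topK);
  }
}
```

Agent code that depends only on `MemoryStore` never has to change when you swap the prototype backend for a production one.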
Related Articles
- Building Your First MCP Server
- Multi-Agent Patterns: Orchestrators, Workers, and Pipelines
- Tool Use Patterns: Building Reliable Agent-Tool Interfaces
- Introducing Agentic Development