Agent Memory Systems: Giving Your AI Persistent Context
Every AI agent has a memory problem. Each call to the model is stateless. The model does not remember the last request. It does not remember the user’s name, preferences, or prior decisions. Without external memory, your agent starts from zero on every turn.
This guide covers four memory strategies — from the simplest to the most powerful — and when to use each one. All examples use TypeScript and the Anthropic SDK.
Why Agents Need Memory
Consider a code review agent. On the first run, the user explains their team's conventions: no `any` types, always handle errors explicitly, prefer functional style. The agent produces a good review.
On the second run, the user submits another file. The agent has forgotten everything. It misses the same convention violations it flagged yesterday.
This is the core problem: agents are stateless, but useful agents need continuity.
Memory gives agents three capabilities:
- Recall — retrieving facts from earlier in a conversation or from a prior session
- Personalization — adapting to user preferences over time
- Coordination — in multi-agent systems, sharing state between agents
The Four Memory Types
Before writing code, it helps to name the strategies clearly. Agent memory falls into four patterns:
| Type | Where stored | Scope | Best for |
|---|---|---|---|
| Buffer | In-context (messages array) | Current session | Short conversations |
| Summary | In-context (compressed) | Current session | Long conversations |
| Semantic | External (vector DB) | Cross-session | Knowledge recall |
| Episodic | External (key-value store) | Cross-session | User facts, preferences |
Each strategy has a different trade-off between simplicity, cost, and capability. Choose the minimum that solves your problem.
Strategy 1: Conversation Buffer
The simplest strategy is to pass the full conversation history to every model call. The model sees every prior message and responds with full context.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface Message {
  role: "user" | "assistant";
  content: string;
}

// The buffer holds the complete conversation history
const buffer: Message[] = [];

async function chat(userMessage: string): Promise<string> {
  // Append the new user message to the buffer
  buffer.push({ role: "user", content: userMessage });

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    // Pass the full buffer on every call
    messages: buffer,
  });

  const assistantMessage =
    response.content[0].type === "text" ? response.content[0].text : "";

  // Append the assistant reply to the buffer
  buffer.push({ role: "assistant", content: assistantMessage });

  return assistantMessage;
}

// Example usage
await chat("My name is Alex and I prefer Python over TypeScript.");
await chat("What language should I use for my next project?");
// The model remembers the preference stated in the first message
```

When to use it: Conversations under 20 turns. Simple chatbots. Prototypes.
The limit: Context windows are finite. A long session eventually exceeds the token limit, and the oldest messages must be dropped. The model loses early context silently, which causes inconsistent behavior.
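Rather than letting truncation happen silently, you can trim the buffer deliberately before each call. Here is a minimal sketch; the `estimateTokens` heuristic (roughly four characters per token for English text) and the `trimBuffer` helper are illustrative, not part of the SDK — for exact counts, use a real tokenizer or the API's token counting endpoint.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic: ~4 characters per token for English text.
// Replace with a real tokenizer for exact budgeting.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Drop the oldest messages until the buffer fits the budget,
// so context loss is explicit instead of silent.
function trimBuffer(buffer: Message[], maxTokens: number): Message[] {
  const trimmed = [...buffer];
  let total = trimmed.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  while (trimmed.length > 1 && total > maxTokens) {
    const removed = trimmed.shift()!;
    total -= estimateTokens(removed.content);
  }
  return trimmed;
}
```

Because the trim is explicit, you can log which messages were dropped, or hand them to the summary layer described next instead of discarding them.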
Strategy 2: Rolling Summary
When conversations run long, summarize old turns instead of dropping them. Maintain a compressed summary of what happened before a sliding window of recent messages.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface ConversationState {
  summary: string; // Compressed history of older turns
  recentMessages: Array<{ role: "user" | "assistant"; content: string }>;
  maxRecentTurns: number; // Keep the last N turns verbatim
}

const state: ConversationState = {
  summary: "",
  recentMessages: [],
  maxRecentTurns: 6, // 3 user + 3 assistant turns
};

// Compress older turns into the running summary
async function compressSummary(
  currentSummary: string,
  messagesToCompress: Array<{ role: string; content: string }>
): Promise<string> {
  const formatted = messagesToCompress
    .map((m) => `${m.role.toUpperCase()}: ${m.content}`)
    .join("\n");

  const response = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: `Update the conversation summary by adding the new exchanges below.
Return only the updated summary. Keep it under 200 words.

Current summary:
${currentSummary || "(none)"}

New exchanges:
${formatted}`,
      },
    ],
  });

  return response.content[0].type === "text"
    ? response.content[0].text
    : currentSummary;
}

async function chat(userMessage: string): Promise<string> {
  // When the buffer exceeds the limit, compress the oldest half
  if (state.recentMessages.length >= state.maxRecentTurns * 2) {
    const toCompress = state.recentMessages.splice(0, state.maxRecentTurns);
    state.summary = await compressSummary(state.summary, toCompress);
  }

  state.recentMessages.push({ role: "user", content: userMessage });

  // Build the messages array: summary preamble + recent verbatim turns
  const messages = [
    ...(state.summary
      ? [
          {
            role: "user" as const,
            content: `Context from earlier in this conversation:\n${state.summary}`,
          },
          {
            role: "assistant" as const,
            content: "Understood. I will use that context.",
          },
        ]
      : []),
    ...state.recentMessages,
  ];

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages,
  });

  const assistantMessage =
    response.content[0].type === "text" ? response.content[0].text : "";

  state.recentMessages.push({ role: "assistant", content: assistantMessage });
  return assistantMessage;
}
```

The summary call uses a fast, cheap model (Haiku). The main conversation uses the capable model (Sonnet). This split keeps costs low while preserving quality.
When to use it: Sessions over 20 turns. Support agents. Long-running task agents.
The limit: Summaries lose detail. Specific facts stated early in a conversation may be compressed into vague statements. For precise recall of specific facts, use episodic memory.
Strategy 3: Episodic Memory
Episodic memory stores specific facts about a user or session in a key-value store. The agent extracts facts during a conversation and retrieves them in future sessions.
```typescript
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs/promises";

const client = new Anthropic();

// In production, replace this file store with Redis or a database
const MEMORY_PATH = "./memory.json";

interface EpisodicStore {
  [userId: string]: Record<string, string>;
}

async function loadMemory(): Promise<EpisodicStore> {
  try {
    const data = await fs.readFile(MEMORY_PATH, "utf-8");
    return JSON.parse(data);
  } catch {
    return {};
  }
}

async function saveMemory(store: EpisodicStore): Promise<void> {
  await fs.writeFile(MEMORY_PATH, JSON.stringify(store, null, 2));
}

// Extract structured facts from the latest exchange
async function extractFacts(
  userMessage: string,
  assistantReply: string
): Promise<Record<string, string>> {
  const response = await client.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Extract any personal facts, preferences, or important decisions from this exchange.
Return a JSON object of key-value pairs. Return {} if there is nothing worth remembering.

User: ${userMessage}
Assistant: ${assistantReply}`,
      },
    ],
  });

  try {
    const text =
      response.content[0].type === "text" ? response.content[0].text : "{}";
    // Extract JSON from the response
    const match = text.match(/\{[\s\S]*\}/);
    return match ? JSON.parse(match[0]) : {};
  } catch {
    return {};
  }
}

// Format stored facts as a system prompt injection
function formatMemory(facts: Record<string, string>): string {
  const entries = Object.entries(facts);
  if (entries.length === 0) return "";
  return (
    "What you know about this user:\n" +
    entries.map(([k, v]) => `- ${k}: ${v}`).join("\n")
  );
}

async function chat(userId: string, userMessage: string): Promise<string> {
  const store = await loadMemory();
  const userFacts = store[userId] ?? {};

  const systemPrompt = formatMemory(userFacts);

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemPrompt || undefined,
    messages: [{ role: "user", content: userMessage }],
  });

  const assistantMessage =
    response.content[0].type === "text" ? response.content[0].text : "";

  // Extract facts and persist them for future sessions
  const newFacts = await extractFacts(userMessage, assistantMessage);
  if (Object.keys(newFacts).length > 0) {
    store[userId] = { ...userFacts, ...newFacts };
    await saveMemory(store);
  }

  return assistantMessage;
}

// Session 1
await chat("user-123", "I prefer concise answers. No bullet points unless necessary.");
// Session 2 (a different process, days later)
await chat("user-123", "Explain how TCP handshakes work.");
// The agent recalls the formatting preference from session 1
```

When to use it: User-facing agents that run across sessions. Personalization. Preference tracking.
The limit: Facts accumulate over time. Stale facts can produce incorrect behavior (a user changes their preferred language; the old preference is still in the store). Add a mechanism to update or expire facts.
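One way to expire facts is to timestamp each entry on write and prune anything older than a TTL on read. This is a sketch under assumed names — `TimedFact`, `rememberFact`, and `pruneStale` are illustrative, not part of any library:

```typescript
interface TimedFact {
  value: string;
  updatedAt: number; // epoch milliseconds, refreshed on every write
}

type TimedStore = Record<string, TimedFact>;

// Writing a fact refreshes its timestamp, so preferences the
// user keeps restating never expire while abandoned ones age out.
function rememberFact(
  store: TimedStore,
  key: string,
  value: string,
  now: number = Date.now()
): void {
  store[key] = { value, updatedAt: now };
}

// Drop facts older than ttlMs before injecting them into the prompt
function pruneStale(
  store: TimedStore,
  ttlMs: number,
  now: number = Date.now()
): Record<string, string> {
  const fresh: Record<string, string> = {};
  for (const [key, fact] of Object.entries(store)) {
    if (now - fact.updatedAt <= ttlMs) {
      fresh[key] = fact.value;
    } else {
      delete store[key]; // expire in place
    }
  }
  return fresh;
}
```

A TTL is the bluntest instrument; a gentler variant keeps expired facts but marks them as "possibly outdated" in the prompt so the model can ask the user to confirm.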
Strategy 4: Semantic Memory (Vector Search)
Episodic memory stores discrete facts. Semantic memory stores documents, code, or conversation chunks indexed by meaning. When the agent needs information, it searches the index using a query.
This is the foundation of Retrieval-Augmented Generation (RAG).
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface Document {
  id: string;
  content: string;
  embedding?: number[];
}

// Generate an embedding vector for a text string
async function embed(text: string): Promise<number[]> {
  // Claude does not expose an embeddings endpoint directly.
  // Use a dedicated embedding model. This example uses a stub for illustration.
  // In production: use text-embedding-3-small (OpenAI), embed-english-v3 (Cohere),
  // or a self-hosted model like nomic-embed-text.
  throw new Error("Replace this stub with a real embedding call");
}

// Calculate cosine similarity between two vectors
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (magA * magB);
}

class SemanticMemory {
  private documents: Document[] = [];

  // Index a document for later retrieval
  async add(id: string, content: string): Promise<void> {
    const embedding = await embed(content);
    this.documents.push({ id, content, embedding });
  }

  // Retrieve the top-k most relevant documents for a query
  async search(query: string, topK = 3): Promise<Document[]> {
    const queryEmbedding = await embed(query);
    return this.documents
      .filter((doc) => doc.embedding)
      .map((doc) => ({
        doc,
        score: cosineSimilarity(queryEmbedding, doc.embedding!),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map(({ doc }) => doc);
  }
}

const memory = new SemanticMemory();

async function answerWithContext(question: string): Promise<string> {
  // Retrieve relevant documents
  const relevant = await memory.search(question);
  const context = relevant.map((d) => d.content).join("\n\n---\n\n");

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: context
      ? `Answer the question using the following context:\n\n${context}`
      : undefined,
    messages: [{ role: "user", content: question }],
  });

  return response.content[0].type === "text" ? response.content[0].text : "";
}

// Example: index a knowledge base and query it
await memory.add("doc-1", "The team uses React 18 with the app router pattern.");
await memory.add("doc-2", "All API routes must validate input with Zod schemas.");
await memory.add("doc-3", "Database queries use Drizzle ORM. Avoid raw SQL.");

const answer = await answerWithContext("How should I write a new API endpoint?");
// The agent retrieves doc-2 and doc-3 and generates a grounded answer
```

For production use, replace the in-memory store with a purpose-built vector database. pgvector (a PostgreSQL extension) is the simplest option if you already run Postgres. Chroma and Qdrant are good standalone options.
When to use it: Documentation assistants. Code search agents. Knowledge base Q&A. Anything requiring recall across a large corpus.
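One practical detail the example above glosses over: long documents are usually split into chunks before indexing, so each embedding covers one focused span of text. A minimal paragraph-based splitter might look like this — the `chunkByParagraph` name and the size threshold are illustrative choices, not a standard API:

```typescript
// Split text on blank lines, then pack paragraphs into chunks of at
// most maxChars, so each embedding represents a coherent topic.
function chunkByParagraph(text: string, maxChars = 1000): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // +2 accounts for the blank line re-inserted between paragraphs
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk then becomes one document in the index (`memory.add(\`doc-\${i}\`, chunk)`). Chunk size is a tuning knob: smaller chunks retrieve more precisely, larger chunks preserve more surrounding context.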
Connecting Memory to MCP
If you followed the MCP server guide, you can expose memory as MCP tools. This makes your memory layer accessible to any MCP-compatible client, including Claude Code.
```typescript
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "memory-server", version: "1.0.0" },
  { capabilities: { tools: {}, resources: {} } }
);

// Tools: store and recall facts.
// storeEpisodicFact and loadEpisodicFacts wrap the episodic store
// from Strategy 3.
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === "remember") {
    const { key, value, userId } = request.params.arguments as {
      key: string;
      value: string;
      userId: string;
    };
    await storeEpisodicFact(userId, key, value);
    return { content: [{ type: "text", text: `Stored: ${key} = ${value}` }] };
  }

  if (request.params.name === "recall") {
    const { userId } = request.params.arguments as { userId: string };
    const facts = await loadEpisodicFacts(userId);
    return {
      content: [{ type: "text", text: JSON.stringify(facts, null, 2) }],
    };
  }

  throw new Error(`Unknown tool: ${request.params.name}`);
});

const transport = new StdioServerTransport();
await server.connect(transport);
```

Any agent that speaks MCP can now call `remember` and `recall` as standard tools. Memory becomes a shared service rather than a per-agent implementation.
Choosing the Right Strategy
Use this decision table as a starting point:
| Situation | Strategy |
|---|---|
| Single session, under 20 turns | Buffer |
| Single session, long-running | Summary |
| Multi-session, user preferences | Episodic |
| Large knowledge base to query | Semantic |
| All of the above | Combine: summary for session + episodic/semantic across sessions |
Start with the buffer. Add a summary layer when conversations grow long. Add episodic memory when users return across sessions. Add semantic memory when you have a corpus of documents to retrieve from.
Do not add complexity before you need it. A conversation buffer is the right answer for most prototypes. Over-engineering memory early adds cost, latency, and maintenance surface area without proportional benefit.
Conclusion
Memory is not one thing — it is a stack of strategies that address different problems at different scopes. The buffer handles the current session. The summary extends that session. Episodic memory bridges sessions. Semantic memory grounds answers in a knowledge base.
Pick the layer that solves the problem you have today. Wrap it behind a clean interface so you can swap implementations later. And if your agents communicate via MCP, expose memory as a shared tool — it keeps each agent simple while giving the system as a whole a persistent brain.
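That "clean interface" can be very small. One possible shape — the `MemoryStore` name and its methods are illustrative, not from any library — is an interface that an in-memory, Redis, or pgvector backend can each implement:

```typescript
// A minimal contract each memory backend can satisfy
interface MemoryStore {
  set(key: string, value: string): Promise<void>;
  get(key: string): Promise<string | undefined>;
  // Naive keyword match here; a semantic backend would use embeddings
  search(query: string, topK?: number): Promise<string[]>;
}

class InMemoryStore implements MemoryStore {
  private entries = new Map<string, string>();

  async set(key: string, value: string): Promise<void> {
    this.entries.set(key, value);
  }

  async get(key: string): Promise<string | undefined> {
    return this.entries.get(key);
  }

  async search(query: string, topK = 3): Promise<string[]> {
    const q = query.toLowerCase();
    return [...this.entries.values()]
      .filter((v) => v.toLowerCase().includes(q))
      .slice(0, topK);
  }
}
```

Agent code that depends only on `MemoryStore` never has to change when you swap the prototype backend for a production one.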
Related Articles
- Building Your First MCP Server
- Multi-Agent Patterns: Orchestrators, Workers, and Pipelines
- Tool Use Patterns: Building Reliable Agent-Tool Interfaces
- Introducing Agentic Development