Caching Strategies for AI Agents: Cutting Costs Without Cutting Corners
Your research agent runs 50 tasks a day. Each task starts with the same 3,000-token system prompt, the same 2,500-token tool definitions, and the same 5,000-token knowledge base. That’s 10,500 tokens repeated 50 times — 525,000 tokens per day that the model reads identically every time. At standard pricing, you’re paying to process the same content over and over.
Caching is the lowest-effort, highest-impact optimization for production agents. Prompt caching alone can reduce input token costs by 90% for cached content. Combined with tool result caching, you can often cut total agent costs in half — without sacrificing quality, accuracy, or flexibility.
In this article, you’ll learn three caching layers for agents — prompt caching, tool result caching, and response memoization — with implementation patterns using Claude’s API and a clear cost analysis to guide your decisions.
Section 1: Agent Cost Anatomy
Before optimizing, you need to understand where your tokens are actually going. Let’s break down a typical agentic workflow.
Token Breakdown Per Turn
In a standard agent architecture, every API call to the model includes:
| Component | Approx. Tokens | Repeated? |
|---|---|---|
| System prompt | ~1,500 | Every turn |
| Tool definitions | ~2,000 | Every turn |
| Knowledge base / context | ~3,000 | Every turn |
| Conversation history | Growing | Accumulates |
| Tool results | ~500 avg | Per tool call |
The static overhead — system prompt, tool definitions, and knowledge base — totals roughly 6,500 tokens per turn, and it’s identical every single time.
The Compound Cost Problem
Agents don’t make one API call. They loop. A typical 15-step agent run looks like this:
Without caching:
Each of the 15 steps sends the full context. Assuming conversation history grows by ~300 tokens per step on average:
- Static tokens per step: 6,500
- Total static tokens across 15 steps: 97,500
- Conversation history tokens (cumulative): ~33,750
- Tool result tokens: ~7,500
- Total input tokens: ~138,750
At Claude Sonnet’s input pricing of $3 per million tokens, that’s about $0.42 per task. Run 50 tasks per day and you’re looking at $21/day — just on input tokens.
With prompt caching:
Cached tokens are charged at $0.30 per million for cache reads (90% discount). If you cache the 6,500 static tokens:
- Cached token reads (steps 2–15, after the first-step write): 14 × 6,500 = 91,000 tokens × $0.30/M ≈ $0.027
- Non-cached tokens remain at standard pricing: ~41,250 × $3/M = $0.124
- Cache write (first step): 6,500 × $3.75/M = $0.024
- Total input cost: ~$0.18 per task
That’s a 57% reduction from caching alone. At 50 tasks per day, you save over $12/day — roughly $360/month — with minimal code changes.
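The arithmetic above can be reproduced in a few lines of Python. This is a sketch using the Sonnet rates and token estimates quoted in this section; the variable names are mine, not part of any API:

```python
# Sketch: reproduce the per-task input-cost estimate above.
# Rates are the Claude Sonnet prices quoted in the text ($ per million tokens).
STANDARD = 3.00      # standard input pricing
CACHE_WRITE = 3.75   # 25% premium on the first (cache-miss) step
CACHE_READ = 0.30    # 90% discount on subsequent cache hits

STATIC = 6_500       # system prompt + tool definitions + knowledge base
STEPS = 15
DYNAMIC = 41_250     # conversation history + tool results across all steps

# Without caching: every step re-reads the static prefix at full price.
no_cache = (STATIC * STEPS + DYNAMIC) * STANDARD / 1e6

# With caching: one write on step 1, reads on steps 2-15, dynamic tokens at full price.
with_cache = (
    STATIC * CACHE_WRITE                  # cache write, step 1
    + STATIC * (STEPS - 1) * CACHE_READ   # cache reads, steps 2-15
    + DYNAMIC * STANDARD
) / 1e6

print(f"no cache:   ${no_cache:.3f} per task")
print(f"with cache: ${with_cache:.3f} per task")
print(f"savings:    {1 - with_cache / no_cache:.0%}")
```

The computed savings land at roughly 58%, within rounding of the figures above.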
Section 2: Prompt Caching with Claude
Claude’s prompt caching API is the single most impactful optimization you can apply to agentic workflows. It lets you tell the API “this part of my prompt hasn’t changed — reuse the cached computation.”
How Prompt Caching Works
When you send a request with caching enabled, Anthropic’s infrastructure:
- Checks if a cached version of the marked prefix exists
- On cache miss: Processes the full prompt, caches the prefix, and charges a cache write fee (25% premium over standard input pricing)
- On cache hit: Reuses the cached computation, charging only the cache read fee (90% discount from standard input pricing)
The cache has a 5-minute TTL (Time to Live — the duration before cached data expires). Every cache hit resets the TTL, so active agents keep their cache alive naturally. Batch jobs with gaps longer than 5 minutes between calls will incur cache write costs more frequently.
What to Cache
Not everything can or should be cached. Cache stable, repeated prefixes:
- System prompts — Almost always identical across runs. Cache these first.
- Tool definitions — Your tool schemas rarely change between calls.
- Static knowledge bases — Reference documents, guidelines, policies.
- Conversation prefixes — For multi-turn conversations, cache the earlier turns that won’t change.
Important constraints:
- Caching is prefix-based — you can only cache content from the beginning of the prompt, in order. You can’t cache a section in the middle while leaving earlier sections uncached.
- There’s a minimum cacheable prompt length — 1,024 tokens for Claude Sonnet and Opus models, and 2,048 for Haiku.
- You can set up to 4 cache breakpoints in a single request.
Implementation
Here’s a complete example showing an agent setup with prompt caching:
Before (no caching):

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a research assistant specializing in market analysis.
You have access to tools for searching databases, reading files, and performing
calculations. Always cite your sources and provide confidence levels for your
findings. [... detailed instructions totaling ~1,500 tokens ...]"""

TOOLS = [
    {
        "name": "search_database",
        "description": "Search the company database for market data, competitor info, or financial records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "database": {"type": "string", "enum": ["market", "competitors", "financial"]},
                "limit": {"type": "integer", "description": "Max results", "default": 10},
            },
            "required": ["query", "database"],
        },
    },
    {
        "name": "read_file",
        "description": "Read the contents of a research file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Path to the file"}
            },
            "required": ["file_path"],
        },
    },
    # ... more tools totaling ~2,000 tokens in definitions
]

KNOWLEDGE_BASE = """## Company Policies and Guidelines
[... reference material totaling ~3,000 tokens ...]"""

def run_agent(user_query: str):
    messages = [{"role": "user", "content": user_query}]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=f"{SYSTEM_PROMPT}\n\n{KNOWLEDGE_BASE}",
        tools=TOOLS,
        messages=messages,
    )
    # Log token usage
    print(f"Input tokens: {response.usage.input_tokens}")
    return response
```

After (with prompt caching):
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a research assistant specializing in market analysis.
You have access to tools for searching databases, reading files, and performing
calculations. Always cite your sources and provide confidence levels for your
findings. [... detailed instructions totaling ~1,500 tokens ...]"""

KNOWLEDGE_BASE = """## Company Policies and Guidelines
[... reference material totaling ~3,000 tokens ...]"""

TOOLS = [
    {
        "name": "search_database",
        "description": "Search the company database for market data, competitor info, or financial records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "database": {"type": "string", "enum": ["market", "competitors", "financial"]},
                "limit": {"type": "integer", "description": "Max results", "default": 10},
            },
            "required": ["query", "database"],
        },
    },
    {
        "name": "read_file",
        "description": "Read the contents of a research file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Path to the file"}
            },
            "required": ["file_path"],
        },
        "cache_control": {"type": "ephemeral"},  # Cache breakpoint after tools
    },
]

def run_agent_cached(user_query: str):
    messages = [{"role": "user", "content": user_query}]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
            },
            {
                "type": "text",
                "text": KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"},  # Cache breakpoint
            },
        ],
        tools=TOOLS,
        messages=messages,
    )
    # Log token usage — now includes cache metrics
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    return response
```

The key changes:
- The `system` parameter becomes a list of content blocks (instead of a plain string) so you can attach `cache_control` to specific blocks.
- Add `"cache_control": {"type": "ephemeral"}` at your desired breakpoints.
- The last tool definition gets a `cache_control` block to cache all tool definitions as well.
- Monitor the new `cache_read_input_tokens` and `cache_creation_input_tokens` fields in the response.
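To confirm that caching is actually paying off in production, those usage fields can be converted into dollars per request. Here's a minimal sketch — the `RATES` values are the Sonnet prices used in this article, and the `input_cost` helper is illustrative, not part of the SDK (note that `input_tokens` counts only non-cached input):

```python
from types import SimpleNamespace  # stand-in for response.usage in this sketch

# Claude Sonnet prices used in this article ($ per million tokens).
RATES = {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75}

def input_cost(usage) -> float:
    """Estimate the dollar cost of one request's input tokens, cache-aware.

    `usage` is any object exposing input_tokens, cache_read_input_tokens,
    and cache_creation_input_tokens (as response.usage does).
    """
    return (
        usage.input_tokens * RATES["input"]
        + (usage.cache_read_input_tokens or 0) * RATES["cache_read"]
        + (usage.cache_creation_input_tokens or 0) * RATES["cache_write"]
    ) / 1e6

# Example: 1,000 fresh tokens plus a 6,500-token cache hit.
usage = SimpleNamespace(
    input_tokens=1_000,
    cache_read_input_tokens=6_500,
    cache_creation_input_tokens=0,
)
print(f"${input_cost(usage):.6f}")  # well under a cent per request
```

Logging this per call makes cache regressions (for example, an accidentally changed system prompt) show up immediately as a cost spike.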
Cache Lifetime Considerations
The 5-minute TTL has practical implications:
- Interactive agents (responding to users in real-time): Cache stays warm naturally since calls happen within seconds of each other.
- Batch agents (processing queues): If there’s more than 5 minutes between tasks, you’ll pay the cache write fee again. Consider batching tasks to keep the cache warm.
- Multi-tenant systems: Cache is scoped to exact content matches. Two agents with identical system prompts share the cache benefit, but even a one-character difference means a cache miss.
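For batch workloads, it can help to check whether the cache is likely still warm before deciding how to group the next tasks. A minimal sketch of the idea — the `CacheWarmthTracker` class is my own illustration, using the 5-minute TTL from above:

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # the ephemeral cache's 5-minute TTL

class CacheWarmthTracker:
    """Track the time of the last API call to predict cache hits vs. re-writes."""

    def __init__(self):
        self.last_call_at: float | None = None

    def record_call(self) -> None:
        """Call this after each API request; every hit resets the TTL."""
        self.last_call_at = time.monotonic()

    def cache_is_warm(self) -> bool:
        """True if the next call should hit the cache (TTL not yet expired)."""
        if self.last_call_at is None:
            return False  # the first call always pays the write fee
        return (time.monotonic() - self.last_call_at) < CACHE_TTL_SECONDS
```

A queue worker could use this to drain related tasks back-to-back while `cache_is_warm()` is true, instead of interleaving unrelated work and paying repeated cache writes.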
Section 3: Tool Result Caching
Prompt caching reduces token costs. Tool result caching reduces execution costs and latency — especially when your tools call expensive external APIs, run database queries, or perform complex computations.
When to Cache Tool Results
The key question is idempotency: does calling the tool with the same arguments always return the same result?
| Tool Type | Cacheable? | Examples |
|---|---|---|
| Database lookups (read-only) | ✅ Yes | search_database("revenue 2024") |
| File reads | ✅ Yes | read_file("report.pdf") |
| Static API calls | ✅ Yes | get_exchange_rate("USD", "EUR") (short TTL) |
| Write operations | ❌ No | send_email(...), update_record(...) |
| Real-time data | ⚠️ Carefully | get_stock_price(...) (very short TTL) |
Cache Key Design
A good cache key uniquely identifies the exact call. Hash the tool name + serialized arguments:
```python
import hashlib
import json
import time
from functools import wraps
from typing import Any, Optional

class ToolResultCache:
    """In-memory cache for tool results with TTL support."""

    def __init__(self):
        self._cache: dict[str, dict[str, Any]] = {}
        self.hits = 0
        self.misses = 0

    def _make_key(self, tool_name: str, arguments: dict) -> str:
        """Create a deterministic cache key from tool name and arguments."""
        key_data = json.dumps(
            {"tool": tool_name, "args": arguments},
            sort_keys=True,
        )
        return hashlib.sha256(key_data.encode()).hexdigest()
```
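The same keying idea can be packaged as a decorator on idempotent tool functions, so the cache check happens before the expensive call ever runs. A sketch under the same assumptions — the `cached_tool` decorator, its `ttl` parameter, and the `search_database` stub are illustrative, not part of any library:

```python
import functools
import hashlib
import json
import time

def cached_tool(ttl: float = 300.0):
    """Memoize an idempotent tool function, keyed on its serialized arguments."""
    def decorator(fn):
        store: dict[str, tuple[float, object]] = {}

        @functools.wraps(fn)
        def wrapper(**kwargs):
            # Same deterministic key scheme as above: tool name + sorted args.
            key_data = json.dumps({"tool": fn.__name__, "args": kwargs}, sort_keys=True)
            key = hashlib.sha256(key_data.encode()).hexdigest()
            entry = store.get(key)
            if entry is not None and time.monotonic() - entry[0] < ttl:
                return entry[1]  # cache hit: skip the real call
            result = fn(**kwargs)
            store[key] = (time.monotonic(), result)
            return result
        return wrapper
    return decorator

CALL_COUNT = {"n": 0}  # count real executions to show the cache working

@cached_tool(ttl=60.0)
def search_database(query: str, database: str, limit: int = 10):
    CALL_COUNT["n"] += 1
    # Hypothetical expensive lookup; replace with a real query.
    return [f"{database} result for {query!r}"]

search_database(query="revenue 2024", database="market")  # executes the lookup
search_database(query="revenue 2024", database="market")  # served from cache
print(CALL_COUNT["n"])  # prints 1: only the first call ran
```

Only read-only, idempotent tools should get this decorator; applying it to a write operation would silently drop repeated writes.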
---
## Related Articles
- [Agent Cost Optimization: A Practical Guide to Reducing API Spend](/blog/agent-cost-optimization-a-practical-guide-to-reducing-api-spend/)
- [Agent Error Recovery: 5 Patterns for Production Reliability](/blog/agent-error-recovery-patterns/)
- [Tool Use Patterns: Building Reliable Agent-Tool Interfaces](/blog/agent-tool-use-patterns/)