Caching Strategies for AI Agents: Cutting Costs Without Cutting Corners
Your research agent runs 50 tasks a day. Each task starts with the same 3,000-token system prompt, the same 2,500-token tool definitions, and the same 5,000-token knowledge base. That’s 10,500 tokens repeated 50 times — 525,000 tokens per day that the model reads identically every time. At standard pricing, you’re paying to process the same content over and over.
Caching is the lowest-effort, highest-impact optimization for production agents. Prompt caching alone can reduce input token costs by 90% for cached content. Combined with tool result caching, you can often cut total agent costs in half — without sacrificing quality, accuracy, or flexibility.
In this article, you’ll learn three caching layers for agents — prompt caching, tool result caching, and response memoization — with implementation patterns using Claude’s API and a clear cost analysis to guide your decisions.
Section 1: Agent Cost Anatomy
Before optimizing, you need to understand where your tokens are actually going. Let’s break down a typical agentic workflow.
Token Breakdown Per Turn
In a standard agent architecture, every API call to the model includes:
| Component | Approx. Tokens | Repeated? |
|---|---|---|
| System prompt | ~1,500 | Every turn |
| Tool definitions | ~2,000 | Every turn |
| Knowledge base / context | ~3,000 | Every turn |
| Conversation history | Growing | Accumulates |
| Tool results | ~500 avg | Per tool call |
The static overhead — system prompt, tool definitions, and knowledge base — totals roughly 6,500 tokens per turn, and it’s identical every single time.
The Compound Cost Problem
Agents don’t make one API call. They loop. A typical 15-step agent run looks like this:
Without caching:
Each of the 15 steps sends the full context. Assuming conversation history grows by ~300 tokens per step on average:
- Static tokens per step: 6,500
- Total static tokens across 15 steps: 97,500
- Conversation history tokens (cumulative): ~33,750
- Tool result tokens: ~7,500
- Total input tokens: ~138,750
At Claude Sonnet’s input pricing of $3 per million tokens, that’s about $0.42 per task. Run 50 tasks per day and you’re looking at $21/day — just on input tokens.
With prompt caching:
Cached tokens are charged at $0.30 per million for cache reads (90% discount). If you cache the 6,500 static tokens:
- Cached token reads (steps 2–15, after the first-step write): 14 × 6,500 = 91,000 tokens × $0.30/M ≈ $0.027
- Non-cached tokens remain at standard pricing: ~41,250 × $3/M = $0.124
- Cache write (first step): 6,500 × $3.75/M = $0.024
- Total input cost: ~$0.18 per task
That’s a 57% reduction from caching alone. At 50 tasks per day, you save over $12/day — roughly $360/month — with minimal code changes.
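The arithmetic above can be reproduced in a few lines of Python. This is a sketch using the Sonnet rates and token estimates quoted in this section; the variable names are mine, not part of any API:

```python
# Sketch: reproduce the per-task input-cost estimate above.
# Rates are the Claude Sonnet prices quoted in the text ($ per million tokens).
STANDARD = 3.00      # standard input pricing
CACHE_WRITE = 3.75   # 25% premium on the first (cache-miss) step
CACHE_READ = 0.30    # 90% discount on subsequent cache hits

STATIC = 6_500       # system prompt + tool definitions + knowledge base
STEPS = 15
DYNAMIC = 41_250     # conversation history + tool results across all steps

# Without caching: every step re-reads the static prefix at full price.
no_cache = (STATIC * STEPS + DYNAMIC) * STANDARD / 1e6

# With caching: one write on step 1, reads on steps 2-15, dynamic tokens at full price.
with_cache = (
    STATIC * CACHE_WRITE                  # cache write, step 1
    + STATIC * (STEPS - 1) * CACHE_READ   # cache reads, steps 2-15
    + DYNAMIC * STANDARD
) / 1e6

print(f"no cache:   ${no_cache:.3f} per task")
print(f"with cache: ${with_cache:.3f} per task")
print(f"savings:    {1 - with_cache / no_cache:.0%}")
```

The computed savings land at roughly 58%, within rounding of the figures above.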
Section 2: Prompt Caching with Claude
Claude’s prompt caching API is the single most impactful optimization you can apply to agentic workflows. It lets you tell the API “this part of my prompt hasn’t changed — reuse the cached computation.”
How Prompt Caching Works
When you send a request with caching enabled, Anthropic’s infrastructure:
- Checks if a cached version of the marked prefix exists
- On cache miss: Processes the full prompt, caches the prefix, and charges a cache write fee (25% premium over standard input pricing)
- On cache hit: Reuses the cached computation, charging only the cache read fee (90% discount from standard input pricing)
The cache has a 5-minute TTL (Time to Live — the duration before cached data expires). Every cache hit resets the TTL, so active agents keep their cache alive naturally. Batch jobs with gaps longer than 5 minutes between calls will incur cache write costs more frequently.
What to Cache
Not everything can or should be cached. Cache stable, repeated prefixes:
- System prompts — Almost always identical across runs. Cache these first.
- Tool definitions — Your tool schemas rarely change between calls.
- Static knowledge bases — Reference documents, guidelines, policies.
- Conversation prefixes — For multi-turn conversations, cache the earlier turns that won’t change.
Important constraints:
- Caching is prefix-based — you can only cache content from the beginning of the prompt, in order. You can’t cache a section in the middle while leaving earlier sections uncached.
- There’s a minimum cacheable prompt length — 1,024 tokens for Claude Sonnet and Opus models, and 2,048 for Haiku.
- You can set up to 4 cache breakpoints in a single request.
Implementation
Here’s a complete example showing an agent setup with prompt caching:
Before (no caching):

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a research assistant specializing in market analysis.
You have access to tools for searching databases, reading files, and performing
calculations. Always cite your sources and provide confidence levels for your
findings. [... detailed instructions totaling ~1,500 tokens ...]"""

TOOLS = [
    {
        "name": "search_database",
        "description": "Search the company database for market data, competitor info, or financial records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "database": {"type": "string", "enum": ["market", "competitors", "financial"]},
                "limit": {"type": "integer", "description": "Max results", "default": 10},
            },
            "required": ["query", "database"],
        },
    },
    {
        "name": "read_file",
        "description": "Read the contents of a research file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Path to the file"}
            },
            "required": ["file_path"],
        },
    },
    # ... more tools totaling ~2,000 tokens in definitions
]

KNOWLEDGE_BASE = """## Company Policies and Guidelines
[... reference material totaling ~3,000 tokens ...]"""

def run_agent(user_query: str):
    messages = [{"role": "user", "content": user_query}]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=f"{SYSTEM_PROMPT}\n\n{KNOWLEDGE_BASE}",
        tools=TOOLS,
        messages=messages,
    )
    # Log token usage
    print(f"Input tokens: {response.usage.input_tokens}")
    return response
```

After (with prompt caching):
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a research assistant specializing in market analysis.
You have access to tools for searching databases, reading files, and performing
calculations. Always cite your sources and provide confidence levels for your
findings. [... detailed instructions totaling ~1,500 tokens ...]"""

KNOWLEDGE_BASE = """## Company Policies and Guidelines
[... reference material totaling ~3,000 tokens ...]"""

TOOLS = [
    {
        "name": "search_database",
        "description": "Search the company database for market data, competitor info, or financial records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "database": {"type": "string", "enum": ["market", "competitors", "financial"]},
                "limit": {"type": "integer", "description": "Max results", "default": 10},
            },
            "required": ["query", "database"],
        },
    },
    {
        "name": "read_file",
        "description": "Read the contents of a research file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Path to the file"}
            },
            "required": ["file_path"],
        },
        "cache_control": {"type": "ephemeral"},  # Cache breakpoint after tools
    },
]

def run_agent_cached(user_query: str):
    messages = [{"role": "user", "content": user_query}]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
            },
            {
                "type": "text",
                "text": KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"},  # Cache breakpoint
            },
        ],
        tools=TOOLS,
        messages=messages,
    )
    # Log token usage — now includes cache metrics
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    return response
```

The key changes:
- The `system` parameter becomes a list of content blocks (instead of a plain string) so you can attach `cache_control` to specific blocks.
- Add `"cache_control": {"type": "ephemeral"}` at your desired breakpoints.
- The last tool definition gets a `cache_control` block to cache all tool definitions as well.
- Monitor the new `cache_read_input_tokens` and `cache_creation_input_tokens` fields in the response.
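To confirm that caching is actually paying off in production, those usage fields can be converted into dollars per request. Here's a minimal sketch — the `RATES` values are the Sonnet prices used in this article, and the `input_cost` helper is illustrative, not part of the SDK (note that `input_tokens` counts only non-cached input):

```python
from types import SimpleNamespace  # stand-in for response.usage in this sketch

# Claude Sonnet prices used in this article ($ per million tokens).
RATES = {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75}

def input_cost(usage) -> float:
    """Estimate the dollar cost of one request's input tokens, cache-aware.

    `usage` is any object exposing input_tokens, cache_read_input_tokens,
    and cache_creation_input_tokens (as response.usage does).
    """
    return (
        usage.input_tokens * RATES["input"]
        + (usage.cache_read_input_tokens or 0) * RATES["cache_read"]
        + (usage.cache_creation_input_tokens or 0) * RATES["cache_write"]
    ) / 1e6

# Example: 1,000 fresh tokens plus a 6,500-token cache hit.
usage = SimpleNamespace(
    input_tokens=1_000,
    cache_read_input_tokens=6_500,
    cache_creation_input_tokens=0,
)
print(f"${input_cost(usage):.6f}")  # well under a cent per request
```

Logging this per call makes cache regressions (for example, an accidentally changed system prompt) show up immediately as a cost spike.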
Cache Lifetime Considerations
The 5-minute TTL has practical implications:
- Interactive agents (responding to users in real-time): Cache stays warm naturally since calls happen within seconds of each other.
- Batch agents (processing queues): If there’s more than 5 minutes between tasks, you’ll pay the cache write fee again. Consider batching tasks to keep the cache warm.
- Multi-tenant systems: Cache is scoped to exact content matches. Two agents with identical system prompts share the cache benefit, but even a one-character difference means a cache miss.
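For batch workloads, it can help to check whether the cache is likely still warm before deciding how to group the next tasks. A minimal sketch of the idea — the `CacheWarmthTracker` class is my own illustration, using the 5-minute TTL from above:

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # the ephemeral cache's 5-minute TTL

class CacheWarmthTracker:
    """Track the time of the last API call to predict cache hits vs. re-writes."""

    def __init__(self):
        self.last_call_at: float | None = None

    def record_call(self) -> None:
        """Call this after each API request; every hit resets the TTL."""
        self.last_call_at = time.monotonic()

    def cache_is_warm(self) -> bool:
        """True if the next call should hit the cache (TTL not yet expired)."""
        if self.last_call_at is None:
            return False  # the first call always pays the write fee
        return (time.monotonic() - self.last_call_at) < CACHE_TTL_SECONDS
```

A queue worker could use this to drain related tasks back-to-back while `cache_is_warm()` is true, instead of interleaving unrelated work and paying repeated cache writes.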
Section 3: Tool Result Caching
Prompt caching reduces token costs. Tool result caching reduces execution costs and latency — especially when your tools call expensive external APIs, run database queries, or perform complex computations.
When to Cache Tool Results
The key question is idempotency: does calling the tool with the same arguments always return the same result?
| Tool Type | Cacheable? | Examples |
|---|---|---|
| Database lookups (read-only) | ✅ Yes | search_database("revenue 2024") |
| File reads | ✅ Yes | read_file("report.pdf") |
| Static API calls | ✅ Yes | get_exchange_rate("USD", "EUR") (short TTL) |
| Write operations | ❌ No | send_email(...), update_record(...) |
| Real-time data | ⚠️ Carefully | get_stock_price(...) (very short TTL) |
Cache Key Design
A good cache key uniquely identifies the exact call. Hash the tool name + serialized arguments:
```python
import hashlib
import json
import time
from functools import wraps
from typing import Any, Optional

class ToolResultCache:
    """In-memory cache for tool results with TTL support."""

    def __init__(self):
        self._cache: dict[str, dict[str, Any]] = {}
        self.hits = 0
        self.misses = 0

    def _make_key(self, tool_name: str, arguments: dict) -> str:
        """Create a deterministic cache key from tool name and arguments."""
        key_data = json.dumps(
            {"tool": tool_name, "args": arguments},
            sort_keys=True,
        )
        return hashlib.sha256(key_data.encode()).hexdigest()
```
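The same keying idea can be packaged as a decorator on idempotent tool functions, so the cache check happens before the expensive call ever runs. A sketch under the same assumptions — the `cached_tool` decorator, its `ttl` parameter, and the `search_database` stub are illustrative, not part of any library:

```python
import functools
import hashlib
import json
import time

def cached_tool(ttl: float = 300.0):
    """Memoize an idempotent tool function, keyed on its serialized arguments."""
    def decorator(fn):
        store: dict[str, tuple[float, object]] = {}

        @functools.wraps(fn)
        def wrapper(**kwargs):
            # Same deterministic key scheme as above: tool name + sorted args.
            key_data = json.dumps({"tool": fn.__name__, "args": kwargs}, sort_keys=True)
            key = hashlib.sha256(key_data.encode()).hexdigest()
            entry = store.get(key)
            if entry is not None and time.monotonic() - entry[0] < ttl:
                return entry[1]  # cache hit: skip the real call
            result = fn(**kwargs)
            store[key] = (time.monotonic(), result)
            return result
        return wrapper
    return decorator

CALL_COUNT = {"n": 0}  # count real executions to show the cache working

@cached_tool(ttl=60.0)
def search_database(query: str, database: str, limit: int = 10):
    CALL_COUNT["n"] += 1
    # Hypothetical expensive lookup; replace with a real query.
    return [f"{database} result for {query!r}"]

search_database(query="revenue 2024", database="market")  # executes the lookup
search_database(query="revenue 2024", database="market")  # served from cache
print(CALL_COUNT["n"])  # prints 1: only the first call ran
```

Only read-only, idempotent tools should get this decorator; applying it to a write operation would silently drop repeated writes.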
---
## Related Articles
- [Agent Cost Optimization: A Practical Guide to Reducing API Spend](/blog/agent-cost-optimization-a-practical-guide-to-reducing-api-spend/)
- [Agent Error Recovery: 5 Patterns for Production Reliability](/blog/agent-error-recovery-patterns/)
- [Tool Use Patterns: Building Reliable Agent-Tool Interfaces](/blog/agent-tool-use-patterns/)