March 6, 2026

AI Agent Caching Strategies: Cut Costs Without Cutting Quality

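Your research agent runs 50 tasks a day. Every task starts with the same 3,000-token system prompt, the same 2,500 tokens of tool definitions, and the same 5,000-token knowledge base. That is 10,500 tokens repeated 50 times a day: 525,000 tokens a day that the model reads in exactly the same way. At standard pricing, you pay full price to process identical content every time.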

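Caching is the lowest-effort, highest-return optimization for production agents. Prompt caching alone cuts the input token cost of cached content by 90%. Combined with tool result caching, it can often halve total agent cost without sacrificing quality, accuracy, or flexibility.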

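In this article you will learn the three caching layers for agents (prompt caching, tool result caching, and response memoization), with implementation patterns using the Claude API and a clear cost analysis to guide your decisions.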

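Section 1: Anatomy of Agent Costs

Before optimizing, you need to know where your tokens actually go. Let's break down a typical agent workflow.

Token Distribution per Turn

In a standard agent architecture, every API call to the model includes: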

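Component	Approx. tokens	Repeated?
System prompt	~1,500	Every turn
Tool definitions	~2,000	Every turn
Knowledge base / context	~3,000	Every turn
Conversation history	Grows	Cumulative
Tool results	~500 (average)	Every tool call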

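The static overhead (system prompt, tool definitions, and knowledge base) adds up to about 6,500 tokens, and it is identical on every call.

The Compounding Cost Problem

Agents do not make a single API call; they run in a loop. A typical 15-step agent run looks like this:

Without caching:

Each of the 15 steps sends the full context. Assume the conversation history grows by about 300 tokens per step on average: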

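Static tokens per step: 6,500
Static tokens across 15 steps: 97,500
Conversation history tokens (cumulative): ~33,750
Tool result tokens: ~7,500
Total input tokens: ~138,750

At Claude Sonnet's input price of $3 per million tokens, each task costs about $0.42. At 50 tasks a day, input tokens alone cost about $21 a day.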
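
The arithmetic above can be reproduced with a short script. This is a minimal sketch rather than part of the agent itself: the function names and parameters are illustrative, and the history term assumes the ~33,750 figure comes from growth × steps² / 2.

```python
PRICE_INPUT_PER_MTOK = 3.00  # Claude Sonnet input price used in the article


def uncached_input_tokens(steps=15, static_tokens=6_500,
                          history_growth=300, tool_result_tokens=500):
    """Return the input-token breakdown for one agent run without caching."""
    static_total = static_tokens * steps
    # Assumption: cumulative history is approximated as growth * steps^2 / 2,
    # which reproduces the ~33,750 figure quoted above.
    history_total = history_growth * steps * steps // 2
    tool_total = tool_result_tokens * steps
    return {
        "static": static_total,
        "history": history_total,
        "tools": tool_total,
        "total": static_total + history_total + tool_total,
    }


def input_cost_usd(tokens, price_per_mtok=PRICE_INPUT_PER_MTOK):
    """Convert a token count to dollars at a per-million-token price."""
    return tokens * price_per_mtok / 1_000_000


breakdown = uncached_input_tokens()
per_task = input_cost_usd(breakdown["total"])
print(breakdown)
print(f"per task: ${per_task:.2f}, per day (50 tasks): ${per_task * 50:.2f}")
```

Changing `steps` or `static_tokens` shows how quickly the static share dominates as runs get longer.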

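With prompt caching:

Cached tokens are read at $0.30 per million (a 90% discount). If the 6,500 static tokens are cached:

Cached token reads across 15 steps: 97,500 × $0.30/M = $0.029
Uncached tokens at standard pricing: ~41,250 × $3/M = $0.124
Cache write (first step): 6,500 × $3.75/M = $0.024
Total input cost: about $0.18 per task

Caching alone cuts input cost by 57%. At 50 tasks a day, that saves about $12 a day, or roughly $360 a month, with minimal code changes.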
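
The comparison can be checked the same way. The sketch below follows the article's accounting exactly (15 cached reads plus one write; strictly, the first step writes instead of reading, which would make the cached figure slightly lower). The constant and function names are illustrative.

```python
PRICE_INPUT = 3.00        # $ per million input tokens (standard)
PRICE_CACHE_READ = 0.30   # $ per million tokens read from cache
PRICE_CACHE_WRITE = 3.75  # $ per million tokens written to cache


def cached_task_cost(steps=15, static_tokens=6_500, uncached_tokens=41_250):
    """Input cost of one run with the static prefix cached."""
    read = static_tokens * steps * PRICE_CACHE_READ / 1_000_000
    regular = uncached_tokens * PRICE_INPUT / 1_000_000
    write = static_tokens * PRICE_CACHE_WRITE / 1_000_000
    return read + regular + write


def uncached_task_cost(steps=15, static_tokens=6_500, uncached_tokens=41_250):
    """Input cost of the same run with every token at standard pricing."""
    return (static_tokens * steps + uncached_tokens) * PRICE_INPUT / 1_000_000


cached = cached_task_cost()
baseline = uncached_task_cost()
reduction = 1 - cached / baseline
print(f"cached: ${cached:.3f}, baseline: ${baseline:.3f}, "
      f"reduction: {reduction:.0%}")
```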

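Section 2: Implementing Prompt Caching with Claude

Claude's prompt caching API is the single highest-impact optimization you can apply to an agent workflow. It lets you tell the API: "This part of my prompt has not changed; please reuse the cached computation."

How Prompt Caching Works

When you send a request with caching enabled, Anthropic's infrastructure:

Checks whether a cached version of the marked prefix exists
On a cache miss: processes the full prompt, caches the prefix, and charges the cache write rate (25% above standard input pricing)
On a cache hit: reuses the cached computation and charges only the cache read rate (90% below standard input pricing)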
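
Because a write costs more than a normal read, it is worth checking how many requests it takes to come out ahead. The sketch below works in units of "one standard-price read of the prefix" and assumes every request after the first arrives while the cache is still valid; the names are illustrative.

```python
WRITE_MULTIPLIER = 1.25  # cache write: 25% above standard input price
READ_MULTIPLIER = 0.10   # cache read: 90% below standard input price


def relative_cost_with_cache(requests: int) -> float:
    """Prefix cost for `requests` calls: one write, then cache reads."""
    if requests < 1:
        return 0.0
    return WRITE_MULTIPLIER + READ_MULTIPLIER * (requests - 1)


def break_even_requests(limit: int = 100) -> int:
    """Smallest number of requests at which caching is cheaper than not caching."""
    for n in range(1, limit + 1):
        # Without caching, n requests cost n units.
        if relative_cost_with_cache(n) < n:
            return n
    raise ValueError("caching never pays off within the given limit")


print(break_even_requests())
```

Under these assumptions the write premium is recovered on the second request, so even short agent loops benefit.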

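The cache has a TTL (time to live) of 5 minutes: the period before cached data expires. Each cache hit resets the TTL, so an active agent naturally keeps its cache warm. Batch jobs whose calls are more than 5 minutes apart will pay cache write costs more often.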
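
The TTL behavior is easy to reason about with a small simulation. This is a simplified local model, not a call to the API: it takes request timestamps in seconds and labels each one as a cache write or a hit, refreshing the expiry on every request.

```python
TTL_SECONDS = 5 * 60  # 5-minute TTL described above


def simulate_cache(call_times: list[float], ttl: float = TTL_SECONDS) -> list[str]:
    """Label each call as 'write' (miss) or 'hit' for a single cached prefix."""
    events = []
    expires_at = None
    for t in call_times:
        if expires_at is not None and t < expires_at:
            events.append("hit")
        else:
            events.append("write")
        # Both a write and a hit restart the TTL window.
        expires_at = t + ttl
    return events


# An active agent: calls a minute or two apart stay warm.
print(simulate_cache([0, 60, 120, 500]))
# A batch job calling every 6 minutes never hits the cache.
print(simulate_cache([0, 360, 720]))
```

If a workload looks like the second case, grouping calls closer together is one way to get cache hits.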

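What to Cache

Not everything can or should be cached. Cache stable, repeated prefixes:

System prompt: almost always identical across runs. Cache this first.
Tool definitions: your tool schemas rarely change between calls.
Static knowledge base: reference documents, guides, policies.
Conversation prefix: in multi-turn conversations, cache the earlier turns that will not change.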
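
For the conversation-prefix case, one common pattern is to move a single breakpoint to the end of the latest message before each call. The helper below is a sketch under assumptions: its name is hypothetical, and it expects messages as plain dicts whose content is either a string or a list of dict blocks.

```python
import copy


def mark_conversation_prefix(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with one cache breakpoint on the last block."""
    marked = copy.deepcopy(messages)  # leave the caller's history untouched
    # Remove earlier markers so the request keeps a single message breakpoint.
    for msg in marked:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)
    if not marked:
        return marked
    last = marked[-1]
    # cache_control attaches to content blocks, so wrap plain strings.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    if last["content"]:
        last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "Summarize Q3 competitor pricing."},
    {"role": "assistant", "content": [{"type": "text", "text": "Here is a summary."}]},
    {"role": "user", "content": "Now compare it with Q2."},
]
request_messages = mark_conversation_prefix(history)
print(request_messages[-1]["content"][-1])
```

The returned list would be passed as `messages` in the next request, so that everything up to the marked block can be read from cache on the following turn.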

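Important limitations:

Caching is prefix-based: you can only cache content in order from the start of the prompt. You cannot cache a section in the middle without also caching everything before it.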
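The minimum cacheable length depends on the model: 1,024 tokens for Claude Sonnet 4 (the model used in this article), and more for some others, such as 2,048 for Claude Haiku 3.5. Check the current documentation for your model.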
A single request can set at most 4 cache breakpoints.
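
The breakpoint limit can be checked locally before a request is sent. The helper names below are hypothetical, and the sketch assumes tools, system blocks, and message content blocks are plain dicts as in the examples in this article.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above


def count_breakpoints(tools: list[dict], system: list[dict],
                      messages: list[dict]) -> int:
    """Count cache_control markers across tools, system blocks, and messages."""
    count = sum(1 for tool in tools if "cache_control" in tool)
    count += sum(1 for block in system if "cache_control" in block)
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # string content cannot carry a marker
            count += sum(1 for block in content
                         if isinstance(block, dict) and "cache_control" in block)
    return count


def check_breakpoints(tools, system, messages) -> int:
    """Raise before sending if the request exceeds the breakpoint limit."""
    n = count_breakpoints(tools, system, messages)
    if n > MAX_BREAKPOINTS:
        raise ValueError(f"{n} cache breakpoints set; the limit is {MAX_BREAKPOINTS}")
    return n


marker = {"type": "ephemeral"}
tools = [{"name": "search_database"}, {"name": "read_file", "cache_control": marker}]
system = [{"type": "text", "text": "instructions"},
          {"type": "text", "text": "knowledge base", "cache_control": marker}]
messages = [{"role": "user", "content": "hello"}]
print(check_breakpoints(tools, system, messages))
```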

Implementation

Here is a complete example showing an agent setup with prompt caching:

Before (no caching):

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a research assistant specializing in market analysis.
You have access to tools for searching databases, reading files, and performing
calculations. Always cite your sources and provide confidence levels for your
findings. [... detailed instructions totaling ~1,500 tokens ...]"""

TOOLS = [
    {
        "name": "search_database",
        "description": "Search the company database for market data, competitor info, or financial records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "database": {"type": "string", "enum": ["market", "competitors", "financial"]},
                "limit": {"type": "integer", "description": "Max results", "default": 10}
            },
            "required": ["query", "database"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a research file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Path to the file"}
            },
            "required": ["file_path"]
        }
    },
    # ... more tools totaling ~2,000 tokens in definitions
]

KNOWLEDGE_BASE = """## Company Policies and Guidelines
[... reference material totaling ~3,000 tokens ...]"""

def run_agent(user_query: str):
    messages = [{"role": "user", "content": user_query}]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=f"{SYSTEM_PROMPT}\n\n{KNOWLEDGE_BASE}",
        tools=TOOLS,
        messages=messages,
    )
    # Log token usage
    print(f"Input tokens: {response.usage.input_tokens}")
    return response

After (with prompt caching):

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a research assistant specializing in market analysis.
You have access to tools for searching databases, reading files, and performing
calculations. Always cite your sources and provide confidence levels for your
findings. [... detailed instructions totaling ~1,500 tokens ...]"""

KNOWLEDGE_BASE = """## Company Policies and Guidelines
[... reference material totaling ~3,000 tokens ...]"""

TOOLS = [
    {
        "name": "search_database",
        "description": "Search the company database for market data, competitor info, or financial records.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "database": {"type": "string", "enum": ["market", "competitors", "financial"]},
                "limit": {"type": "integer", "description": "Max results", "default": 10}
            },
            "required": ["query", "database"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a research file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Path to the file"}
            },
            "required": ["file_path"]
        },
        "cache_control": {"type": "ephemeral"}  # Cache breakpoint after tools
    },
]

def run_agent_cached(user_query: str):
    messages = [{"role": "user", "content": user_query}]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
            },
            {
                "type": "text",
                "text": KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"},  # Cache breakpoint
            },
        ],
        tools=TOOLS,
        messages=messages,
    )
    # Log token usage — now includes cache metrics
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    return response
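
The cache metrics logged above can be turned into a hit ratio and a cost estimate. This is a sketch with a hypothetical helper name; it assumes `input_tokens` counts only the tokens that were neither read from nor written to the cache, and it uses the Sonnet prices quoted earlier. A `SimpleNamespace` stands in for `response.usage` so the example runs without an API call.

```python
from types import SimpleNamespace


def summarize_cache_usage(usage, price_input=3.00, price_read=0.30,
                          price_write=3.75) -> dict:
    """Summarize one response's usage as totals, hit ratio, and input cost."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    regular = usage.input_tokens  # assumed to exclude cached tokens
    total = read + written + regular
    cost = (regular * price_input + read * price_read
            + written * price_write) / 1_000_000
    return {
        "total_input_tokens": total,
        "cache_hit_ratio": read / total if total else 0.0,
        "input_cost_usd": cost,
    }


# Stand-in for response.usage on a warm-cache turn.
usage = SimpleNamespace(input_tokens=500,
                        cache_read_input_tokens=6_500,
                        cache_creation_input_tokens=0)
print(summarize_cache_usage(usage))
```

A hit ratio that stays near zero across consecutive turns usually means the prefix is changing between calls or the calls are spaced further apart than the TTL.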

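Key changes:

The system parameter becomes a list of content blocks (instead of a plain string), so you can attach cache_control to specific blocks.
At each desired breakpoint, the block carries cache_control set to {"type": "ephemeral"}: here, the last tool definition and the knowledge base block.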
