Agent Cost Optimization: A Practical Guide to Reducing API Spend
Your agent works great. It handles 200 requests per day, and users are happy. Then you check the API bill: $3,400 this month. You dig into the numbers and realize the agent makes an average of 12 API calls per request, each with a 4,000-token system prompt. That’s 9.6 million input tokens per day just for system prompts. At $3 per million tokens, that’s $864/month on repeated content alone.
Cost is the number one reason production agent deployments get scaled back or killed entirely. Optimization isn’t premature — it’s survival. The good news: most agent deployments have 50–80% cost reduction available through straightforward changes that don’t require rearchitecting your entire system.
In this guide, you’ll learn seven cost reduction strategies, ordered by ROI (Return on Investment). Start at the top and work your way down until you hit your budget target. Each section includes concrete numbers so you can estimate savings before writing a single line of code.
1. Token Accounting: Know Where Your Money Goes
You can’t optimize what you don’t measure. Before changing anything, build a complete picture of where every dollar goes in your agent workflow.
Input Token Breakdown
Every API call to Claude includes several token categories on the input side:
- System prompt — Instructions, persona, constraints. Often 1,000–5,000 tokens and repeated on every call.
- Tool definitions — JSON schemas for each tool the agent can use. 10 tools can easily consume 2,000–3,000 tokens.
- Conversation history — All prior messages in the conversation. Grows with each step.
- Tool results — Outputs from previous tool calls injected back into context. Can be massive (full web pages, database results).
Output Token Breakdown
Output tokens are 5x more expensive than input tokens across the Claude tiers, making them a critical optimization target:
- Agent reasoning — Internal chain-of-thought (especially with extended thinking).
- Tool call generation — The JSON for tool invocations.
- Final response — The user-facing answer.
Per-Task Cost Calculation
The total cost for a single agent task is:
Total Cost = Σ (input_tokens × input_price + output_tokens × output_price) for each API call in the taskCost Breakdown Example
Here’s a realistic breakdown for a 10-step research agent task using Claude Sonnet ($3/M input, $15/M output):
| Step | Component | Input Tokens | Output Tokens | Input Cost | Output Cost | Total |
|---|---|---|---|---|---|---|
| 1 | Initial planning | 4,200 | 800 | $0.0126 | $0.0120 | $0.025 |
| 2 | Web search call | 4,800 | 200 | $0.0144 | $0.0030 | $0.017 |
| 3 | Process search results | 8,500 | 600 | $0.0255 | $0.0090 | $0.035 |
| 4 | Deep reading (page 1) | 12,000 | 500 | $0.0360 | $0.0075 | $0.044 |
| 5 | Deep reading (page 2) | 15,200 | 500 | $0.0456 | $0.0075 | $0.053 |
| 6 | Follow-up search | 16,800 | 200 | $0.0504 | $0.0030 | $0.053 |
| 7 | Process results | 20,100 | 600 | $0.0603 | $0.0090 | $0.069 |
| 8 | Synthesis | 22,500 | 1,200 | $0.0675 | $0.0180 | $0.086 |
| 9 | Verification | 24,000 | 400 | $0.0720 | $0.0060 | $0.078 |
| 10 | Final response | 25,500 | 1,500 | $0.0765 | $0.0225 | $0.099 |
| Total | | 153,600 | 6,500 | $0.461 | $0.098 | $0.558 |
Notice how input costs dominate, and they grow with each step as conversation history accumulates. Steps 8–10 account for 47% of total cost despite being only 30% of the steps.
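The table's totals can be reproduced in a few lines of Python, using the per-step token counts from the table and Sonnet's $3/$15-per-million pricing:

```python
# Per-step (input_tokens, output_tokens) pairs from the breakdown table above
steps = [
    (4_200, 800), (4_800, 200), (8_500, 600), (12_000, 500), (15_200, 500),
    (16_800, 200), (20_100, 600), (22_500, 1_200), (24_000, 400), (25_500, 1_500),
]
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00  # Sonnet, dollars per million tokens

total_in = sum(i for i, _ in steps)
total_out = sum(o for _, o in steps)
cost = total_in / 1e6 * INPUT_PRICE + total_out / 1e6 * OUTPUT_PRICE
print(total_in, total_out, round(cost, 3))  # 153600 6500 0.558
```

Running this kind of sanity check against your own traces is a quick way to catch pricing or accounting mistakes before they compound.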
Cost Tracking Code
Implement a tracking wrapper from day one:
```python
import anthropic
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CostRecord:
    step: int
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0
    cache_creation_tokens: int = 0
    input_cost: float = 0.0
    output_cost: float = 0.0
    total_cost: float = 0.0
    duration_ms: float = 0.0


# Pricing per million tokens (as of early 2026)
MODEL_PRICING = {
    "claude-haiku": {"input": 0.25, "output": 1.25, "cache_read": 0.025, "cache_write": 0.30},
    "claude-sonnet": {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75},
    "claude-opus": {"input": 15.00, "output": 75.00, "cache_read": 1.50, "cache_write": 18.75},
}


class BudgetExceededError(Exception):
    pass


@dataclass
class TaskCostTracker:
    task_id: str
    budget_limit: Optional[float] = None
    records: list = field(default_factory=list)
    total_cost: float = 0.0

    def record_call(self, step: int, model: str, usage) -> CostRecord:
        pricing = MODEL_PRICING.get(model, MODEL_PRICING["claude-sonnet"])

        input_cost = (usage.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (usage.output_tokens / 1_000_000) * pricing["output"]
        cache_read_cost = (getattr(usage, "cache_read_input_tokens", 0) / 1_000_000) * pricing["cache_read"]
        cache_write_cost = (getattr(usage, "cache_creation_input_tokens", 0) / 1_000_000) * pricing["cache_write"]

        total = input_cost + output_cost + cache_read_cost + cache_write_cost

        record = CostRecord(
            step=step,
            model=model,
            input_tokens=usage.input_tokens,
            output_tokens=usage.output_tokens,
            cache_read_tokens=getattr(usage, "cache_read_input_tokens", 0),
            cache_creation_tokens=getattr(usage, "cache_creation_input_tokens", 0),
            input_cost=input_cost + cache_read_cost + cache_write_cost,
            output_cost=output_cost,
            total_cost=total,
        )

        self.records.append(record)
        self.total_cost += total

        print(f"  Step {step} [{model}]: {usage.input_tokens} in / {usage.output_tokens} out "
              f"= ${total:.4f} (cumulative: ${self.total_cost:.4f})")

        if self.budget_limit and self.total_cost > self.budget_limit:
            raise BudgetExceededError(
                f"Task {self.task_id} exceeded budget: ${self.total_cost:.4f} > ${self.budget_limit:.4f}"
            )

        return record

    def summary(self) -> dict:
        return {
            "task_id": self.task_id,
            "total_steps": len(self.records),
            "total_input_tokens": sum(r.input_tokens for r in self.records),
            "total_output_tokens": sum(r.output_tokens for r in self.records),
            "total_cost": self.total_cost,
            "cost_by_model": self._cost_by_model(),
        }

    def _cost_by_model(self) -> dict:
        by_model = {}
        for r in self.records:
            if r.model not in by_model:
                by_model[r.model] = {"calls": 0, "cost": 0.0}
            by_model[r.model]["calls"] += 1
            by_model[r.model]["cost"] += r.total_cost
        return by_model
```

Start tracking today, even before optimizing. You need baseline numbers to measure improvement.
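As a quick sanity check on the cache prices in `MODEL_PRICING`, here is the cost of a single Sonnet call where most of the prompt is served from cache versus the same call with no caching. The `call_cost` helper and the token counts are illustrative, not part of any SDK:

```python
# Sonnet prices, dollars per million tokens (matching MODEL_PRICING above)
PRICING = {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75}

def call_cost(input_tokens, output_tokens, cache_read=0, cache_write=0, p=PRICING):
    """Dollar cost of one API call, given per-million token prices."""
    return (input_tokens * p["input"] + output_tokens * p["output"]
            + cache_read * p["cache_read"] + cache_write * p["cache_write"]) / 1_000_000

# 500 fresh input tokens, 4,000 tokens read from cache, 300 output tokens
cached = call_cost(500, 300, cache_read=4_000)
# Same call with no caching: all 4,500 input tokens at the full input price
uncached = call_cost(4_500, 300)
print(round(cached, 4), round(uncached, 4))  # 0.0072 0.018
```

Cache reads at $0.30/M versus $3.00/M for fresh input are why the tracker records cache tokens separately: a large, repeated system prompt served from cache costs a tenth of its uncached price.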
2. Model Selection by Task
This is the single biggest cost lever available to you. Not every agent step requires your most powerful model.
Model Tiers and Pricing
Using the Claude family as reference:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Haiku | $0.25 | $1.25 | Classification, routing, extraction, simple formatting |
| Sonnet | $3.00 | $15.00 | Complex tool use, research, synthesis, general-purpose |
| Opus | $15.00 | $75.00 | Complex planning, nuanced reasoning, high-stakes decisions |
Haiku is 12x cheaper than Sonnet on input and 12x cheaper on output. Sonnet is 5x cheaper than Opus across the board.
Per-Role Recommendations
Match each agent role to the cheapest model that meets quality requirements:
- Router/classifier agents → Haiku. “Is this a billing question or a technical question?” doesn’t need Sonnet. Haiku handles classification with >95% accuracy for well-defined categories.
- Data extraction agents → Haiku. Pulling structured fields from text, parsing dates, extracting entities — Haiku excels here.
- Worker agents (tool use) → Sonnet. Complex multi-step tool orchestration, research synthesis, and nuanced responses benefit from Sonnet’s capabilities.
- Orchestrator/planner agents → Sonnet or Opus. Planning quality directly impacts total cost (fewer steps = fewer API calls), so investing in a better planner can pay for itself.
- Verification agents → Sonnet for most cases; Opus for high-stakes verification where errors are expensive.
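These recommendations can be captured as a simple routing table. The role names and the fallback choice below are illustrative; adapt them to your own agent architecture:

```python
# Map each agent role to the cheapest model that meets its quality bar.
# Role names are illustrative, not from any SDK.
MODEL_BY_ROLE = {
    "router": "claude-haiku",
    "extractor": "claude-haiku",
    "worker": "claude-sonnet",
    "planner": "claude-opus",      # better planning means fewer steps overall
    "verifier": "claude-sonnet",
}

def model_for(role: str) -> str:
    """Return the model tier for a role, defaulting to Sonnet for unknown roles."""
    return MODEL_BY_ROLE.get(role, "claude-sonnet")

print(model_for("router"), model_for("unknown-role"))
```

Centralizing the mapping in one place makes it easy to run A/B tests per role: swap one entry, compare quality metrics, and keep the cheaper model if quality holds.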
Savings Calculation
Consider an agent handling 200 requests/day with this call pattern per request:
| Step | Role | Calls | Input Tokens | Output Tokens |
|---|---|---|---|---|
| Route | Classification | 1 | 1,000 | 50 |
| Extract | Data extraction | 2 | 2,000 | 300 |
| Process | Complex reasoning | 6 | 8,000 | 2,000 |
| Verify | Quality check | 1 | 3,000 | 200 |
All Sonnet (baseline):
- Input: (1,000 + 4,000 + 48,000 + 3,000) × 200 = 11.2M tokens/day → $33.60/day
- Output: (50 + 600 + 12,000 + 200) × 200 = 2.57M tokens/day → $38.55/day
- Total: $72.15/day = $2,165/month
Mixed model approach:
- Route (Haiku): 200K input → $0.05, 10K output → $0.0125
- Extract (Haiku): 800K input → $0.20, 120K output → $0.15
- Process (Sonnet): 9.6M input → $28.80, 2.4M output → $36.00
- Verify (Sonnet): 600K input → $1.80, 40K output → $0.60
- Total: $67.61/day = $2,028/month
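The daily totals above can be reproduced programmatically. The `plan` structure and `daily_cost` helper are illustrative; the token counts, call counts, and prices come from the tables in this section (200 requests/day):

```python
# (input_price, output_price) in dollars per million tokens
PRICES = {"haiku": (0.25, 1.25), "sonnet": (3.00, 15.00)}
REQUESTS_PER_DAY = 200

# (step, calls, input_tokens_per_call, output_tokens_per_call, model)
plan = [
    ("route",   1, 1_000,    50, "haiku"),
    ("extract", 2, 2_000,   300, "haiku"),
    ("process", 6, 8_000, 2_000, "sonnet"),
    ("verify",  1, 3_000,   200, "sonnet"),
]

def daily_cost(steps, force_model=None):
    """Total daily cost; force_model overrides per-step model choices."""
    total = 0.0
    for _, calls, inp, out, model in steps:
        pi, po = PRICES[force_model or model]
        total += calls * REQUESTS_PER_DAY * (inp * pi + out * po) / 1_000_000
    return total

baseline = daily_cost(plan, force_model="sonnet")
mixed = daily_cost(plan)
print(round(baseline, 2), round(mixed, 2))  # 72.15 67.61
```

Scripting the comparison this way lets you test "what if" scenarios, such as moving the extraction step to Sonnet or part of the processing step to Haiku, before touching production.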
That’s only a 6% savings, because the Sonnet-powered Process step dominates the bill. But in practice, once you’ve verified Haiku’s accuracy on the routing and extraction steps, you can start experimenting with moving the simpler parts of the Process step down-tier as well, and that is where the large reductions live.