How Much Are You Actually Spending on AI Coding Tokens?
Most developers optimize the wrong side of the equation. Here's the math on where your tokens actually go, and what moves the needle.
The bill you're not reading
If you use Claude Code, Cursor, Codex, or any AI coding agent on a real codebase, you're burning through tokens fast. A typical 30-minute Claude Code session on a medium project (50-100 files) consumes 200k-500k tokens.
At Opus pricing ($15/1M input, $75/1M output), that's $3-8 per session. Do 10 sessions a day and you're looking at $30-80/day. For a team of 5, that's $3,000-8,000/month.
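The arithmetic behind those figures can be checked with a short sketch. It uses the Opus prices quoted above; the `output_share` parameter and the 20 working days per month are assumptions introduced here for illustration, not measured values.

```python
# Prices quoted above, in dollars per million tokens.
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 75.0


def session_cost(total_tokens: int, output_share: float = 0.0) -> float:
    """Dollar cost of one session, given the fraction of tokens that are output."""
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    dollars = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return dollars / 1_000_000


def monthly_team_cost(cost_per_session: float, sessions_per_day: int = 10,
                      developers: int = 5, working_days: int = 20) -> float:
    """Scale one session up to a team-month (20 working days is an assumption)."""
    return cost_per_session * sessions_per_day * developers * working_days


low = session_cost(200_000)   # treats every token as input
high = session_cost(500_000)
print(f"per session: ${low:.2f} to ${high:.2f}")
print(f"team of 5, per month: ${monthly_team_cost(low):,.0f} to ${monthly_team_cost(high):,.0f}")
```

Setting `output_share` above zero raises the estimate, since each output token costs five times as much as an input token at these prices.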
But here's the part most people miss:
85-95% of your tokens are input, not output
Every time your agent reads a file, greps for a pattern, or explores the codebase, those tokens are input. The agent's replies (the code it writes, the explanations it gives) are output. Input dominates because agents read far more code than they write.
Where the tokens actually go
We instrumented a week of real Claude Code sessions across 3 projects (Python, TypeScript, Go). Here's the breakdown:
| Activity | Tokens | % of total |
|---|---|---|
| Reading files (Read, cat, head) | ~180k | 45% |
| Search results (Grep, Glob) | ~80k | 20% |
| Conversation context (prior turns) | ~60k | 15% |
| System prompt + instructions | ~40k | 10% |
| Agent output (code + explanations) | ~40k | 10% |
File reads and search results alone account for 65% of all tokens. These are the tokens where the agent pulls in entire files just to find a single function.
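The table counts tokens, not dollars. As a rough sketch, the same rows can be turned into both token shares and price-weighted shares. The token counts come from the table and the prices are the Opus prices quoted earlier; treating only the "agent output" row as output-priced is an assumption of this sketch.

```python
# Approximate token counts from the breakdown table above.
BREAKDOWN = {
    "file reads": 180_000,
    "search results": 80_000,
    "conversation context": 60_000,
    "system prompt": 40_000,
    "agent output": 40_000,
}
OUTPUT_ACTIVITIES = {"agent output"}  # assumption: every other row is billed as input


def token_shares(breakdown: dict[str, int]) -> dict[str, float]:
    """Each activity's fraction of all tokens."""
    total = sum(breakdown.values())
    return {name: count / total for name, count in breakdown.items()}


def cost_shares(breakdown: dict[str, int], input_price: float = 15.0,
                output_price: float = 75.0) -> dict[str, float]:
    """Each activity's fraction of the dollar cost, weighting output at its higher price."""
    costs = {
        name: count * (output_price if name in OUTPUT_ACTIVITIES else input_price)
        for name, count in breakdown.items()
    }
    total = sum(costs.values())
    return {name: cost / total for name, cost in costs.items()}


tokens = token_shares(BREAKDOWN)
dollars = cost_shares(BREAKDOWN)
for name in BREAKDOWN:
    print(f"{name:22s} {tokens[name]:6.1%} of tokens  {dollars[name]:6.1%} of dollars")
```

On these numbers, output takes a larger share of dollars than of tokens because of its higher price, but file reads plus search results still outweigh agent output in both views.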
The compression trap
When developers notice their token costs, the first instinct is output compression. Tools that make the agent reply more tersely. "Caveman mode." Shorter explanations. Telegraphic prose.
The math on this doesn't work out:
| Approach | Savings | Net impact on total tokens |
|---|---|---|
| Output compression (75% reduction) | 75% of output tokens | ~8% total savings |
| Input retrieval (94% reduction) | 94% of file-read and search tokens | ~60% total savings |
Output compression saves 75% of the 10% of your tokens that are output. That's 7.5% off the total.
Input retrieval saves 94% of the 65% of your tokens that come from file reads and search results. That's 61% off the total.
Output compression and input retrieval aren't competing approaches. They're complementary. But if you're only doing one, do the one that targets the 90% of your tokens that are input, not the 10% that are output.
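The two percentages above come from one multiplication: the reduction rate times the share of the total it applies to. A minimal sketch, with a function name chosen here for illustration:

```python
def net_savings(reduction: float, share: float) -> float:
    """Fraction of the total saved when `reduction` applies to a slice of size `share`."""
    return reduction * share


# Shares taken from the breakdown table: output is 10% of tokens,
# file reads plus search results are 65%.
output_compression = net_savings(reduction=0.75, share=0.10)
input_retrieval = net_savings(reduction=0.94, share=0.65)

print(f"output compression: {output_compression:.1%} of the total")
print(f"input retrieval:    {input_retrieval:.1%} of the total")
```

The `share` argument can be a token share or a dollar share, as long as the result is read in the same unit.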
Why agents read so many tokens
AI coding agents are surprisingly wasteful with file reads. When you ask "how does the auth flow work?", a typical agent will:
- Grep for "auth" across the project (returns 30+ matches)
- Read 3-5 full files that mention auth (800+ lines each)
- Read import chains to understand dependencies
- Read test files for usage examples
Total: 45,000+ tokens of input just to answer one question. The answer uses maybe 200 lines from 2 files. The other 95% of those tokens were noise the agent had to wade through.
What if the agent only got the 200 lines it needed?
That's the core idea behind semantic code indexing. Instead of reading entire files, the agent searches an index and gets back just the relevant functions, classes, and code blocks.
# Without indexing:
Agent reads payments.py (800 lines) = 12,000 tokens
Agent reads shipping.py (600 lines) = 9,000 tokens
Agent reads models.py (1200 lines) = 18,000 tokens
Agent reads test_payments.py (400 lines) = 6,000 tokens
Total: 45,000 tokens
# With semantic search:
context_search("payment flow")
→ process_payment() (40 lines) = 600 tokens
→ PaymentStatus class (15 lines) = 200 tokens
Total: 800 tokens (98% reduction)
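To make the idea concrete, here is a toy sketch of chunk-level retrieval for Python source. It is not how Code Context Engine works internally: it splits a module into functions and classes with the standard `ast` module and ranks them by plain keyword overlap, which stands in for real semantic (embedding-based) search. The sample module and the `extract_chunks` and `search` names are hypothetical.

```python
import ast

# Hypothetical module standing in for a real payments file.
SAMPLE = '''
class PaymentStatus:
    """Lifecycle states for a payment."""
    PENDING = "pending"
    PAID = "paid"


def process_payment(order, card):
    """Charge the card and mark the payment as paid."""
    return PaymentStatus.PAID


def format_address(address):
    """Render a shipping address on one line."""
    return ", ".join(address)
'''


def extract_chunks(source: str) -> list[dict]:
    """Split a module into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "code": ast.get_source_segment(source, node),
                "doc": ast.get_docstring(node) or "",
            })
    return chunks


def search(chunks: list[dict], query: str, limit: int = 2) -> list[dict]:
    """Rank chunks by how many query words appear in their name or docstring."""
    words = query.lower().split()

    def score(chunk: dict) -> int:
        haystack = (chunk["name"] + " " + chunk["doc"]).lower()
        return sum(word in haystack for word in words)

    ranked = sorted(chunks, key=score, reverse=True)
    return [chunk for chunk in ranked if score(chunk) > 0][:limit]


for chunk in search(extract_chunks(SAMPLE), "payment flow"):
    print(chunk["name"])
```

On this sample the query returns `PaymentStatus` and `process_payment` and leaves `format_address` out, so the agent would receive two short chunks instead of the whole file.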
This isn't theoretical. We benchmarked this against FastAPI (53 source files, 180K tokens) with 20 real coding questions:
| Metric | Result |
|---|---|
| Token reduction (full-file → chunks) | 94% |
| Recall@10 (found the right code) | 0.90 |
| Search latency (p50) | 0.4ms |
94% fewer input tokens with 90% recall. The agent finds the right code 9 out of 10 times, using 1/16th of the tokens.
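For readers who want to run a similar check on their own index, here is one way the two headline metrics could be computed. This is a sketch, not the benchmark's code: it assumes one expected chunk per question and counts a hit when that chunk appears in the top k results. The question strings and chunk names are made up.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 10) -> float:
    """Fraction of questions whose expected chunk appears in the top k results."""
    hits = sum(expected[question] in results[question][:k] for question in expected)
    return hits / len(expected)


def token_reduction(full_file_tokens: int, chunk_tokens: int) -> float:
    """Fraction of tokens avoided by sending chunks instead of whole files."""
    return 1 - chunk_tokens / full_file_tokens


# Hypothetical search results (ranked chunk names) and ground truth.
results = {
    "how are payments processed": ["process_payment", "PaymentStatus"],
    "where is the address formatted": ["PaymentStatus"],
}
expected = {
    "how are payments processed": "process_payment",
    "where is the address formatted": "format_address",
}

print(f"recall@10: {recall_at_k(results, expected):.2f}")
print(f"token reduction: {token_reduction(45_000, 800):.0%}")
```

The second call reuses the 45,000 and 800 token totals from the payments example earlier in the article.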
The full stack of savings
Token savings isn't a single technique. It's a pipeline. Each layer compounds on the previous one:
| Layer | What it does | Savings |
|---|---|---|
| 1. Retrieval | Full files → relevant chunks | 94% |
| 2. Chunk compression | Code chunks → signatures + docstrings | 89% |
| 3. Grammar compression | Drop articles, fillers from memory text | 13% |
| 4. Output compression | Terser agent replies | 25-75% |
Layers 1-3 are input savings (about 90% of your tokens). Layer 4 is output savings (about 10% of your tokens, but at 5x the per-token cost).
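"Compounds" here means each layer shrinks whatever the previous layer let through. A minimal sketch of that multiplication, under the idealized assumption that every layer applies to all of the remaining tokens (real sessions will save less wherever a layer only touches part of the input):

```python
def remaining_after(reductions: list[float]) -> float:
    """Fraction of the original tokens left after applying each reduction in turn."""
    remaining = 1.0
    for reduction in reductions:
        remaining *= 1.0 - reduction
    return remaining


# Layers 1 and 2 from the table: retrieval (94%), then chunk compression (89%).
left = remaining_after([0.94, 0.89])
print(f"{left:.2%} of the original tokens remain")
```

With those two rates, the first layer leaves 6% of the tokens and the second leaves 11% of that 6%.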
Real numbers from real projects
Here's what users see after a week of using semantic code indexing:
my-project · 247 queries · last query 5m ago
⛁ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ 88% tokens saved
Input savings 12.4M tokens $186.00
Output savings 48.2k tokens $3.62
──────────────────────────────────────────
Total saved 12.4M tokens $189.62
Breakdown:
retrieval 84% ▰▰▰▰▰▰▰▰▰▰ 10.4M $156.00 · 247 calls
chunk compression 3% ▰▱▱▱▱▱▱▱▱▱ 421.5k $6.32 · 247 calls
output compression* <1% ▰▱▱▱▱▱▱▱▱▱ 48.2k $3.62 · 312 calls
That's $189 saved in a week on a single project. Retrieval (the input side) accounts for $156 of that. Output compression adds $3.62. Both help, but the ratio is 43:1.
How to set this up (2 minutes)
This is implemented in Code Context Engine (CCE), an open-source MCP server that works with Claude Code, Cursor, VS Code/Copilot, Gemini CLI, and Codex.
uvx --from "code-context-engine[local]" cce init
One command. It indexes your codebase, registers the MCP server, and writes instruction files telling your agent to use context_search instead of reading files directly. No proxy, no API interception, no cloud. Everything runs locally.
After your next coding session:
cce savings
Shows exactly how many tokens and dollars you saved, broken down by layer.
What about provider caching?
Anthropic's prompt caching (90% discount on cache hits) is powerful, but it helps with repeated content across turns. It doesn't help with the first read, and it doesn't reduce what gets sent in the first place.
Semantic retrieval + provider caching is the strongest combination: you send fewer tokens (retrieval), and the tokens you do send are cached across turns (provider cache). They multiply.
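A rough model shows why the two multiply. The sketch below assumes the same context is resent on every turn, every turn after the first is a cache hit at the 90% discount, and there is no surcharge for writing to the cache. The ten-turn session is a made-up scenario; the token counts reuse the payments example from earlier.

```python
def input_cost(tokens_per_turn: int, turns: int, price_per_m: float = 15.0,
               cache_discount: float = 0.0) -> float:
    """Dollar cost of resending the same context each turn.

    The first turn is billed at full price; later turns get the cache discount.
    """
    first_turn = tokens_per_turn * price_per_m / 1_000_000
    later_turns = (turns - 1) * first_turn * (1.0 - cache_discount)
    return first_turn + later_turns


TURNS = 10  # hypothetical session length
baseline = input_cost(45_000, TURNS)                       # full files, no cache
cache_only = input_cost(45_000, TURNS, cache_discount=0.90)
retrieval_only = input_cost(800, TURNS)
both = input_cost(800, TURNS, cache_discount=0.90)

for label, cost in [("baseline", baseline), ("cache only", cache_only),
                    ("retrieval only", retrieval_only), ("both", both)]:
    print(f"{label:15s} ${cost:.4f}")
```

Retrieval shrinks `tokens_per_turn` and caching shrinks the price of the repeated turns, so the combined cost is the product of both reductions.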
The bottom line
If you're spending more than $50/month on AI coding:
- Check your input/output ratio. If input is 80%+, that's your optimization target.
- Semantic retrieval first. It targets the biggest slice of your bill (file reads) with the highest savings rate (94%).
- Output compression second. It helps, especially on models with expensive output (Opus: $75/1M output). But it's a multiplier on a smaller base.
- Use both together. Retrieval cuts input by 94%. Output compression cuts output by 25-75%. Together they cover the full bill.
Code Context Engine is MIT licensed. 170+ stars, 2,300+ monthly installs. Works with Claude Code, Cursor, VS Code/Copilot, Gemini CLI, OpenAI Codex, OpenCode, and Tabnine.