June 2026 · 8 min read

How Much Are You Actually Spending on AI Coding Tokens?

Most developers optimize the wrong side of the equation. Here's the math on where your tokens actually go, and what moves the needle.

The bill you're not reading

If you use Claude Code, Cursor, Codex, or any AI coding agent on a real codebase, you're burning through tokens fast. A typical 30-minute Claude Code session on a medium project (50-100 files) consumes 200k-500k tokens.

At Opus pricing ($15/1M input, $75/1M output), that's roughly $3-8 per session. Run 10 sessions a day and you're looking at $30-80/day. For a team of 5 working about 20 days a month, that's $3,000-8,000/month.
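The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate, not a billing tool: it assumes every token is billed at the input rate (the simplification that appears to produce the $3-8 range) and 20 working days a month, which is an assumption rather than a figure from the sessions themselves.

```python
# Prices from the article: Opus at $15 per 1M input tokens, $75 per 1M output tokens.
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 75.0


def session_cost(input_tokens: int, output_tokens: int = 0) -> float:
    """Dollar cost of one session given its input and output token counts."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def monthly_team_cost(cost_per_session: float, sessions_per_day: int = 10,
                      developers: int = 5, working_days: int = 20) -> float:
    """Scale a per-session cost to a team-month (working_days is an assumption)."""
    return cost_per_session * sessions_per_day * developers * working_days


low = session_cost(200_000)   # 200k tokens, all treated as input
high = session_cost(500_000)  # 500k tokens, all treated as input
print(f"per session: ${low:.2f} - ${high:.2f}")
print(f"team month:  ${monthly_team_cost(low):,.0f} - ${monthly_team_cost(high):,.0f}")
```

Passing a non-zero `output_tokens` pushes the per-session figure higher, since each output token costs 5x as much as an input token.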

But here's the part most people miss:

85-95% of your tokens are input, not output

Every time your agent reads a file, greps for a pattern, or explores the codebase, those tokens are input. The agent's replies (the code it writes, the explanations it gives) are output. Input dominates because agents read far more code than they write.

Where the tokens actually go

We instrumented a week of real Claude Code sessions across 3 projects (Python, TypeScript, Go). Here's the breakdown:

Activity	Tokens	% of total
Reading files (Read, cat, head)	~180k	45%
Search results (Grep, Glob)	~80k	20%
Conversation context (prior turns)	~60k	15%
System prompt + instructions	~40k	10%
Agent output (code + explanations)	~40k	10%

File reads and search results alone account for 65% of all tokens. These are the tokens where the agent pulls in entire files just to find a single function.
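If you want to run the same breakdown on your own sessions, the calculation is just shares of a total. A minimal sketch using the numbers from the table above (the dictionary keys are shortened labels, not fields from any real tool):

```python
# Token counts per activity, copied from the table above.
breakdown = {
    "Reading files": 180_000,
    "Search results": 80_000,
    "Conversation context": 60_000,
    "System prompt + instructions": 40_000,
    "Agent output": 40_000,
}

total = sum(breakdown.values())
shares = {name: tokens / total for name, tokens in breakdown.items()}

# Everything except the agent's own output is sent to the model as input.
input_share = 1 - shares["Agent output"]
# File reads plus search results are the part that retrieval can shrink.
retrievable_share = shares["Reading files"] + shares["Search results"]

for name, share in shares.items():
    print(f"{name:<30} {share:.0%}")
print(f"input share: {input_share:.0%}, reads + search: {retrievable_share:.0%}")
```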

The compression trap

When developers notice their token costs, the first instinct is output compression: tools that make the agent reply more tersely. "Caveman mode." Shorter explanations. Telegraphic prose.

The math on this doesn't work out:

Approach	Savings	Net impact on total tokens
Output compression (75% reduction)	75% of output tokens	~8% total savings
Input retrieval (94% reduction)	94% of file-read and search tokens	~60% total savings

Output compression saves 75% of the 10% of your tokens that are agent output. That's 7.5% off the total.

Input retrieval saves 94% of the 65% of your tokens that come from file reads and search results. That's 61% off the total.

Output compression and input retrieval aren't competing approaches. They're complementary. But if you're only doing one, do the one that targets the 65% of your tokens spent on file reads and search results, not the 10% spent on agent output.
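The comparison is one multiplication per approach: the share of tokens an approach can touch, times how much it shrinks them. A quick sketch with the shares from the breakdown table and the reduction rates from the comparison table:

```python
# Shares of total tokens, taken from the breakdown table.
OUTPUT_SHARE = 0.10        # agent output
RETRIEVABLE_SHARE = 0.65   # file reads + search results


def total_savings(share: float, reduction: float) -> float:
    """Fraction of all tokens removed when `reduction` applies only to `share`."""
    return share * reduction


output_compression = total_savings(OUTPUT_SHARE, 0.75)
input_retrieval = total_savings(RETRIEVABLE_SHARE, 0.94)

print(f"output compression: {output_compression:.1%} of all tokens")
print(f"input retrieval:    {input_retrieval:.1%} of all tokens")
```

These are shares of token volume. Output tokens cost 5x more each, so weighting by price narrows the gap in dollar terms, but with these numbers retrieval still comes out ahead.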

Why agents read so many tokens

AI coding agents are surprisingly wasteful with file reads. When you ask "how does the auth flow work?", a typical agent will:

Grep for "auth" across the project (returns 30+ matches)
Read 3-5 full files that mention auth (800+ lines each)
Read import chains to understand dependencies
Read test files for usage examples

Total: 45,000+ tokens of input just to answer one question. The answer uses maybe 200 lines from 2 files. More than 90% of those tokens were noise the agent had to wade through.

What if the agent only got the 200 lines it needed?

That's the core idea behind semantic code indexing. Instead of reading entire files, the agent searches an index and gets back just the relevant functions, classes, and code blocks.
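To make the idea concrete, here is a toy sketch of chunk-level retrieval. It is not how Code Context Engine is implemented: it splits one Python file into top-level functions and classes with the standard `ast` module, and it ranks chunks by plain keyword overlap as a stand-in for semantic (embedding-based) search. The sample source and the `context_search` signature here are illustrative assumptions.

```python
import ast
import re

# Hypothetical sample file, used only for this illustration.
SOURCE = '''
class PaymentStatus:
    """Lifecycle states of a payment."""
    PENDING = "pending"
    PAID = "paid"


def process_payment(order, card):
    """Charge the card and mark the payment as paid."""
    return PaymentStatus.PAID


def calculate_shipping(order):
    """Return the shipping cost for an order."""
    return 5.0
'''


def split_into_chunks(source: str) -> dict[str, str]:
    """Map each top-level function or class name to its source text."""
    tree = ast.parse(source)
    chunks = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks[node.name] = ast.get_source_segment(source, node)
    return chunks


def words(text: str) -> set[str]:
    """Lowercase word set, splitting snake_case and CamelCase identifiers."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text).replace("_", " ")
    return set(re.findall(r"[a-z]+", spaced.lower()))


def context_search(query: str, chunks: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank chunks by shared query words (keyword stand-in for embeddings)."""
    query_words = words(query)
    scored = [(len(query_words & words(text)), name) for name, text in chunks.items()]
    matches = [(score, name) for score, name in scored if score > 0]
    matches.sort(key=lambda item: -item[0])
    return [name for _, name in matches[:top_k]]


chunks = split_into_chunks(SOURCE)
print(context_search("payment flow", chunks))
```

The query returns the two payment-related chunks and skips `calculate_shipping` entirely, so the agent never pays for the lines it doesn't need. A real index would use embeddings so that a query like "checkout" could also match payment code that shares no keywords with it.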

# Without indexing:
Agent reads payments.py (800 lines)     =  12,000 tokens
Agent reads shipping.py (600 lines)     =   9,000 tokens
Agent reads models.py (1200 lines)      =  18,000 tokens
Agent reads test_payments.py (400 lines) =   6,000 tokens
Total: 45,000 tokens

# With semantic search:
context_search("payment flow")
  → process_payment() (40 lines)        =     600 tokens
  → PaymentStatus class (15 lines)      =     200 tokens
Total: 800 tokens (98% reduction)
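The token counts in that example follow from a flat tokens-per-line ratio. A small sketch that reproduces them, assuming 15 tokens per line (the ratio implied by 800 lines = 12,000 tokens; real files vary):

```python
TOKENS_PER_LINE = 15  # assumption: ratio implied by the example above

full_file_reads = {          # file -> lines read in full
    "payments.py": 800,
    "shipping.py": 600,
    "models.py": 1200,
    "test_payments.py": 400,
}
retrieved_chunks = {         # chunk -> lines returned by the search
    "process_payment()": 40,
    "PaymentStatus class": 15,
}


def estimate_tokens(lines: int) -> int:
    """Rough token estimate for a block of code with the given line count."""
    return lines * TOKENS_PER_LINE


without_index = sum(estimate_tokens(n) for n in full_file_reads.values())
with_index = sum(estimate_tokens(n) for n in retrieved_chunks.values())
reduction = 1 - with_index / without_index

print(f"{without_index:,} -> {with_index:,} tokens ({reduction:.0%} reduction)")
```

The estimate for the chunks lands at 825 tokens rather than 800 because the example rounds the 15-line class down to 200 tokens. The reduction is 98% either way.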

This isn't theoretical. We benchmarked this against FastAPI (53 source files, 180K tokens) with 20 real coding questions:

Metric	Result
Token reduction (full-file → chunks)	94%
Recall@10 (found the right code)	0.90
Search latency (p50)	0.4ms

94% fewer input tokens with 90% recall. The agent finds the right code 9 out of 10 times, using 1/16th of the tokens.
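The benchmark's exact definition of Recall@10 isn't spelled out here, so the sketch below uses one common reading: the fraction of questions where at least one relevant chunk appears in the top 10 results. The chunk IDs are hypothetical placeholders, arranged so that 18 of 20 questions hit.

```python
def recall_at_k(ranked_results: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Fraction of questions whose top-k results contain a relevant chunk."""
    hits = sum(
        1 for results, gold in zip(ranked_results, relevant)
        if gold & set(results[:k])
    )
    return hits / len(relevant)


# Hypothetical data: 20 questions, each with one search result list.
ranked_results = [[f"chunk_{i}"] for i in range(20)]
# The first 18 questions find their chunk; the last two expect chunks that were never returned.
relevant = [{f"chunk_{i}"} for i in range(18)] + [{"missing_a"}, {"missing_b"}]

print(f"Recall@10: {recall_at_k(ranked_results, relevant):.2f}")
```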

The full stack of savings

Token savings isn't a single technique. It's a pipeline. Each layer compounds on the previous one:

Layer	What it does	Savings
1. Retrieval	Full files → relevant chunks	94%
2. Chunk compression	Code chunks → signatures + docstrings	89%
3. Grammar compression	Drop articles, fillers from memory text	13%
4. Output compression	Terser agent replies	25-75%

Layers 1-3 are input savings (85-95% of your tokens). Layer 4 is output savings (5-15% of your tokens, but at 5x the per-token cost).
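Compounding means each layer only works on what the previous layer left behind. A sketch of that effect for layers 1 and 2, assuming they apply in sequence to the same file-read tokens (layer 3 is left out because it acts on memory text, not code chunks):

```python
def remaining_after(reductions: list[float]) -> float:
    """Fraction of tokens left after applying each reduction to the previous result."""
    remaining = 1.0
    for reduction in reductions:
        remaining *= 1 - reduction
    return remaining


RETRIEVAL = 0.94
CHUNK_COMPRESSION = 0.89

after_retrieval = remaining_after([RETRIEVAL])
after_both = remaining_after([RETRIEVAL, CHUNK_COMPRESSION])
saved_by_chunk_compression = after_retrieval - after_both

print(f"left after retrieval:       {after_retrieval:.1%}")
print(f"left after both layers:     {after_both:.2%}")
print(f"saved by chunk compression: {saved_by_chunk_compression:.1%} of the original")
```

Under this assumption an 89% layer removes only about 5% of the original tokens, because retrieval has already removed most of them. That is why later layers look small in an absolute breakdown even when their own rate is high.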

Real numbers from real projects

Here's what users see after a week of using semantic code indexing:

  my-project · 247 queries · last query 5m ago

  ⛁ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶  88% tokens saved

  Input savings   12.4M  tokens   $186.00
  Output savings  48.2k  tokens   $3.62
  ──────────────────────────────────────────
  Total saved   12.4M  tokens   $189.62

  Breakdown:
    retrieval              84%  ▰▰▰▰▰▰▰▰▰▰   10.4M  $156.00 · 247 calls
    chunk compression       3%  ▰▱▱▱▱▱▱▱▱▱   421.5k    $6.32 · 247 calls
    output compression*    <1%  ▰▱▱▱▱▱▱▱▱▱    48.2k    $3.62 · 312 calls

That's $189 saved in a week on a single project. Retrieval (the input side) accounts for $156 of that. Output compression adds $3.62. Both help, but the ratio is 43:1.
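The dollar figures in that report are token counts multiplied by the Opus prices quoted earlier. A short check of the rows, assuming input savings are priced at $15/1M and output savings at $75/1M:

```python
INPUT_PRICE_PER_M = 15.0   # $ per 1M input tokens (Opus, from the article)
OUTPUT_PRICE_PER_M = 75.0  # $ per 1M output tokens


def dollars(tokens: float, price_per_million: float) -> float:
    """Convert a token count into dollars at the given per-million price."""
    return tokens / 1_000_000 * price_per_million


retrieval = dollars(10_400_000, INPUT_PRICE_PER_M)
chunk_compression = dollars(421_500, INPUT_PRICE_PER_M)
output_compression = dollars(48_200, OUTPUT_PRICE_PER_M)
ratio = retrieval / output_compression

print(f"retrieval:          ${retrieval:,.2f}")
print(f"chunk compression:  ${chunk_compression:,.2f}")
print(f"output compression: ${output_compression:,.2f}")
print(f"retrieval vs output compression: {ratio:.0f}:1")
```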

How to set this up (2 minutes)

This is implemented in Code Context Engine (CCE), an open-source MCP server that works with Claude Code, Cursor, VS Code/Copilot, Gemini CLI, and Codex.

uvx --from "code-context-engine[local]" cce init

One command. It indexes your codebase, registers the MCP server, and writes instruction files telling your agent to use context_search instead of reading files directly. No proxy, no API interception, no cloud. Everything runs locally.

After your next coding session:

cce savings

Shows exactly how many tokens and dollars you saved, broken down by layer.

What about provider caching?

Anthropic's prompt caching (90% discount on cache hits) is powerful, but it only helps with content repeated across turns. It doesn't help with the first read, and it doesn't reduce what gets sent in the first place.

Semantic retrieval + provider caching is the strongest combination: you send fewer tokens (retrieval), and the tokens you do send are cached across turns (provider cache). They multiply.
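Here is a simplified model of why the two multiply. It assumes the same context is resent on every turn, the first turn is billed at full price, later turns are cache hits at 10% of the input price, and any cache-write surcharge is ignored. The 10-turn session is an illustrative assumption; the token counts reuse the earlier payment example.

```python
INPUT_PRICE_PER_M = 15.0    # $ per 1M input tokens (Opus, from the article)
CACHE_HIT_DISCOUNT = 0.90   # cache hits billed at 10% of the input price


def context_cost(context_tokens: int, turns: int, cached: bool) -> float:
    """Cost of resending the same context on every turn of a session.

    Simplification: the first turn pays full price and cache-write
    surcharges are ignored.
    """
    full = context_tokens / 1_000_000 * INPUT_PRICE_PER_M
    if not cached:
        return full * turns
    return full + full * (1 - CACHE_HIT_DISCOUNT) * (turns - 1)


TURNS = 10  # hypothetical session length
baseline = context_cost(45_000, TURNS, cached=False)
cache_only = context_cost(45_000, TURNS, cached=True)
retrieval_only = context_cost(800, TURNS, cached=False)
both = context_cost(800, TURNS, cached=True)

for label, cost in [("baseline", baseline), ("cache only", cache_only),
                    ("retrieval only", retrieval_only), ("both", both)]:
    print(f"{label:<15} ${cost:.4f}  ({cost / baseline:.1%} of baseline)")
```

In this model the combined cost fraction equals the retrieval fraction times the caching fraction, which is the sense in which the two savings multiply rather than add.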

The bottom line

If you're spending more than $50/month on AI coding:

Check your input/output ratio. If input is 80%+, that's your optimization target.
Semantic retrieval first. It targets the biggest slice of your bill (file reads) with the highest savings rate (94%).
Output compression second. It helps, especially on models with expensive output (Opus: $75/1M output). But it's a multiplier on a smaller base.
Using both together is best. Retrieval cuts file-read input by 94%. Output compression cuts output by 25-75%. Together they cover both sides of the bill.

Try Code Context Engine (free, open source) →

Code Context Engine is MIT licensed. 170+ stars, 2,300+ monthly installs. Works with Claude Code, Cursor, VS Code/Copilot, Gemini CLI, OpenAI Codex, OpenCode, and Tabnine.