June 2026 · 8 min read

How Much Are You Actually Spending on AI Coding Tokens?

Most developers optimize the wrong side of the equation. Here's the math on where your tokens actually go, and what moves the needle.

The bill you're not reading

If you use Claude Code, Cursor, Codex, or any AI coding agent on a real codebase, you're burning through tokens fast. A typical 30-minute Claude Code session on a medium project (50-100 files) consumes 200k-500k tokens.

At Opus pricing ($15/1M input, $75/1M output), that's roughly $3-8 per session. Do 10 sessions a day and you're looking at $30-80/day per developer. For a team of 5 over about 20 working days, that's $3,000-8,000/month.
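The arithmetic above can be checked with a short sketch. It assumes every token is billed at the $15/1M input rate (which appears to be how the $3-8 range is approximated) and about 20 working days per month. Both are assumptions rather than figures from a billing export, and the function name is hypothetical.

```python
INPUT_PRICE_PER_M = 15.00  # Opus input price quoted above, USD per 1M tokens


def monthly_cost(tokens_per_session, sessions_per_day=10, developers=5,
                 working_days=20, price_per_m=INPUT_PRICE_PER_M):
    """Return (per-session, per-developer daily, team monthly) cost in USD.

    Assumptions: all tokens are priced at the input rate, and a month has
    about 20 working days.
    """
    per_session = tokens_per_session / 1_000_000 * price_per_m
    per_day = per_session * sessions_per_day
    per_month = per_day * developers * working_days
    return per_session, per_day, per_month


for tokens in (200_000, 500_000):
    session, day, month = monthly_cost(tokens)
    print(f"{tokens:>7,} tokens: ${session:.2f}/session, "
          f"${day:.2f}/day, ${month:,.2f}/month for the team")
```

Under these assumptions the upper end comes out at $7.50 per session and $7,500 per month, which the figures above round up.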

But here's the part most people miss:

85-95% of your bill is input tokens, not output

Every time your agent reads a file, greps for a pattern, or explores the codebase, those tokens are input. The agent's replies (the code it writes, the explanations it gives) are output. Input dominates because agents read far more code than they write.

Where the tokens actually go

We instrumented a week of real Claude Code sessions across 3 projects (Python, TypeScript, Go). Here's the breakdown:

Activity                              Tokens   % of total
Reading files (Read, cat, head)       ~180k    45%
Search results (Grep, Glob)           ~80k     20%
Conversation context (prior turns)    ~60k     15%
System prompt + instructions          ~40k     10%
Agent output (code + explanations)    ~40k     10%

File reads and search results alone account for 65% of all tokens. These are the tokens where the agent pulls in entire files just to find a single function.

The compression trap

When developers notice their token costs, the first instinct is output compression. Tools that make the agent reply more tersely. "Caveman mode." Shorter explanations. Telegraphic prose.

The math on this doesn't work out:

Approach                              Savings                    Net bill impact
Output compression (75% reduction)    75% of output tokens       ~8% total savings
Input retrieval (94% reduction)       94% of file-read tokens    ~60% total savings

Output compression saves 75% of the 10% of tokens that are agent output. That's 7.5% off the total.

Input retrieval saves 94% of the 65% of tokens that come from file reads and search results. That's 61% off the total.

Output compression and input retrieval aren't competing approaches. They're complementary. But if you're only doing one, do the one that targets the roughly 90% of tokens that are input, not the 10% that are output.
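The two percentages can be reproduced from the token breakdown table. This is a minimal sketch: the dictionary keys and the function name are hypothetical labels, and it counts tokens rather than dollars, so it does not weight output tokens by their higher price.

```python
# Approximate token counts per category, taken from the breakdown table.
BREAKDOWN = {
    "file_reads": 180_000,
    "search_results": 80_000,
    "conversation_context": 60_000,
    "system_prompt": 40_000,
    "agent_output": 40_000,
}


def total_savings(breakdown, targets, reduction):
    """Fraction of all tokens removed when the `targets` shrink by `reduction`."""
    total = sum(breakdown.values())
    targeted = sum(breakdown[name] for name in targets)
    return targeted * reduction / total


output_share = total_savings(BREAKDOWN, ["agent_output"], 0.75)
input_share = total_savings(BREAKDOWN, ["file_reads", "search_results"], 0.94)
print(f"output compression: {output_share:.1%} of all tokens")
print(f"input retrieval:    {input_share:.1%} of all tokens")
```

Changing the numbers in BREAKDOWN to match your own sessions shows how sensitive the comparison is to the share of tokens spent on file reads and search results.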

Why agents read so many tokens

AI coding agents are surprisingly wasteful with file reads. When you ask "how does the auth flow work?", a typical agent will:

  1. Grep for "auth" across the project (returns 30+ matches)
  2. Read 3-5 full files that mention auth (800+ lines each)
  3. Read import chains to understand dependencies
  4. Read test files for usage examples

Total: 45,000+ tokens of input just to answer one question. The answer uses maybe 200 lines from 2 files. The other 95% of those tokens were noise the agent had to wade through.

What if the agent only got the 200 lines it needed?

That's the core idea behind semantic code indexing. Instead of reading entire files, the agent searches an index and gets back just the relevant functions, classes, and code blocks.

# Without indexing:
Agent reads payments.py (800 lines)     =  12,000 tokens
Agent reads shipping.py (600 lines)     =   9,000 tokens
Agent reads models.py (1200 lines)      =  18,000 tokens
Agent reads test_payments.py (400 lines) =   6,000 tokens
Total: 45,000 tokens

# With semantic search:
context_search("payment flow")
  → process_payment() (40 lines)        =     600 tokens
  → PaymentStatus class (15 lines)      =     200 tokens
Total: 800 tokens (98% reduction)
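To make the idea concrete, here is a toy sketch of chunk-level retrieval using only the Python standard library. It is not how Code Context Engine is implemented: the sample source, the function names, and the scoring are all hypothetical, and plain keyword overlap stands in for the embedding similarity a real semantic index would use.

```python
import ast

# Hypothetical sample module standing in for a real source file.
SOURCE = '''
class PaymentStatus:
    """Lifecycle states for a payment."""
    PENDING = "pending"
    PAID = "paid"


def process_payment(order, card):
    """Charge the card and mark the payment as paid."""
    return PaymentStatus.PAID


def estimate_shipping(order):
    """Return the shipping cost for an order."""
    return 5.0
'''


def chunk_source(source):
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "doc": ast.get_docstring(node) or "",
                "code": ast.get_source_segment(source, node),
            })
    return chunks


def search(chunks, query, limit=2):
    """Rank chunks by how many query words appear in their name or docstring."""
    words = query.lower().split()
    scored = []
    for chunk in chunks:
        text = (chunk["name"] + " " + chunk["doc"]).lower()
        score = sum(1 for word in words if word in text)
        if score > 0:
            scored.append((score, chunk))
    # Sort on the score only; the sort is stable, so ties keep file order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:limit]]


for hit in search(chunk_source(SOURCE), "payment status"):
    print(hit["name"], "->", len(hit["code"].splitlines()), "lines")
```

The agent receives only the matching chunks, so the unrelated estimate_shipping function never enters the context window.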

This isn't theoretical. We benchmarked this against FastAPI (53 source files, 180K tokens) with 20 real coding questions:

Metric                                  Result
Token reduction (full-file → chunks)    94%
Recall@10 (found the right code)        0.90
Search latency (p50)                    0.4ms

94% fewer input tokens with a Recall@10 of 0.90. For 9 out of 10 questions the right code appears in the top 10 results, using roughly 1/16th of the tokens.
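For readers unfamiliar with the metric, Recall@10 can be computed along these lines. This is a sketch of one common definition (one expected chunk per question, counted as a hit if it appears in the top k results); the benchmark's exact scoring method is not described here, and the data below is a hypothetical toy example.

```python
def recall_at_k(ranked_results, expected, k=10):
    """Fraction of questions whose expected chunk appears in the top k results.

    ranked_results: one ranked list of chunk names per question
    expected: the chunk name judged correct for each question
    """
    if not expected or len(ranked_results) != len(expected):
        raise ValueError("need one non-empty expected answer list matching the results")
    hits = sum(1 for results, answer in zip(ranked_results, expected)
               if answer in results[:k])
    return hits / len(expected)


# Hypothetical toy data: three questions, two answered within the top 2.
ranked = [
    ["process_payment", "PaymentStatus"],
    ["estimate_shipping", "process_payment"],
    ["PaymentStatus", "process_payment"],
]
expected = ["process_payment", "refund_payment", "process_payment"]
print(recall_at_k(ranked, expected, k=2))
```

With 20 questions, a score of 0.90 under this definition corresponds to 18 questions where the expected chunk was returned in the top 10.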

The full stack of savings

Token savings isn't a single technique. It's a pipeline. Each layer compounds on the previous one:

Layer                     What it does                               Savings
1. Retrieval              Full files → relevant chunks               94%
2. Chunk compression      Code chunks → signatures + docstrings      89%
3. Grammar compression    Drop articles, fillers from memory text    13%
4. Output compression     Terser agent replies                       25-75%

Layers 1-3 are input savings (85-95% of your tokens). Layer 4 is output savings (the remaining 5-15% of your tokens, but at 5x the per-token cost).
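A rough sketch of how stacked layers compound, using layers 1 and 2 from the table. It assumes each percentage is measured against whatever survives the previous layer, and that chunk compression is applied to every retrieved chunk. Neither assumption is confirmed above, and the function name is hypothetical.

```python
def remaining_fraction(reductions):
    """Fraction of the original tokens left after applying each reduction in turn."""
    remaining = 1.0
    for reduction in reductions:
        remaining *= 1.0 - reduction
    return remaining


# Layer 1 (retrieval, 94%) followed by layer 2 (chunk compression, 89%).
left = remaining_fraction([0.94, 0.89])
print(f"tokens remaining: {left:.2%}, combined reduction: {1 - left:.2%}")
```

Under these assumptions the second layer only acts on the 6% of tokens that survive retrieval, so its contribution to the overall total is small even though its own ratio is high.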

Real numbers from real projects

Here's what users see after a week of using semantic code indexing:

  my-project · 247 queries · last query 5m ago

  ⛁ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶  88% tokens saved

  Input savings   12.4M  tokens   $186.00
  Output savings  48.2k  tokens   $3.62
  ──────────────────────────────────────────
  Total saved   12.4M  tokens   $189.62

  Breakdown:
    retrieval              84%  ▰▰▰▰▰▰▰▰▰▰   10.4M  $156.00 · 247 calls
    chunk compression       3%  ▰▱▱▱▱▱▱▱▱▱   421.5k    $6.32 · 247 calls
    output compression*    <1%  ▰▱▱▱▱▱▱▱▱▱    48.2k    $3.62 · 312 calls

That's $189 saved in a week on a single project. Retrieval (the input side) accounts for $156 of that. Output compression adds $3.62. Both help, but the ratio is 43:1.

How to set this up (2 minutes)

This is implemented in Code Context Engine (CCE), an open-source MCP server that works with Claude Code, Cursor, VS Code/Copilot, Gemini CLI, and Codex.

uvx --from "code-context-engine[local]" cce init

One command. It indexes your codebase, registers the MCP server, and writes instruction files telling your agent to use context_search instead of reading files directly. No proxy, no API interception, no cloud. Everything runs locally.

After your next coding session:

cce savings

Shows exactly how many tokens and dollars you saved, broken down by layer.

What about provider caching?

Anthropic's prompt caching (90% discount on cache hits) is powerful, but it only helps with content that repeats across turns. It doesn't help with the first read, and it doesn't reduce what gets sent in the first place.

Semantic retrieval + provider caching is the strongest combination: you send fewer tokens (retrieval), and the tokens you do send are cached across turns (provider cache). They multiply.
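The "they multiply" claim can be illustrated with a simplified cost model. This sketch assumes the same context is re-sent on every turn, the first turn is billed at full price, every later turn is a cache hit, and any cache-write surcharge is ignored. The 45,000-token context and the 5-turn session are illustrative values, and the function name is hypothetical.

```python
PRICE_PER_M = 15.00    # Opus input price quoted earlier, USD per 1M tokens
CACHE_DISCOUNT = 0.90  # discount on cache hits, as stated above


def context_cost(tokens, turns, retrieval_reduction=0.0, cache_discount=0.0):
    """Cost in USD of sending the same context on every turn of a session.

    Assumptions: turn 1 is full price, later turns are all cache hits,
    and cache-write surcharges are ignored.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    sent = tokens * (1.0 - retrieval_reduction)
    full_price = sent / 1_000_000 * PRICE_PER_M
    later_turns = (turns - 1) * full_price * (1.0 - cache_discount)
    return full_price + later_turns


baseline = context_cost(45_000, turns=5)
scenarios = {
    "retrieval only": context_cost(45_000, 5, retrieval_reduction=0.94),
    "caching only": context_cost(45_000, 5, cache_discount=CACHE_DISCOUNT),
    "both": context_cost(45_000, 5, retrieval_reduction=0.94,
                         cache_discount=CACHE_DISCOUNT),
}
print(f"baseline: ${baseline:.4f}")
for name, cost in scenarios.items():
    print(f"{name}: ${cost:.4f} ({cost / baseline:.1%} of baseline)")
```

In this model the combined cost ratio equals the product of the two individual ratios, because retrieval scales the number of tokens while caching scales the price paid per repeated token.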

The bottom line

If you're spending more than $50/month on AI coding, try Code Context Engine (free, open source).

Code Context Engine is MIT licensed. 170+ stars, 2,300+ monthly installs. Works with Claude Code, Cursor, VS Code/Copilot, Gemini CLI, OpenAI Codex, OpenCode, and Tabnine.