May 17, 2026 · Sovont · 3 min read

The Cost Spike You Didn't See Coming

Nobody models LLM costs seriously until they get the bill. By then, the architecture is already wrong.

The invoice arrives. It’s three times what you expected. You pull up the dashboard, squint at the token counts, and start doing the math you should have done before you shipped.

This is one of the most common LLM production failure modes. Not hallucinations. Not latency. Cost. It’s the one teams consistently underestimate, and it’s the one that gets the most uncomfortable attention from finance.

How it happens.

You prototype with a small prompt against a small dataset. Everything looks reasonable — a few cents per call, maybe a dollar a day. You ship. Usage scales. The prompt grows because someone added more context, more instructions, more examples. The model version changes. A new feature triggers the same LLM call twice in one workflow. Nobody counted tokens — they watched outputs.

The math compounds silently. Input tokens, output tokens, system prompts on every call, RAG context injected at retrieval time. By the time the spike registers, the architecture that generated it is already in production and already expected by users.

Token cost is a first-class architecture concern.

Not a post-launch tuning problem. Not an ops issue. An architecture concern — decided at design time, tracked like any other resource.

That means knowing your token budget per request before you write the first line of integration. Input tokens: what’s the system prompt size, what context gets injected, what’s the user payload ceiling? Output tokens: what’s the expected response length, what happens when the model gets verbose? Multiply by expected call volume. That’s your cost model — rough, but real.
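
The arithmetic above fits in a few lines. This is a minimal sketch, not a pricing tool: the names `CallProfile`, `cost_per_call`, and `monthly_cost` are made up for illustration, and the token counts and per-million-token prices are placeholders, not any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """Token budget for one LLM call (all values are design-time estimates)."""
    system_prompt_tokens: int
    injected_context_tokens: int
    user_payload_tokens: int
    expected_output_tokens: int

    @property
    def input_tokens(self) -> int:
        # Everything the model reads on each call counts as input.
        return (
            self.system_prompt_tokens
            + self.injected_context_tokens
            + self.user_payload_tokens
        )


def cost_per_call(
    profile: CallProfile,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of a single call, in the same currency as the prices."""
    return (
        profile.input_tokens * input_price_per_million
        + profile.expected_output_tokens * output_price_per_million
    ) / 1_000_000


def monthly_cost(
    profile: CallProfile,
    calls_per_day: int,
    input_price_per_million: float,
    output_price_per_million: float,
    days: int = 30,
) -> float:
    """Per-call cost multiplied by expected call volume."""
    per_call = cost_per_call(
        profile, input_price_per_million, output_price_per_million
    )
    return per_call * calls_per_day * days


# Placeholder numbers: swap in your own measurements and your provider's prices.
profile = CallProfile(
    system_prompt_tokens=500,
    injected_context_tokens=3000,
    user_payload_tokens=500,
    expected_output_tokens=400,
)
print(f"per call: {cost_per_call(profile, 3.0, 15.0):.4f}")
print(f"per month: {monthly_cost(profile, 10_000, 3.0, 15.0):.2f}")
```

The point of writing it down is less the number itself than the breakdown: once each component is a named field, it is obvious which one to shrink when the total is too high.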

If you don’t have that number before you ship, you’re flying blind.

The common mistakes:

System prompts that grow without governance. Every time a prompt fix gets pushed, it adds tokens. Nobody tracks the size. A 500-token system prompt that grew to 2,000 over four months means you’re paying 4x for the system prompt on every call and don’t know it.

Context injection with no ceiling. RAG retrieval that dumps top-10 chunks into context, regardless of how much they cost. A relevance threshold that doesn’t also consider token budget. Documents that vary wildly in length, with no truncation logic.
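
One way to give context injection a ceiling is to pack chunks by relevance until a token budget runs out. A rough sketch follows, with assumptions stated up front: `estimate_tokens` is a crude four-characters-per-token heuristic standing in for your model's real tokenizer, and `pack_context`, its parameters, and the `(score, text)` chunk shape are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Replace with the
    # tokenizer that matches your model for real budgeting.
    return max(1, len(text) // 4)


def pack_context(
    chunks: list[tuple[float, str]],
    max_context_tokens: int,
    min_score: float = 0.0,
) -> tuple[list[str], int]:
    """Select retrieved chunks by relevance without exceeding a token budget.

    chunks: (relevance_score, text) pairs in any order.
    Returns the selected texts and the estimated tokens they use.
    """
    selected: list[str] = []
    used = 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if score < min_score:
            break  # everything after this is less relevant still
        cost = estimate_tokens(text)
        if used + cost > max_context_tokens:
            continue  # too big for the remaining budget; a shorter chunk may fit
        selected.append(text)
        used += cost
    return selected, used


retrieved = [
    (0.9, "a" * 400),    # ~100 tokens
    (0.8, "b" * 4000),   # ~1000 tokens, larger than the whole budget
    (0.7, "c" * 200),    # ~50 tokens
    (0.2, "d" * 40),     # below the relevance threshold
]
context, tokens_used = pack_context(retrieved, max_context_tokens=200, min_score=0.5)
print(len(context), tokens_used)
```

Skipping an oversized chunk instead of truncating it is a design choice, not the only option. Truncation keeps the most relevant document in play but risks cutting it mid-thought.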

Output verbosity that nobody tuned. “Respond in JSON” with no max token limit. An LLM that writes a paragraph when a sentence would do. Output length directly drives cost, and most teams set the limit once (if they set it at all) and never revisit it.

No per-feature cost tracking. Every LLM-powered feature has a cost profile. Most systems don’t log it at the feature level. When spend spikes, you’re diffing the entire usage curve instead of isolating the feature that changed.

What to do instead.

Set token budgets per call — and enforce them. Track prompt sizes in version control and flag regressions. Log token usage at the feature level, not just the aggregate. Set cost alerts before they become cost surprises. Review spend weekly during early production, not quarterly during planning.

The same discipline you’d apply to database query plans applies here. You wouldn’t ship a query with no idea of its execution cost. Don’t ship an LLM call without one either.

The bill was always coming. The only variable is whether you saw it before it arrived or after.

Build the cost model first. Everything else is easier from there.