At what token volume does on-premise AI infrastructure become cost-effective?

According to Lenovo's 2026 TCO report, for high-utilization workloads (sustained utilization above 20%), on-premise infrastructure reaches its break-even point in under four months. An analysis published on arXiv in 2025 places the real economic viability threshold at around 50 million tokens per month for mid-size models.

What is the Jevons paradox as applied to AI?

The Jevons paradox describes the phenomenon by which a fall in the unit cost of a resource triggers such a large increase in total consumption that it cancels the expected savings. Applied to AI: the collapse in cost per million tokens (over 98% in two years) was accompanied by a multiplication of use cases (agentic workflows, RAG architectures, persistent agents) large enough that total spending increased despite the falling unit cost.

What is model-agnostic architecture and why prioritize it?

A model-agnostic architecture allows substituting one language model for another without application redesign, by abstracting the inference layer. It preserves migration options if provider pricing shifts or if on-premise infrastructure becomes more cost-effective. This is a design decision that must be made at the initial architecture stage.

Enterprise AI Infrastructure: The Cheap Token Paradox

Last April, I was working on a plane. Eleven hours of flight time, a local language model running on my MacBook. The model was Llama 3.2 3B. I asked it questions about Bayesian networks (my doctoral research addresses computational causality applied to legal reasoning) and its answers were deplorably inaccurate. I asked whether a child’s blood type is conditionally independent of their grandparents’ blood type given their parents’ blood type, which is a trivial question in causal inference (the Markov condition). The model’s answer: no. I submitted an argument showing why the answer was yes. The model acknowledged the correction. I then restated its original error as an objection. It reverted to its initial position. There is something strangely compelling about watching a model oscillate between two contradictory answers depending on conversational pressure. I kept asking myself: what infrastructure would I need to run a genuinely reliable model locally?

The Dominant View at the Start of 2025

The enterprise AI infrastructure question seemed relatively straightforward at the beginning of 2025. The dividing line was clear: if an organization faced a legal obligation of data sovereignty (regulated financial sector, defense, healthcare under HIPAA or equivalent), local or private cloud deployment was mandatory. In all other cases, commercial APIs from OpenAI, Anthropic, or Google represented the rational choice: better models, low marginal cost, zero infrastructure to maintain.

This position rested on an implicit assumption: the token was cheap, and its cost would continue to fall. It was a defensible position, though even then practitioners such as León Palafox were warning that cheap tokens would not last.

The Jevons Paradox Applied to Inference

Between early 2024 and early 2026, the cost of AI inference per million tokens fell by more than 98%. A GPT-4 call cost approximately $60 per million output tokens at the start of that period; comparable models are available today at between $0.10 and $0.75. The great token deflation happened. Yet according to data published by Gartner in January 2026, worldwide spending on AI infrastructure software nearly quadrupled over the same period, rising from $60 billion to $230 billion.

This is the Jevons paradox applied to inference: when the unit cost collapses, consumption expands enough to cancel the expected savings. The mechanism is structural. Agentic workflows (chains of AI agents that call each other to complete a task) can trigger between a few and roughly twenty LLM calls per user task, depending on complexity. RAG architectures, which let a model query a document base before responding, inflate context windows, sometimes by a factor of five to nine depending on the number of retrieved documents. Persistent monitoring agents consume compute around the clock. Goldman Sachs Research projects that global token consumption will be multiplied by twenty-four by 2030, reaching 120 quadrillion tokens per month. That is a projection, worth as much as projections are worth, but it captures the directional trend.

Some organizations have turned this into a talent argument: Jensen Huang, Nvidia’s CEO, declared in March 2026 that he would be “deeply alarmed” if his $500,000-per-year engineers were not consuming at least $250,000 worth of tokens annually.

Cases That Changed the Conversation

Uber deployed Claude Code to its 5,000 engineers in December 2025. By April 2026, CTO Praveen Neppalli Naga confirmed, in a report published by The Information, that the company had exhausted its entire annual AI budget in four months. Adoption had grown from 32% to 84% of engineering teams. Monthly cost per engineer ranged from $500 to $2,000 depending on usage intensity. Naga described the situation as a complete restart on budget planning.

[Image: Batman slapping Robin meme: cheap tokens do not mean lower AI budgets]

Microsoft, a few weeks later, announced the cancellation of its internal Claude Code licenses in its Experiences and Devices division (Windows, Microsoft 365, Surface), shifting to GitHub Copilot CLI. The chosen transition date, June 30, 2026, aligns precisely with the close of Microsoft’s fiscal year.

In the healthcare sector, an undisclosed company consumed one trillion tokens over six months, generating over six million dollars in unplanned costs before the finance team understood what was producing them. The term “tokenmaxxing” has entered the CIO vocabulary.

The Mechanics of Cost Runaway

Every language model API call is stateless: the model retains no memory of the previous call. An agent completing a task across twenty steps must send, at each step, the complete conversation history up to that point. By step twenty, if each step involved reading files or documents, the input context window can exceed fifty thousand tokens. At Claude Sonnet 4.6 pricing ($3 per million input tokens), a single late step in an agent loop costs $0.15 in input tokens alone. Pricing all twenty steps at that late-step rate (an upper bound, since earlier steps carry less context), with fifty tasks per developer per day, twenty developers, and twenty-two working days, gives roughly $66,000 per month for a twenty-person team.
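The estimate is easy to re-run under different assumptions. The sketch below takes the $3 per million input tokens and the fifty-thousand-token late-step context from the text; the function names, the linear context-growth model, and the decision to ignore output tokens are illustrative assumptions, not a description of any provider's billing.

```python
PRICE_PER_M_INPUT = 3.00  # USD per million input tokens (figure quoted in the text)


def step_cost(context_tokens: float, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Input-token cost of one stateless call that resends the whole context."""
    return context_tokens / 1_000_000 * price_per_m


def task_cost_upper_bound(steps: int, final_context: float,
                          price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Pessimistic case: every step is billed at the final (largest) context size."""
    return steps * step_cost(final_context, price_per_m)


def task_cost_linear_growth(steps: int, final_context: float,
                            price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Assumption: context grows linearly, reaching final_context at the last step."""
    return sum(
        step_cost(final_context * k / steps, price_per_m)
        for k in range(1, steps + 1)
    )


def monthly_cost(task_cost: float, tasks_per_dev_per_day: int,
                 developers: int, working_days: int) -> float:
    """Scale a per-task cost to a team and a month."""
    return task_cost * tasks_per_dev_per_day * developers * working_days


if __name__ == "__main__":
    upper = monthly_cost(task_cost_upper_bound(20, 50_000), 50, 20, 22)
    linear = monthly_cost(task_cost_linear_growth(20, 50_000), 50, 20, 22)
    print(f"upper bound:   ${upper:,.0f} per month")
    print(f"linear growth: ${linear:,.0f} per month")
```

Under these assumptions the upper bound is about $66,000 per month and the linear-growth variant about $34,650; raising `steps` to fifty with the same final context lifts the upper bound to about $165,000. The total is very sensitive to step count and context size, which is why both are left as parameters.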

The pricing model for commercial APIs follows a metered logic: every input and output token is billed. The classic per-seat SaaS subscription with a fixed price applies only to consumer-facing interfaces (and even then, with per-seat token limits). The moment an organization moves to the API (i.e., the moment it builds something), it enters a variable consumption economy, with a spread exceeding 600x between the cheapest model and the most expensive frontier reasoning models.

The Ground Shifting: The Hidden Subsidy

OpenAI projects a $14 billion loss in 2026, against annualized revenues that exceeded $20 billion at the end of 2025. Anthropic, whose revenues reached $45 billion annualized in May 2026, pushed its cash-flow-positive target to 2028, after initially setting it at 2027. Both companies, along with their competitors, price inference below their actual production cost. The objective is market share capture during the adoption phase. Financial markets fund the gap.

An organization that builds its business processes on APIs whose current pricing is subsidized by venture capital is building on unstable ground. When valuations impose discipline, prices rise. The repricing risk is documented in these companies’ own financial projections. And it receives remarkably little attention.

What Total Cost of Ownership Analyses Show

The 2026 edition of Lenovo’s generative AI TCO report, based on a comparison with equivalent instances at AWS and GCP, finds that for high-utilization workloads (above 20% continuous utilization), on-premise infrastructure reaches break-even in under four months. The per-token cost advantage reaches 8x relative to cloud IaaS and up to 18x relative to frontier Model-as-a-Service APIs.

An analysis published on arXiv in November 2025 refines this finding: for small models, break-even occurs within a few months; for mid-size models, around two years; for large models, five years. This makes local deployment economically justified primarily for organizations processing more than fifty million tokens per month, or operating under strict data residency requirements. Deloitte sets the viability threshold at the point where on-premise costs reach 60 to 70% of the cloud equivalent.
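A break-even figure of this kind comes down to dividing the upfront hardware outlay by the monthly saving relative to the cloud bill. The sketch below is a minimal version of that calculation; the function name and every number in the usage example are hypothetical placeholders, not values from the Lenovo, arXiv, or Deloitte analyses.

```python
from typing import Optional


def break_even_months(hardware_capex: float,
                      monthly_onprem_opex: float,
                      monthly_cloud_cost: float) -> Optional[float]:
    """Months until cumulative on-premise cost drops below cumulative cloud cost.

    Returns None when on-premise running costs are not lower than the cloud
    bill, in which case the hardware is never paid back.
    """
    monthly_saving = monthly_cloud_cost - monthly_onprem_opex
    if monthly_saving <= 0:
        return None
    return hardware_capex / monthly_saving


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the shape of the result.
    months = break_even_months(
        hardware_capex=250_000,      # servers, GPUs, installation
        monthly_onprem_opex=5_000,   # power, cooling, maintenance, staff share
        monthly_cloud_cost=70_000,   # equivalent API or IaaS bill
    )
    print("never" if months is None else f"{months:.1f} months")
```

The cloud cost term is the one that moves with token volume, which is why the same hardware can pay for itself in months for one organization and never for another.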

The decisive variable is not headcount but the token volume generated by automations. A fifty-person organization running intensive agentic workflows can cross this threshold before a five-hundred-person organization with light conversational usage. This does not mean one is right and profitable while the other is wrong and unprofitable. These are simply the parameters of the mathematical problem.

The Necessary Revision

The initial position (commercial APIs unless legally required otherwise) remains defensible for light, non-agentic conversational uses. It warrants revision as soon as an organization deploys or plans to deploy autonomous agents, RAG pipelines in production, or any workflow that generates LLM calls in the background without direct user interaction.

For these use cases, a three-year TCO analysis must precede the infrastructure decision, structured around two variables: the expected growth rate of token volume, and the repricing risk from providers once the subsidy phase ends.
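The API side of such an analysis can be sketched as a month-by-month projection driven by those two variables. The following is a minimal illustration; the function name, its parameters, and all figures in the usage example are assumptions made for the example, and it models repricing as a single step change rather than a gradual drift.

```python
from typing import Optional


def project_api_spend(start_tokens_per_month: float,
                      monthly_growth: float,
                      price_per_m: float,
                      months: int = 36,
                      repricing_month: Optional[int] = None,
                      repricing_factor: float = 1.0) -> float:
    """Cumulative API spend in USD over the horizon.

    monthly_growth   compound growth rate of token volume (0.05 means +5% per month)
    repricing_month  first month (1-based) at which the new price applies, if any
    repricing_factor multiplier applied to the price from repricing_month onward
    """
    total = 0.0
    tokens = start_tokens_per_month
    for month in range(1, months + 1):
        price = price_per_m
        if repricing_month is not None and month >= repricing_month:
            price *= repricing_factor
        total += tokens / 1_000_000 * price
        tokens *= 1 + monthly_growth
    return total


if __name__ == "__main__":
    # Hypothetical scenario: 50M tokens per month today, growing 8% per month.
    baseline = project_api_spend(50e6, 0.08, price_per_m=3.0)
    repriced = project_api_spend(50e6, 0.08, price_per_m=3.0,
                                 repricing_month=13, repricing_factor=2.0)
    print(f"36-month spend, stable price:     ${baseline:,.0f}")
    print(f"36-month spend, 2x from month 13: ${repriced:,.0f}")
```

Comparing the two totals against the cost of owned hardware over the same horizon gives a first-order view of how much of the decision hinges on the repricing assumption.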

Model-agnostic architecture (building so that one language model can be substituted for another without application redesign) is the engineering decision to make today, independent of the infrastructure choice. It preserves migration options if pricing shifts. If this has not been addressed yet, it warrants attention.
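In code, this usually means that application logic depends on a small interface rather than on a specific vendor SDK. The sketch below shows one way to do that; `TextModel`, `EchoModel`, `summarize`, and the backend registry are hypothetical names invented for the example, and the stub backends stand in for real hosted or local inference clients.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class TextModel(Protocol):
    """The only surface the application is allowed to depend on."""

    def complete(self, prompt: str) -> str:
        ...


@dataclass
class EchoModel:
    """Stand-in backend; a real adapter would wrap a hosted API or a local runtime."""

    label: str

    def complete(self, prompt: str) -> str:
        return f"{self.label}: {prompt}"


# Hypothetical registry: swapping providers becomes a configuration change.
BACKENDS: Dict[str, TextModel] = {
    "hosted": EchoModel("hosted-stub"),
    "local": EchoModel("local-stub"),
}


def get_model(name: str) -> TextModel:
    return BACKENDS[name]


def summarize(model: TextModel, document: str) -> str:
    """Application code: knows about TextModel, not about any vendor."""
    return model.complete(f"Summarize in one sentence:\n{document}")


if __name__ == "__main__":
    for backend in ("hosted", "local"):
        print(summarize(get_model(backend), "Quarterly token usage report"))
```

Prompt formats, tool-calling conventions, and context limits still differ between models, so an interface like this reduces migration cost rather than eliminating it.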

Implications for Generative AI Strategy

In my work with organizations on generative AI strategy, the budget question consistently arrives late in the conversation, frequently after architectural decisions have already been made. This is the reverse of the correct sequence.

The right approach: first identify high-transaction-volume use cases, model the token volumes they generate over twelve and thirty-six months, then select the infrastructure.

On my MacBook, I now run DeepSeek-R1-Distill-Qwen-7B, a 7-billion-parameter quantized model running locally without a network connection. It takes three minutes to answer a complex causality question. It generally produces the correct answer on the first attempt and does not change its position when contradicted without argument. Latency is the price of reasoning sovereignty. In the context of strategic analysis, that may be an acceptable trade-off. The prerequisite, however, is having defined the task to be solved.