Here is the memo your manager sent last quarter: "We expect all engineers to integrate AI tools into their daily workflow." Here is the Slack message from finance this week: "Please be mindful of token consumption — we're seeing overages." Here is the question nobody is asking out loud: which one of these do you want me to do?
This is the token squeeze, and it's coming for every developer who uses AI coding tools in 2026. Companies are simultaneously pushing harder on AI adoption mandates and tightening the economic constraints on actually using the tools. The developers caught in the middle are being graded on velocity and token efficiency — two goals that pull in opposite directions.
The ones who figure out how to navigate this contradiction won't be the heaviest AI users or the lightest. They'll be the most judicious ones.
The Squeeze from Above
The mandates are not subtle.
Shopify CEO Tobi Lutke published an internal memo declaring that "using AI effectively is now a fundamental expectation of everyone at Shopify." Teams must prove why jobs can't be done by AI before hiring new headcount. AI usage questions were added to performance reviews.1
Coinbase CEO Brian Armstrong went further — asking developers to personally justify not using AI tools. Some who couldn't were fired.2 Block, under Jack Dorsey, has cut over 4,000 jobs — nearly half its workforce — across multiple rounds of AI-motivated layoffs, while mandating every remaining employee use generative AI daily.3 Duolingo went "AI-first," offboarded 10% of contractors, and put AI effectiveness into performance reviews. Klarna cut 40% of its headcount citing AI replacement.4
At Amazon, approximately 1,500 engineers endorsed an internal forum post urging access to Claude Code after the company said it wouldn't support additional third-party AI tools.5 The message was clear: developers want these tools, even when companies restrict them.
LeadDev reported that developers describe "executives instituting OKRs and tracking AI usage without any regard for whether it's actually helping." Nearly half of C-suite executives admitted that AI adoption was "tearing their company apart" — and 75% of leaders believed their rollout was successful while only 45% of employees agreed.2
The pressure is real, it's top-down, and it's tied to your performance review. Use more AI. Use it for everything. Why aren't you using it?
The Squeeze from Below
Now here's what happened to the tools this week.
On March 19, Windsurf killed its credit-based pricing and replaced it with daily quotas. One developer calculated a 10-20x overnight price increase for Opus usage. Another found that a single code review consumed 8% of his entire weekly quota. Under credits, you could bank unused allocation for heavy days. Under quotas, unused daily tokens vanish at midnight.6
On March 16, GitHub discovered a bug in Copilot's rate-limiting system that had been undercounting tokens from newer models. When they fixed it, the corrected counts pushed paying Pro+ users over their limits — locking them out across all models after as few as two requests. GitHub's official response received 38 thumbs-down and 2 thumbs-up.7
Cursor users reported burning 40 million tokens in half a day on the Ultra plan. One team's $7,000 annual subscription was depleted in a single day.8 A year earlier, Cursor had already faced a pricing revolt when a switch from fixed requests to credit-based pricing effectively cut monthly requests from 500 to 225, prompting a public apology from the CEO.9
Anthropic introduced weekly rate limits on the $200 Max plan specifically targeting heavy Claude Code users, acknowledging that the average power user costs the company roughly $6/day in API-equivalent spend — and that some consume far more.10
Google quietly slashed Gemini API free quotas by 50-92% in December 2025, causing tens of thousands of developer projects to grind to a halt overnight.11
The pattern is industry-wide. Every major platform has moved away from unlimited-use models toward credit, quota, and token-based systems. The tools your company is mandating you use are simultaneously making themselves harder and more expensive to use.
Spend less on AI.
Pick one.
The Developer in the Middle
One developer tracked eight months of daily Claude Code usage across roughly 10 billion tokens. The API-equivalent cost would have been over $15,000. Peak month: $5,623 in API-equivalent value on a subscription that costs $200.12 That's why Anthropic added rate limits — they were losing thousands per heavy user per month.
Enterprise-wide, the math is daunting. A 500-developer team on GitHub Copilot Business faces roughly $114,000 per year in base subscription costs alone. On Cursor Business: $192,000. On Tabnine Enterprise: $234,000. And these are floor costs — token overages, premium model surcharges, and agentic usage multipliers push real bills 2-5x higher for heavy users.13
Meanwhile, McKinsey reports that roughly 80% of companies that have deployed generative AI see no material impact on earnings. Only 6% qualify as "high performers" who attribute more than 5% of earnings to AI.14 Gartner places generative AI squarely in the "Trough of Disillusionment" throughout 2026, even as worldwide AI spending is projected to hit $2.5 trillion.15
CIOs are setting aside 9% of IT budgets just for price increases on existing software that now includes AI features they may not have asked for. The cost is not optional — the AI is baked into the tools whether you want it or not.15
And the individual developer — the person sitting between the mandate and the budget — is watching their trust erode. The Stack Overflow 2025 Developer Survey found that AI tool usage climbed to 84% while trust fell to 29%. Forty-six percent actively distrust AI accuracy. Sixty-six percent say they spend more time fixing "almost-right" AI code than the code is worth. Sixty-eight percent report spending more time on security vulnerabilities now than before using AI tools.16
Use more AI. Spend less on AI. The AI doesn't really work. But use it anyway. And be efficient about it.
The New Skill
Jensen Huang thinks he has the answer. At GTC in March 2026, the Nvidia CEO proposed giving engineers a token budget equivalent to roughly half their base salary — $100,000 to $150,000 per year in compute credits — on top of their salary.17 Tomasz Tunguz at Theory Ventures predicts that AI inference will join salary, bonus, and equity as the fourth standard component of tech compensation. Engineers at Meta and OpenAI are already competing on internal leaderboards that track token consumption.18
Critics note that tokens are not cash. They have no portable value outside the employer's tools. They don't compound like equity. They don't show up in your next-offer negotiation. You can't pay rent with tokens.
But Huang's proposal reveals the underlying shift: token efficiency is becoming a measurable skill. Just as cloud computing created the FinOps discipline — where engineers learned to optimize infrastructure spend alongside feature development — AI tools are creating what might be called TokenOps or AI FinOps.
In 2024, 31% of FinOps teams managed AI spend. In 2025, it was 63%. In 2026, it's 98%.19 The discipline went from niche to universal in two years. The cloud parallel is instructive: AWS cut prices over 100 times in a decade, yet enterprise cloud bills grew every year because usage expanded faster than unit costs fell. The same dynamic is playing out with tokens.
LLM inference prices are falling at roughly 10x per year for equivalent performance — a widely cited benchmark from a16z and industry analysts.20 GPT-4-class performance that cost $30 per million input tokens at launch in early 2023 now costs well under a dollar. And it doesn't matter, because usage is growing faster than costs are falling.
The Penny Wise, Pound Foolish Loop
The most counterintuitive finding in the cost data is what researchers are calling the LLM Cost Paradox: cheaper models are actually breaking budgets.21
The mechanism is straightforward. Output tokens cost 3-10x more than input tokens, but companies focus on input price when comparing models. A model with low input costs and high retry rates generates more output tokens — and more output tokens cost more — than an expensive model that gets it right the first time.
The true metric is cost per successful output. If you spend $100 on a cheap model and get 50 usable results, your cost is $2 per successful output. If you spend $150 on an expensive model and get 140 usable results, your cost is about $1.07. The expensive model is roughly 46% cheaper in practice.
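The metric is simple enough to make mechanical. A minimal sketch, using the illustrative dollar figures and success counts from this section (not measured benchmark data):

```python
def cost_per_successful_output(total_spend: float, usable_results: int) -> float:
    """Dollars spent per output you could actually ship."""
    if usable_results == 0:
        return float("inf")  # every dollar wasted
    return total_spend / usable_results

# The "cheap" model from the example: $100 spent, 50 usable results
cheap = cost_per_successful_output(100, 50)       # $2.00 per success
# The "expensive" model: $150 spent, 140 usable results
expensive = cost_per_successful_output(150, 140)  # ~$1.07 per success

savings = 1 - expensive / cheap
print(f"cheap: ${cheap:.2f}/success, expensive: ${expensive:.2f}/success "
      f"({savings:.0%} cheaper)")
```

The point of writing it down is that neither input nor output price appears anywhere in the function: only total spend and usable results matter.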
Then there are the hidden engineering costs. A senior engineer spending ten hours tuning prompts for a cheaper model incurs significant labor cost that a capable model avoids with simpler prompts. And the downstream costs are even larger: low-quality outputs create missed bugs, security vulnerabilities shipped to production, and technical debt that accumulates silently until it doesn't.22
The practical recommendation from the research: use a cheaper model for roughly 70% of routine tasks and reserve the expensive model for the 30% that's complex. This yields better ROI than going all-in on either extreme.21
But this requires a skill most developers haven't been asked to develop: knowing which tasks are which.
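The mechanism behind the 70/30 recommendation can be sketched with a toy expected-cost model. All the numbers here are assumptions chosen to show the dynamic, not benchmarks: per-attempt prices and first-pass success rates will vary wildly by task and model.

```python
# Illustrative assumptions: $ per attempt and first-pass success rate
# by (model, task type). Retry a model until it succeeds.
COST = {"cheap": 0.02, "expensive": 0.15}
SUCCESS = {
    ("cheap", "routine"): 0.90, ("cheap", "complex"): 0.05,
    ("expensive", "routine"): 0.95, ("expensive", "complex"): 0.90,
}
MIX = {"routine": 0.70, "complex": 0.30}  # the 70/30 task split

def expected_cost(model_for: dict) -> float:
    """Expected $ per successful task: cost per attempt / success rate,
    weighted by the share of each task type."""
    return sum(share * COST[model_for[task]] / SUCCESS[(model_for[task], task)]
               for task, share in MIX.items())

all_cheap = expected_cost({"routine": "cheap", "complex": "cheap"})
all_big = expected_cost({"routine": "expensive", "complex": "expensive"})
routed = expected_cost({"routine": "cheap", "complex": "expensive"})
print(f"all cheap: ${all_cheap:.3f}  all expensive: ${all_big:.3f}  "
      f"routed: ${routed:.3f}")
```

Under these assumed numbers the routed strategy beats both extremes, and the lever is the cheap model's failure rate on complex work: retries on a model that almost never succeeds are the most expensive tokens you can buy.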
The Winners
In 2005, a freestyle chess tournament was won by two American amateurs — Steven Cramton, rated 1685, and Zackary Stephen, rated 1398 — using three ordinary laptops. They defeated grandmaster-led teams and the chess supercomputer Hydra. They weren't better chess players. They had a better process for using their tools.
Garry Kasparov, reflecting on this, wrote the most cited sentence in centaur chess history: "Weak human + machine + better process was superior to a strong computer alone and, more remarkably, superior to a strong human + machine + inferior process."23
The parallel to today's token squeeze is direct. The developers who will excel in 2026 are not the ones who prompt the most or the least. They're the ones who develop the best process for deciding:
When to use Opus versus Sonnet versus Haiku. Haiku is 12x cheaper than Sonnet for simple tasks. Three-tier routing alone can cut costs 40-70% while maintaining output quality.24
When to use prompt caching. Caching can reduce costs by 60-95% for repetitive context windows — the single most powerful optimization technique available.24
When to write code yourself. This is the one nobody talks about. Sometimes the most token-efficient approach is to just write the damn code. If you can solve something in twenty minutes of focused work, burning tokens on three rounds of prompting, reviewing, debugging, and re-prompting is not efficiency — it's waste. The METR study found experienced developers were 19% slower with AI on their own repositories, despite believing they were 20% faster.25 For tasks where you already have deep context, the AI may be costing you time and tokens.
When to decompose before prompting. A well-decomposed task that takes one precise prompt costs a fraction of a vague task that takes six rounds of clarification. Concise prompt engineering — eliminating filler, providing targeted context, being specific about outputs — yields a 30-50% token reduction on its own.24
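The decisions above can be sketched as a tiny routing layer. Everything here is hypothetical: the tier names map to Anthropic's model families, the prices are approximate public per-million-input-token rates that will drift, and the keyword heuristic is a stand-in for real task classification, not a production router.

```python
# Hypothetical three-tier router; prices are placeholder approximations.
TIERS = {
    "haiku":  {"price_per_mtok": 0.25, "use_for": "boilerplate, renames, simple tests"},
    "sonnet": {"price_per_mtok": 3.00, "use_for": "typical features, refactors"},
    "opus":   {"price_per_mtok": 15.00, "use_for": "architecture, tricky debugging"},
}

def route(task: str) -> str:
    """Crude keyword heuristic standing in for real task classification."""
    t = task.lower()
    if any(k in t for k in ("rename", "format", "docstring", "boilerplate")):
        return "haiku"
    if any(k in t for k in ("architecture", "design", "race condition", "security")):
        return "opus"
    return "sonnet"  # default to the middle tier

for task in ("rename this variable across the module",
             "debug a race condition in the job queue",
             "add a unit test for the parser"):
    tier = route(task)
    print(f"{task!r} -> {tier} (${TIERS[tier]['price_per_mtok']}/Mtok)")
```

The missing fourth branch, of course, is the one from the list above: return nothing and write the code yourself when you already have the context.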
Anthropic's own engineering guidance makes the point explicitly: tools should "return information that is token efficient" and "encourage efficient agent behaviors."26 Even the model's maker is telling you to use fewer tokens.
The New Performance Review
Here's what the performance review of 2027 might look like: not "did you use AI?" or "how fast did you ship?" but "what was your cost per successful output?"
The developer who ships ten features with $500 in token spend will be more valuable than the one who ships twelve features with $5,000 in token spend. The one who knows that this task is a Haiku task and that task is an Opus task — and who can explain why — will outperform the one who runs everything through the most expensive model because that's what the default is set to.
This is not a hypothetical. Meta and OpenAI engineers are already on internal leaderboards tracking token consumption.18 Flexera is building token-level cost attribution into its FinOps platform.19 IDC warns that by 2027, large organizations will face up to a 30% rise in underestimated AI infrastructure costs.27
The squeeze will only tighten. And the developers who've already internalized token efficiency as a craft skill — not a constraint but a discipline, like writing clean code or managing technical debt — will be the ones who come out the other side.
The best centaur chess players weren't the ones who ran the most positions through the engine. They were the ones who knew which positions to run.
That's the skill that matters now. Not whether you use AI. Not how much AI you use. But whether you know, for any given problem, the most efficient path from question to answer — and whether that path runs through a $25-per-million-token model, a $5 model, a $0.60 model, or your own brain.
Your company won't tell you this. They'll tell you to use more AI. Then they'll tell you to spend less on it. The ones who figure out that these are the same instruction — use AI better, not more — will be the ones who survive the squeeze.
Disclosure
This article was written with the assistance of Claude, an AI made by Anthropic — one of the platforms whose pricing is discussed above. The author used multiple model tiers during research and drafting, which felt like an appropriate way to write about token efficiency. We have tried to represent all platforms' pricing and policies accurately. Corrections welcome at bustah@sloppish.com.
Citations
1. CNBC, "Shopify CEO says staffers need to prove jobs can't be done by AI before asking for more headcount," April 2025. Link.
2. LeadDev, "AI coding mandates are driving developers to the brink." Link.
3. MetaIntro, "Jack Dorsey, Block Layoffs & AI Mandates," February 2026. Link.
4. WebProNews, "Klarna's AI Gamble: Halving Staff While Boosting Paychecks." Link. Duolingo: CNBC.
5. The Register, "Devs gripe about having AI shoved down their throats," November 2025. Link.
6. Efficienist, "Windsurf Abandons Flexible Credit System for Strict Quotas." Link. Trustpilot reviews: Link.
7. GitHub Community Discussion #189990. Link.
8. Startup Hakk, "Claude Code vs Cursor: The Hidden Costs." Link. UserJot: Link.
9. TechCrunch, "Cursor apologizes for unclear pricing changes that upset users," July 2025. Link.
10. TechCrunch, "Anthropic unveils new rate limits to curb Claude Code power users," July 2025. Link. Average usage data: Martin Alderson.
11. Google Gemini API quota reduction, December 2025. Reported across developer forums.
12. ClaudeLog, "Claude Code Pricing." 8-month usage tracking across ~10 billion tokens. Link.
13. SitePoint, "AI Coding Tools Cost Analysis & ROI Calculator 2026." Link. Palma AI: Link.
14. McKinsey, "The state of AI in 2025." Link.
15. Gartner, IT Spending Forecast 2026. Link. AI spending: Link.
16. Stack Overflow 2025 Developer Survey. Over 49,000 respondents. AI section. Security vulnerability time increase: DEV Community.
17. CNBC, "Nvidia's Huang pitches AI tokens on top of salary," March 2026. Link.
18. TechCrunch, "Are AI tokens the new signing bonus — or just a cost of doing business?" March 2026. Link.
19. State of FinOps 2026 Report. Link. Flexera: Link.
20. The ~10x annual price decline for equivalent LLM performance is a widely cited industry benchmark. Epoch AI's "LLM Inference Price Trends" (Link) reports higher figures using specific benchmark methodologies; we use the more conservative a16z/industry consensus estimate. The GPT-4 to GPT-4o-mini price trajectory ($30→$0.15/Mtok input in ~18 months) is documented in OpenAI's public pricing history.
21. iKangai, "The LLM Cost Paradox: How 'Cheaper' AI Models Are Breaking Budgets." Link.
22. CodeAnt, "Cheap LLM Models — The Costs Vendors Don't Show You." Link.
23. Garry Kasparov, "The Chess Master and the Computer," New York Review of Books, February 2010. Link.
24. 10Clouds, "Mastering AI Token Cost Optimization: Proven Strategies." Link. SparkCo: Link.
25. METR, "Early 2025 AI Experienced Open-Source Developer Study," July 2025. 16 developers, 246 tasks. Link. Note: small sample size.
26. Anthropic, "Effective Context Engineering for AI Agents." Link.
27. IDC, "Balancing AI Innovation and Cost: The New FinOps Mandate." Link.