Part IV — Production Engineering/Chapter 26: Token Economics
Chapter 26

Token Economics & AI Business Models — The Math That Makes or Breaks AI Companies

Every software business in the last twenty years was built on the same economic miracle: near-zero marginal cost. Build the product once, then serve each additional user for fractions of a penny. A SaaS company spending $50,000 per month on infrastructure could serve 10,000 users or 100,000 users with barely any difference in cost. This is why SaaS gross margins hit 77-88% — the gap between what you charge and what it costs to serve each user is enormous.

AI changes this equation in a way that most builders do not fully appreciate until their first cloud bill arrives.

Every time a user interacts with your AI feature, you pay for computation. Not a fraction of a penny — real money. A single GPT-4 conversation can cost $0.05-0.50. A Claude Opus call with a long prompt can cost $0.10-1.00. Multiply that by thousands of daily users, and you are looking at infrastructure costs that would make a traditional SaaS founder faint.

Analogy

Traditional SaaS is like a bus pass. You sell a monthly pass for $50. Whether the rider takes 2 rides or 200, your cost stays roughly the same — the bus runs the same route regardless. AI products are like a taxi service. Every ride has a real, measurable cost — fuel, driver time, vehicle wear. If you sell an "unlimited taxi rides" pass for $50/month and a customer takes 200 rides, you are bankrupt. The pricing model that works for the bus does not work for the taxi. And AI is a taxi.

Part 1: The Cost Structure of AI Products

Where the Money Goes

When a user sends a message to your AI chatbot, here is what happens financially:

USER SENDS MESSAGE: "Summarize this 10-page document"

COST BREAKDOWN:
┌─────────────────────────────────────────────────┐
│  INPUT TOKENS (the document + system prompt)    │
│  ~4,000 tokens × $15/M tokens (Opus)            │
│  = $0.060                                       │
├─────────────────────────────────────────────────┤
│  OUTPUT TOKENS (the summary)                    │
│  ~800 tokens × $75/M tokens (Opus)              │
│  = $0.060                                       │
├─────────────────────────────────────────────────┤
│  EMBEDDING (for RAG retrieval)                  │
│  ~4,000 tokens × $0.10/M tokens                │
│  = $0.0004                                      │
├─────────────────────────────────────────────────┤
│  VECTOR DB QUERY                                │
│  ~$0.0001                                       │
├─────────────────────────────────────────────────┤
│  TOTAL COST FOR ONE INTERACTION: ~$0.12         │
└─────────────────────────────────────────────────┘

Now scale this:

Daily InteractionsCost per InteractionDaily CostMonthly Cost
1,000$0.12$120$3,600
10,000$0.12$1,200$36,000
100,000$0.12$12,000$360,000
1,000,000$0.12$120,000$3,600,000

At 10,000 daily interactions — a modest number for a product with a few thousand active users — you are spending $36,000 per month on AI inference alone. Before servers, before databases, before engineering salaries.

Real-Life Example

In 2024, a16z (Andreessen Horowitz) published a widely-cited analysis of AI company economics titled "The Cost of AI Inference." They found that AI-native companies were spending 20-40% of their revenue on model inference — compared to 5-10% of revenue on infrastructure for traditional SaaS companies. For some AI startups, inference costs exceeded revenue entirely, sustained only by venture capital funding. The report concluded that "managing inference cost is the defining operational challenge of AI companies."

SaaS Margins vs. AI Margins

The margin difference between traditional SaaS and AI products is stark:

MARGIN COMPARISON

Traditional SaaS:
  Revenue per user:     $50/month
  Infrastructure cost:  $2-5/month    (servers, DB, CDN)
  Gross margin:         90-96%

AI-Native Product:
  Revenue per user:     $50/month
  AI inference cost:    $15-25/month  (model API calls)
  Other infrastructure: $2-5/month
  Gross margin:         40-66%

┌──────────────────────────────────────────────┐
│            GROSS MARGIN SPECTRUM              │
│                                              │
│  SaaS        ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  77-88% │
│  AI (good)   ▓▓▓▓▓▓▓▓▓▓▓▓          55-65% │
│  AI (bad)    ▓▓▓▓▓▓▓               30-45% │
│  AI (death)  ▓▓▓                    10-20% │
│                                              │
│  Venture-fundable minimum: ~50%              │
└──────────────────────────────────────────────┘
Why This Exists

The difference between a 80% margin business and a 55% margin business is not 25 percentage points. It is the difference between a company that can afford aggressive marketing, big engineering teams, and mistakes — and a company where one pricing miscalculation, one unexpectedly chatty user cohort, or one model price increase can destroy profitability overnight. AI margins are thinner and more volatile. This is the fundamental strategic challenge of building AI products.

Part 2: The Math, Worked Out

Let us build a detailed financial model for a realistic AI product: a customer support chatbot serving a SaaS company with 5,000 paying customers.

Assumptions:

  • Average customer sends 15 support messages per month
  • Each message requires: system prompt (500 tokens) + conversation history (1,500 tokens avg) + user message (100 tokens) + response (400 tokens)
  • Using Claude Sonnet ($3/M input tokens, $15/M output tokens as of March 2026)
PER-INTERACTION COST:
  Input:  (500 + 1500 + 100) = 2,100 tokens × $3/M   = $0.0063
  Output: 400 tokens × $15/M                          = $0.0060
  Total per interaction:                               = $0.0123

MONTHLY SCALE:
  5,000 customers × 15 messages = 75,000 interactions
  75,000 × $0.0123 = $922.50/month

  Not bad! Under $1,000/month for AI serving 5,000 customers.

But here is where the math gets dangerous. Let us say the product grows, and you add features:

FEATURE CREEP COST EXPLOSION:

Original chatbot:                    $922/month

+ Add RAG (retrieve 5 docs per query):
  Extra 3,000 input tokens/query
  75,000 × 3,000 × $3/M = $675
  New total:                         $1,597/month

+ Add conversation memory (last 10 msgs):
  Extra 4,000 input tokens/query
  75,000 × 4,000 × $3/M = $900
  New total:                         $2,497/month

+ Upgrade to Opus for complex queries (20%):
  15,000 queries × $0.12 avg = $1,800
  New total:                         $4,297/month

+ Add follow-up suggestions (extra output):
  Extra 200 output tokens/query
  75,000 × 200 × $15/M = $225
  New total:                         $4,522/month

COST GREW 4.9x WITHOUT ADDING A SINGLE NEW CUSTOMER.
Why This Exists

In traditional SaaS, adding features to your product is nearly free at the infrastructure level. Adding a new page, a new filter, a new export button — the server cost is negligible. In AI products, every feature that touches the model has a direct, measurable cost. A "summarize this conversation" button is not a free feature — it is a financial commitment that scales with usage. Product decisions are cost decisions.

Part 3: The Margin Trap

The margin trap is when you charge what feels like a reasonable price, your users love the product, your revenue is growing — and you are losing money on every interaction.

Here is how it happens:

THE MARGIN TRAP

  You charge: $20/month per user
  Average usage: 200 AI interactions/month

  Your cost per interaction: $0.09
  Your cost per user/month: $0.09 × 200 = $18.00

  Revenue:    $20.00
  AI cost:   -$18.00
  Other cost: -$3.00  (servers, support, payment processing)
  ──────────────────
  Profit:     -$1.00 per user per month

  YOU ARE PAYING USERS $1/MONTH TO USE YOUR PRODUCT.

The trap is insidious because it does not look like a problem from the outside. Revenue is growing. Users are happy. The product is working. But every new customer makes you poorer.

The most dangerous variant of the margin trap is power user economics. In traditional SaaS, power users are your best customers — they use more features, they are more likely to upgrade, and they churn less. In AI products, power users can be your most expensive customers:

User SegmentMonthly InteractionsCostRevenueMargin
Light user30$2.70$20+$17.30 (87%)
Medium user150$13.50$20+$6.50 (33%)
Heavy user500$45.00$20-$25.00 (-125%)
Power user2,000$180.00$20-$160.00 (-800%)

Your light users subsidize your power users. If you attract more power users (which any good product does), your margins collapse.

Real-Life Example

This is the economics behind Perplexity's pricing. Perplexity Pro costs $20/month and offers "unlimited" AI-powered searches — but each Pro search costs Perplexity approximately $0.05-0.15 in model inference, web crawling, and processing. A heavy user running 100 searches per day costs Perplexity $150-450 per month in inference alone — far more than the $20 subscription. Perplexity's business model depends on the distribution of usage: enough light users (5-10 searches/day) must subsidize the heavy users. If the usage distribution shifts toward power users, the economics break.

Part 4: Pricing Models for AI Products

The pricing challenge for AI products: how do you charge enough to cover variable costs without scaring away users who expect SaaS-like pricing?

Model 1: Usage-Based Pricing

Charge per unit of consumption — per query, per document processed, per minute of audio transcribed.

Pros: Costs and revenue scale together. Heavy users pay more. No margin trap. Cons: Unpredictable bills scare customers. Friction on every interaction ("should I ask this question, or is it too expensive?").

Who uses it: OpenAI API, Anthropic API, AWS Bedrock, Twilio. Enterprise products where the buyer is an engineering team that understands variable costs.

Model 2: Tiered Subscription

Fixed monthly fee with usage tiers. Free tier: 50 queries/month. Pro: 500 queries/month. Enterprise: 5,000 queries/month.

Pros: Predictable for customers. Marketing-friendly ("starts at $20/month"). Upsell path is clear. Cons: You must set tier boundaries correctly. Too generous and you lose money. Too restrictive and users churn.

Who uses it: ChatGPT ($20/month Plus, $200/month Pro), Jasper, Copy.ai. Consumer and prosumer products.

Model 3: Outcome-Based Pricing

Charge for results, not consumption. Per ticket resolved (customer support AI), per lead generated (sales AI), per document processed (legal AI).

Pros: Aligns your price with the value you deliver. Customers pay for outcomes they care about, not tokens they do not understand. Cons: You absorb the cost variance. Some tickets take 2 AI calls to resolve; some take 20. Your margin depends on your AI's efficiency.

Who uses it: Intercom's AI agent (charges per resolution), some legal AI tools (per contract reviewed). Emerging and growing.

Model 4: Hybrid (Base + Usage)

A base subscription fee plus a per-unit charge above a threshold. "$20/month includes 200 queries. $0.05 per query after that."

Pros: Revenue floor from subscriptions. Cost alignment from usage charges. Best of both worlds. Cons: More complex to communicate. Customers need to understand two pricing dimensions.

Who uses it: Cursor ($20/month + metered premium completions), Notion AI ($10/member/month with fair-use limits).

PRICING MODEL COMPARISON

                    Revenue           Cost
                    Predictability    Alignment    Simplicity
────────────────────────────────────────────────────────────
Usage-based         Low               High         Medium
Tiered subscription High              Low          High
Outcome-based       Medium            Medium       High
Hybrid              Medium            High         Low
────────────────────────────────────────────────────────────
Real-Life Example

Cursor (the AI-powered code editor) uses a hybrid model as of March 2026. The Pro plan costs $20/month and includes 500 "fast" completions using their most capable model. Beyond 500, users are either throttled to a slower model or can pay per completion. This creates a natural usage cap that prevents the margin trap while giving power users an upgrade path. Cursor's pricing acknowledges the core tension: developers want unlimited AI assistance, but unlimited AI assistance at the quality they want costs more than $20/month.

Part 5: Cost Optimization — The 10x Playbook

The good news: there are proven techniques to reduce AI inference costs by 3-10x without reducing quality. The discipline of cost optimization is what separates AI companies that achieve sustainable margins from those that burn through venture capital.

Technique 1: Model Routing (5-10x savings)

Not every query needs your most expensive model. A user asking "what is my account balance?" does not need Opus. A user asking "analyze this contract and identify non-standard clauses" does.

Model routing (using a classifier to direct each query to the cheapest model that can handle it adequately) is the single most effective cost optimization.

MODEL ROUTING ARCHITECTURE

  User Query
      │
      ▼
┌──────────────┐
│  CLASSIFIER  │  (small, fast, cheap model)
│  Complexity:  │
│  Low/Med/High│
└──┬────┬────┬─┘
   │    │    │
   ▼    ▼    ▼
┌─────┐┌──────┐┌──────┐
│Haiku││Sonnet││ Opus │
│$0.25││ $3   ││ $15  │  per million input tokens
│/M   ││ /M   ││ /M   │
└─────┘└──────┘└──────┘

COST IMPACT (10,000 queries/day):
  Without routing (all Opus):    $150/day
  With routing (70/20/10 split): $25/day
  SAVINGS: 83%

The classifier itself can be a small model, a keyword-based rule system, or a fine-tuned classifier. The classifier costs are negligible (fractions of a cent per query) compared to the savings from routing.

Technique 2: Prompt Caching (up to 90% savings on repeated context)

If your system prompt, few-shot examples, or retrieved documents are the same across many queries, you are paying for the same input tokens repeatedly. Prompt caching (a feature offered by Anthropic, OpenAI, and others) stores these repeated prompt prefixes and charges reduced rates for cached tokens.

Anthropic's prompt caching: cached input tokens cost 10% of the normal input price. If your system prompt is 2,000 tokens and you make 1,000 calls, you pay full price for the first call and 90% less for the next 999.

WITHOUT CACHING:
  System prompt: 2,000 tokens × $15/M × 1,000 calls = $30.00

WITH CACHING:
  First call:     2,000 tokens × $15/M × 1 call  = $0.03
  Cached calls:   2,000 tokens × $1.50/M × 999    = $3.00
  Total:                                           = $3.03

  SAVINGS: 90%

Technique 3: Semantic Caching (30-70% savings for repeated questions)

Different from prompt caching. Semantic caching stores complete query-response pairs. When a new query comes in, check if a semantically similar query has been asked before. If so, return the cached response without calling the model at all.

Implementation: embed the incoming query, search your cache of previous query embeddings, and if similarity exceeds a threshold (e.g., cosine similarity > 0.95), return the cached response.

This works exceptionally well for products with repetitive queries — customer support (many users ask the same questions), documentation search (common questions about setup, billing, features), and educational tools.

TechniqueSavingsBest ForComplexity
Model routing5-10xAll AI productsMedium
Prompt cachingUp to 90%Products with long, stable system promptsLow
Semantic caching30-70%Products with repetitive queriesMedium
Batching50%Non-real-time processing (reports, analysis)Low
Output length limits20-40%Products where shorter responses sufficeLow

Technique 4: Batching (50% savings)

Anthropic's Batch API processes requests asynchronously at 50% of the standard price. If your use case does not require real-time responses — nightly report generation, bulk document processing, periodic content creation — batch it.

The tradeoff: responses arrive within 24 hours instead of seconds. For real-time features, this is not an option. For background processing, it is free money.

Technique 5: Output Length Management

LLM output tokens are 3-5x more expensive than input tokens (for Opus: $15/M input, $75/M output). Every unnecessary word in the model's response costs money.

Techniques:

  • Explicit length constraints in the system prompt: "Respond in 2-3 sentences maximum."
  • Use max_tokens parameter to hard-cap output length
  • Post-process: if the model generates 500 tokens but you only need the first 200, truncate (though you still pay for generated tokens up to the stop point)

Part 6: Real-World AI Pricing Case Studies

Perplexity

  • Free tier: ~5 Pro searches/day (using smaller models)
  • Pro ($20/month): 300+ Pro searches/day, access to more capable models
  • Cost structure: Each Pro search costs $0.05-0.15 (web crawling + multiple model calls + synthesis)
  • The bet: Most Pro users will use 5-10 searches/day ($0.25-1.50/day cost), covering the power users who use 50+ searches/day ($2.50-7.50/day cost)

Notion AI

  • $10/member/month added to existing Notion subscription
  • Uses model routing: simple tasks (summarize, translate) → cheaper models; complex tasks (write, analyze) → capable models
  • Fair-use policy: no hard limit, but extremely heavy users get throttled
  • Advantage: Notion already has the subscription relationship — AI is an upsell, not a standalone product

Cursor

  • Free: 2,000 code completions
  • Pro ($20/month): 500 fast premium requests + unlimited slow requests
  • Business ($40/user/month): Higher limits + admin controls
  • Model routing is core: Tab completions use a small, fast model. Chat uses Sonnet. "Thinking" mode uses Opus or equivalent.
  • The 500 fast request cap explicitly addresses the margin trap — it sets a ceiling on per-user cost

ChatGPT

  • Free: GPT-4o-mini with limits
  • Plus ($20/month): GPT-4o, some o1 access, DALL-E, browsing
  • Pro ($200/month): Unlimited access to all models including o1-pro
  • The economics: At $200/month, a Pro user doing 50 o1-pro queries/day costs OpenAI significantly more than $200. The $200 price point likely loses money on heavy users but positions OpenAI as the premium option, driving enterprise leads.
Why This Exists

Every AI company's pricing is a bet on usage distribution. If most users are light users, the math works. If your product attracts disproportionately heavy users (developers, researchers, writers who use AI all day), the math breaks. Understanding your usage distribution — and designing features and limits to shape it — is as important as the pricing itself.

Part 7: Building a Sustainable AI Business

The path from "this costs more than we charge" to "this is a sustainable business" follows a predictable sequence:

AI BUSINESS MATURITY MODEL

Stage 1: PROTOTYPE
  Cost: $$$$ (no optimization)
  Revenue: $0
  Goal: Prove the product works

Stage 2: LAUNCH
  Cost: $$$ (basic optimization)
  Revenue: $ (first customers)
  Goal: Prove users will pay
  Margin: Often negative

Stage 3: OPTIMIZE
  Cost: $$ (routing, caching, batching)
  Revenue: $$ (growing customer base)
  Goal: Reach positive unit economics
  Margin: 30-50%

Stage 4: SCALE
  Cost: $ (custom models, edge inference)
  Revenue: $$$ (pricing power from retention)
  Goal: Expand margins while growing
  Margin: 50-70%

Stage 5: MOAT
  Cost: $ (proprietary efficiency)
  Revenue: $$$$ (premium pricing justified by data moat)
  Goal: Defensibility
  Margin: 60-75%

The key insight: AI margins improve over time if you invest in optimization. Model costs drop (Anthropic and OpenAI reduce prices regularly — Claude Sonnet's price has dropped 3x since its initial release). Your optimization techniques improve (better routing, better caching, fine-tuned smaller models that match larger model quality for your specific use case). Your data moat grows (the more user interactions you have, the better your semantic cache, the more training data for fine-tuning).

The companies that fail are the ones that skip Stage 3 — they launch, acquire users, and try to scale without optimizing. They end up in the margin trap at massive scale, where the losses are too large to fix with incremental optimizations.

Real-Life Example

GitHub Copilot reportedly lost an average of $20 per user per month in its first year (2022-2023), with some power users costing GitHub over $80/month against a $10/month subscription. Microsoft absorbed these losses as a strategic investment — Copilot drove GitHub Enterprise adoption, which generates far more revenue. This strategy works if you are Microsoft. If you are a startup, you need to reach positive unit economics before the venture funding runs out.

Exercise: Build a Cost Model for Your AI Product

Take your product (or a product you are planning) and build a detailed cost model.

Step 1: Map Every AI Call

List every place in your product where an AI model is called. For each call, estimate:

  • Average input tokens (system prompt + user input + context)
  • Average output tokens
  • Model used (and price per million tokens)
  • Frequency (calls per user per day/month)

Step 2: Calculate Per-User Cost

Per-user monthly cost = Sum of (calls per month x cost per call)

Step 3: Model Three Scenarios

ScenarioLight User (P25)Medium User (P50)Heavy User (P90)
Monthly interactions???
Monthly AI cost???
Your price???
Margin???

Step 4: Apply Optimizations

For each optimization technique (model routing, prompt caching, semantic caching, batching, output limits), estimate the cost reduction. Recalculate your margins.

Step 5: Set Your Pricing

Based on your cost model, choose a pricing strategy (usage-based, tiered, hybrid, outcome-based). Justify your choice with the numbers.

Deliverable: A spreadsheet or document showing your cost per interaction, cost per user at different usage levels, break-even analysis, and pricing recommendation.

This exercise should take 90 minutes. It will be the most important financial analysis you do for your AI product.

Chapter endnotes

[1] The a16z analysis of AI company economics, "Who Owns the Generative AI Platform?" and follow-up posts on inference costs (2023-2024), documented the 20-40% revenue allocation to inference and the margin compression facing AI-native companies. These reports are available on the a16z website and have been widely cited in the AI startup ecosystem.

[2] SaaS gross margin benchmarks (77-88%) are from public company filings and industry analyses by KeyBanc Capital Markets, Bessemer Venture Partners ("Cloud Index"), and Gartner. These figures represent mature SaaS companies; early-stage SaaS companies may have lower margins due to scale inefficiencies.

[3] Model pricing data (Opus at $15/$75 per M tokens, Sonnet at $3/$15 per M tokens, Haiku pricing) is from Anthropic's published API pricing as of March 2026. OpenAI and other providers have comparable tiered pricing. Prices have declined 50-80% since initial launches and are expected to continue declining.

[4] The GitHub Copilot loss-per-user figures ($20/month average, $80/month for power users) were reported by The Wall Street Journal in October 2023, citing internal Microsoft documents. Microsoft has not publicly confirmed specific per-user economics but has acknowledged Copilot as a "strategic investment."

[5] Perplexity's pricing and cost structure analysis is based on publicly available subscription pricing, published API costs for the models Perplexity uses, and industry estimates of web crawling costs. Specific per-query cost figures are estimates based on observable model usage patterns.

[6] Prompt caching pricing (90% discount on cached tokens) is from Anthropic's API documentation. OpenAI offers a similar feature with comparable discounts. The specific savings calculations assume a stable system prompt that remains in cache across requests.

[7] Cursor's pricing evolution — from flat-rate to hybrid with usage caps — is documented in their blog posts and pricing page changes through 2024-2026. The 500 fast request cap was a deliberate response to margin pressure from heavy users.

[8] The "AI Business Maturity Model" in Part 7 is a synthesis of frameworks from a16z, Sequoia Capital, and Y Combinator partner essays on AI startup economics, combined with observable patterns from companies that have navigated the prototype-to-scale journey.