aimargin.dev · Book a call
First 100 audits at $2,500 — founding pricing

Cut your AI bill 40–60% in two weeks.

A fixed-fee audit of your LLM provider routing. Per-call-type cost breakdown, fallback-chain code, and a 30/60/90-day rollout plan — built by an operator who runs the same stack across 15+ providers in production.

$5,000 fixed-fee · 1 week turnaround · 100% refund within 7 days · Mutual NDA · No vendor kickbacks

Provider coverage

We route across 15+ AI providers in production.

The right one for each task type — text, vision, grounded search, brand voice, voice, video — is what drives the savings. Here’s the working set we benchmark and ship against.

  • Anthropic
  • OpenAI
  • Google
  • Groq
  • Cerebras
  • DeepSeek
  • Mistral
  • Together AI
  • SambaNova
  • Cohere
  • Fireworks
  • Replicate
  • ElevenLabs
  • fal.ai
  • Perplexity

Plain text: Groq → Cerebras → DeepSeek → Anthropic. Vision: Vertex Gemini Flash. Grounded search: Vertex with Google Search citations. Brand-visible writing: Anthropic Claude Sonnet. Voice: ElevenLabs. That’s the kind of routing your audit ships with.
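
The chains above could be written down as a small lookup table. This is a minimal sketch, not the shipped router: the `TaskType` names, the `ROUTES` constant, and the `chainFor` helper are hypothetical, and the provider labels are plain strings rather than real SDK identifiers.

```typescript
// Hypothetical task types, taken from the routing described above.
type TaskType = "text" | "vision" | "grounded" | "brand" | "voice";

// Ordered provider chains: the first entry is the primary, the rest are fallbacks.
// Labels are illustrative strings, not SDK identifiers.
const ROUTES: Record<TaskType, readonly string[]> = {
  text: ["groq", "cerebras", "deepseek", "anthropic"],
  vision: ["vertex-gemini-flash"],
  grounded: ["vertex-google-search"],
  brand: ["anthropic-claude-sonnet"],
  voice: ["elevenlabs"],
};

// Return the chain for a task, skipping providers currently marked unhealthy.
function chainFor(
  task: TaskType,
  unhealthy: ReadonlySet<string> = new Set<string>(),
): string[] {
  return ROUTES[task].filter((provider) => !unhealthy.has(provider));
}
```

Keeping the chain as data rather than as branching code means a provider swap is a one-line change that is easy to review and to roll back.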

15+ AI providers in rotation
1.4M LLM calls / month routed
60% AI bill reduction (own stack)
<1% quality delta vs OpenAI baseline

These are my own production numbers from Bailar, not benchmark theatre. The audit ports the same engineering to your stack.

Where the money leaks

Nine ways your AI bill is bigger than it has to be.

If three or more of these sound familiar, your stack has 30–55% of waste sitting on the table. The audit isolates which ones are yours and what swap fixes each.

  • 01

    Single-provider lock-in

    You're paying a premium for a model that's overkill for 80% of your traffic. There's a Groq, Cerebras, or DeepSeek call that does the same job for 1/10 the cost.

  • 02

    Cost opacity

    Your invoice shows one number. You can't tell which feature, which prompt, or which user cohort is the budget hog — so you can't fix it.

  • 03

    Quality fear

    You'd switch providers if you weren't terrified of waking up to a quality regression. Without shadow-test infrastructure, every swap is a coin flip.

  • 04

    Vision tax

    Sending an image to GPT-4o costs 5–10× what the same prompt costs on Gemini Flash. Most teams don't realize vision pricing is a separate world.

  • 05

    Spec stuffing

    You're running Claude Sonnet on a binary classification call when Cerebras Llama-3.3-70B does it for $0.0001 per call. The model isn't the bottleneck — the routing is.

  • 06

    Caching & batch lying dormant

    Anthropic prompt caching cuts the cost of cached input tokens by up to 90% on repeated context. The OpenAI Batch API cuts cost 50% on async work. Both go unused on most stacks I audit.

  • 07

    Retry storms

    A bad rate-limit handler re-issues paid calls after the cheap fallback already returned. Two providers, both billed, no quality gain.

  • 08

    Vendor risk

    One provider price-hike, suspension, or outage takes you offline. Multi-source routing is durability insurance, not just a savings play.

  • 09

    No observability

    If you can't diff cost-per-task and quality-per-task between provider chains, you can't pick the right one. Most teams ship blind.
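
Items 02 and 09 both come down to attributing spend to a task. A minimal sketch of that attribution, assuming every call is logged with a task label, token counts, and the per-million-token prices of the model that served it. The `CallRecord` shape and both function names are hypothetical.

```typescript
// Hypothetical log record: one row per LLM call, tagged with the feature that issued it.
interface CallRecord {
  task: string;          // e.g. "classify" or "vision-caption"
  inputTokens: number;
  outputTokens: number;
  inPricePerM: number;   // $ per million input tokens for the model used
  outPricePerM: number;  // $ per million output tokens for the model used
}

// Dollar cost of a single call from its token counts and per-million prices.
function callCost(r: CallRecord): number {
  return (r.inputTokens * r.inPricePerM + r.outputTokens * r.outPricePerM) / 1_000_000;
}

// Sum cost per task and return [task, dollars] pairs, most expensive first.
function costByTask(records: CallRecord[]): Array<[string, number]> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.task, (totals.get(r.task) ?? 0) + callCost(r));
  }
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
}
```

The same aggregation can be keyed by provider chain instead of task, which gives the cost-per-task diff between chains that item 09 describes.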

The flow

Same workload. Two routing chains. Sixty percent off.

Most stacks send every call to one provider. The audit returns a routing chain that pairs each task type with the cheapest model that holds quality. Here's the before-and-after I draw on the whiteboard.

Today — single-provider stack

One bill. One vendor. One overpriced model on every call.

Your stack: 40K events · 10K images / mo
One destination: OpenAI GPT-4o
Text · Vision · Grounded · Brand voice (everything)
Monthly bill: $42,000
  • 80% of traffic is binary classification, served by a massively over-spec’d model.
  • Vision tax: every image costs 5–10× what cheaper alternatives charge.
  • No fallback: one outage and the product goes down.

After audit — cost-aware routing

Same workload. Right model per task. Bill collapses.

Your stack: unchanged
routeByTask(call): TypeScript router, shipped in your repo
  • Groq
    Plain text classify
    $0.0001 / call
  • Cerebras
    Bulk text retry
    $0.0001 / call
  • Vertex Gemini
    Vision · grounded
    $0.0008 / img
  • Anthropic Sonnet
    Brand-visible writing
    $0.015 / call
Monthly bill: $17,000 (-60%)
  • Quality vetted via shadow tests; deltas inside 1% before any swap goes live.
  • Automatic failover when a provider degrades or rate-limits.
  • Per-task cost telemetry baked in — no more invoice guessing.

These are illustrative numbers. Your actual chain depends on your traffic mix, compliance constraints, and quality bar — that’s what week one of the audit produces.
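
The "deltas inside 1%" gate can be sketched as a simple check. This assumes quality is measured as exact-match accuracy against labeled answers on a fixed eval set, which suits classification but not free-form writing. The `ShadowSample` shape, the function names, and the exact-match metric are assumptions for illustration, not the audit's actual harness.

```typescript
// One shadow-test sample: the reference label plus what each chain answered.
interface ShadowSample {
  expected: string;
  baseline: string;   // output of the current provider
  candidate: string;  // output of the cheaper chain, run in shadow (not served to users)
}

// Fraction of samples where the chosen output matches the reference label.
function accuracy(samples: ShadowSample[], pick: (s: ShadowSample) => string): number {
  if (samples.length === 0) throw new Error("no samples");
  const hits = samples.filter((s) => pick(s) === s.expected).length;
  return hits / samples.length;
}

// Allow the swap only if the candidate loses at most maxDelta accuracy vs the baseline.
function safeToSwap(samples: ShadowSample[], maxDelta = 0.01): boolean {
  const delta =
    accuracy(samples, (s) => s.baseline) - accuracy(samples, (s) => s.candidate);
  return delta <= maxDelta;
}
```

Running the candidate in shadow means users only ever see the baseline output until the gate passes.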

Quality vs cost

Pareto frontier — score per dollar, by call type

Live snapshot of arena.ai's blind side-by-side rankings, plotted against each model's blended $/M-token cost (70% input / 30% output). Models on the green frontier are not dominated by anything cheaper — spend above the frontier is leaving money on the table. Bailar's in-production picks per call type are pinned in red.
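
The blended figure is a weighted average of the input and output prices. A short sketch of that arithmetic, using the 70/30 weighting stated above; the function name `blendedCost` is hypothetical.

```typescript
// Blended $/M-token cost with the 70% input / 30% output weighting described above.
function blendedCost(inPerM: number, outPerM: number, inputShare = 0.7): number {
  return inputShare * inPerM + (1 - inputShare) * outPerM;
}

// Example with the claude-opus-4-7-thinking prices listed on this page: $5.00 in, $25.00 out.
const opusBlended = blendedCost(5.0, 25.0);
console.log(opusBlended.toFixed(2)); // prints "11.00"
```

A workload with a different input/output mix, for example long generations from short prompts, would call for a different `inputShare`, and the ranking can shift with it.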

[Figure: scatter plot of arena ELO score (axis ticks 972, 1238, 1503) against blended cost in $/M tokens (log scale, $0.05 to $50), with one labeled point per model.]

Cheapest model at every quality tier (Text)

Model                          Provider   Score  $/M in  $/M out  Blended  Context
claude-opus-4-7-thinking       Anthropic  1503   $5.00   $25.00   $11.00   1M
gemini-3.1-pro-preview         Google     1493   $2.00   $12.00   $5.00    1M
grok-4.20-beta-0309-reasoning  xAI        1477   $2.00   $6.00    $3.20    2M
gemini-3-flash                 Google     1474   $0.50   $3.00    $1.25    1M
deepseek-v4-pro                DeepSeek   1463   $0.43   $0.87    $0.56    1M
gemma-4-31b                    Google     1451   $0.14   $0.40    $0.22    262.1K
deepseek-v4-flash-thinking     DeepSeek   1439   $0.14   $0.28    $0.18    1M
mimo-v2-flash (non-thinking)   Xiaomi     1392   $0.09   $0.29    $0.15    262.1K
gemma-3-27b-it                 Google     1366   $0.08   $0.16    $0.10    131.1K
gpt-oss-120b                   OpenAI     1353   $0.04   $0.18    $0.08    131.1K
gemma-3-12b-it                 Google     1342   $0.04   $0.13    $0.07    131.1K
gpt-oss-20b                    OpenAI     1317   $0.03   $0.14    $0.06    131.1K

Source: arena.ai/leaderboard/text · scraped 5/6/2026 · models, with public pricing. Refresh: npx tsx tools/aimargin/scrape-arena.ts
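
A frontier like the one above can be computed from (score, cost) pairs with one sort and one pass. This is a generic sketch of Pareto filtering, not the contents of `scrape-arena.ts`; the `ModelPoint` shape and the function name are hypothetical.

```typescript
interface ModelPoint {
  name: string;
  score: number;  // arena score, higher is better
  cost: number;   // blended $/M tokens, lower is better
}

// A model is on the frontier if nothing cheaper (or equally cheap) scores at least as well.
// Sort by cost ascending (ties: higher score first), then keep each new best score.
function paretoFrontier(models: ModelPoint[]): ModelPoint[] {
  const sorted = models.slice().sort((a, b) => a.cost - b.cost || b.score - a.score);
  const frontier: ModelPoint[] = [];
  let bestScore = -Infinity;
  for (const m of sorted) {
    if (m.score > bestScore) {
      frontier.push(m);
      bestScore = m.score;
    }
  }
  return frontier;
}
```

The result comes back ordered from cheapest to most expensive, so reading it in reverse gives the "cheapest model at every quality tier" view.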

What you get

Four artifacts that pay for the engagement on day one.

No deck-only deliverables. No 'recommendations' that take a quarter to ship. Code, plan, and walkthrough — all in seven days.

  • Deliverable 01

    12-section audit report

    Per-call-type cost breakdown (text, vision, grounded search, brand voice, embeddings) with specific routing recommendations and dollar-savings projections per task type. Delivered as PDF + Markdown; yours forever.

  • Deliverable 02

    Production-ready TypeScript

    A drop-in routeByTask() entry point + 5 typed provider clients (OpenAI, Anthropic, Vertex, Groq, Cerebras) with retry, jittered backoff, timeout, graceful fallback, and observability hooks. Pasted into your repo.

  • Deliverable 03

    30/60/90-day rollout plan

    Shadow-mode plan, 10/50/100 traffic ramp, telemetry hooks for cost + latency + fallback-rate, and the daily-cost Slack digest you can wire up day one.

  • Deliverable 04

    Delivery walkthrough + week of Q&A

    30-min call where I walk your team through the report, the code, and the open questions. After delivery, a full week of email Q&A is included at no extra charge.
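
The retry, jittered backoff, timeout, and fallback behaviour named in Deliverable 02 can be sketched in a few lines. This is an illustration under stated assumptions, not the delivered code: `ProviderCall`, `withTimeout`, `callWithFallback`, and the default option values are all hypothetical, and each provider is modelled as a plain async function rather than a real SDK client.

```typescript
// A provider call is any async function taking the prompt; real clients would wrap an SDK.
type ProviderCall = (prompt: string) => Promise<string>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Reject if the call has not settled within timeoutMs.
function withTimeout<T>(p: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try each provider in order. Within a provider, retry with full-jitter exponential
// backoff; move to the next provider only after the current one is exhausted.
// Providers are tried one after another, never raced in parallel.
async function callWithFallback(
  chain: ProviderCall[],
  prompt: string,
  opts = { retries: 2, baseDelayMs: 100, timeoutMs: 5_000 },
): Promise<string> {
  let lastError: unknown = new Error("empty provider chain");
  for (const call of chain) {
    for (let attempt = 0; attempt <= opts.retries; attempt++) {
      try {
        return await withTimeout(call(prompt), opts.timeoutMs);
      } catch (err) {
        lastError = err;
        if (attempt < opts.retries) {
          // Full jitter: random delay in [0, baseDelayMs * 2^attempt).
          await sleep(Math.random() * opts.baseDelayMs * 2 ** attempt);
        }
      }
    }
  }
  throw lastError;
}
```

Note that `withTimeout` only stops waiting; it does not cancel the underlying request. A production client would likely also pass an abort signal to the HTTP layer so a timed-out call is not left running and billed.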

How it works

Four steps. One week. Zero surprises.

The audit is structured because the analysis has to be — every routing decision needs its own quality check before flipping. Here's exactly how that breaks down.

  1. Step 01 · Day 0 · 15 min

    Discovery call

    We confirm fit on a short call. I share my audit checklist, you share where the money is going. If it doesn't make sense, you walk away — no commitment, no fee.

  2. Step 02 · Days 1–4 · async

    Access + analysis

    Read-only API access (or CSVs). I map your call patterns, group by task type, and benchmark each against the routing alternatives. You get a daily Slack note on what I'm finding.

  3. Step 03 · Days 5–6 · delivery

    Recommendations + code

    Audit report + drop-in TypeScript routing layer + 30/60/90 plan land in your repo. We do a 30-min walkthrough; I stay on email Q&A for the following week.

  4. Step 04 · Optional · 2 weeks

    Implementation (if you want it)

    If you'd rather I ship it: 2-week implementation engagement. Shadow tests, traffic ramp, observability counters wired up, 30 days of post-launch tuning, final eng-team walkthrough.
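
The 10/50/100 traffic ramp mentioned in the rollout plan is commonly done with a stable hash, so a given user stays on the same side of the split as the percentage grows. A minimal sketch under that assumption; `RAMP_STAGES`, `bucketOf`, and `useNewChain` are hypothetical names, and the FNV-1a-style hash is one reasonable choice among many.

```typescript
// Ramp stages from the rollout plan: percent of traffic sent to the new routing chain.
const RAMP_STAGES = [10, 50, 100] as const;

// Deterministic bucket in [0, 100) from a stable key (user id, request id, ...).
// Uses an FNV-1a-style 32-bit hash over UTF-16 code units, so the same key
// always lands in the same bucket.
function bucketOf(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// True if this key should use the new chain at the given ramp percentage.
function useNewChain(key: string, rampPercent: number): boolean {
  return bucketOf(key) < rampPercent;
}

// Show which stages would include one example key.
for (const pct of RAMP_STAGES) {
  console.log(pct, useNewChain("user-42", pct));
}
```

Because the bucket is fixed per key, anyone included at 10% is still included at 50% and 100%, which keeps per-user behaviour stable during the ramp.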

Why aimargin

Three ways to fix this. Only one is fixed-fee, code-shipped, and refundable.

There's nothing magic here. The 1-week timeline and production knowledge are the only things that justify the price, and the refund is the only thing that justifies the engagement.

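Compared: DIY · Generalist consultant · aimargin

Provider-specific knowledge
  • DIY: Self-taught
  • Generalist consultant: Broad, shallow
  • aimargin: Production operator

Time investment
  • DIY: 80+ hours of yours
  • Generalist consultant: 4-week engagement
  • aimargin: 1 week

Code shipped
  • aimargin: TS routing layer + 5 clients

Quality safeguards
  • DIY: Manual / hopeful
  • Generalist consultant: Manual
  • aimargin: Shadow tests + diff harness

Refund
  • aimargin: 100% within 7 days

Vendor independence
  • Generalist consultant: Often biased
  • aimargin: No kickbacks, ever

Price
  • DIY: (your time)
  • Generalist consultant: $20–50K
  • aimargin: $5K · or $2.5K (founding 100)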

Pricing

Two tiers. Both fixed-fee. Both refundable.

No retainer. No success fees. No 'we'll talk about price after the conversation.' If the audit doesn't make the case for the implementation, the implementation doesn't happen.

Most popular

Audit

$5,000

fixed · 1 week · 100% refund within 7 days

The full report, the routing layer, and the rollout plan — built for one stack.

  • 12-section audit report (PDF + Markdown)
  • Drop-in routeByTask() implementation
  • 5 typed provider clients (OpenAI, Anthropic, Vertex, Groq, Cerebras)
  • 30/60/90-day rollout plan with shadow-mode + traffic ramp
  • 30-min delivery walkthrough
  • 1 week of email Q&A (no extra charge)

First 100 customers: $2,500 (50% off) with code FOUNDING50.

Optional add-on

Implementation

$15,000

fixed · 2 weeks · post-audit only

If you'd rather I ship the rollout for you. Discussed at the end of the audit, never before.

  • I land the routing in your repo
  • Shadow mode + 10/50/100 traffic ramp
  • Observability counters wired up (cost, latency, fallback rate)
  • Daily-cost Slack digest configured
  • 30 days of post-launch tuning
  • Final eng-team walkthrough

Pay only if the audit makes the case.

A note from the founder

I learned this the hard way. You don’t have to.

I run a consumer app called Bailar that classifies ~40,000 events a month with text LLMs and runs ~10,000 images a month through vision models. I started on OpenAI-only. The bill was wild.

Routing the right call to the right provider — Groq for plain-text classify, Vertex Gemini Flash for vision, Anthropic Sonnet for brand-visible writing, Cerebras as failover — cut the LLM bill ~60% versus the OpenAI-only baseline. Quality stayed inside a 1% blind-eval delta. It wasn’t magic. It was discipline + a benchmark harness.

The same pattern reclaims 40–60% on most AI startups in the seed-to-Series-A range. The audit takes a week because the analysis is surgical — not because it’s hard, but because each call type needs its own quality vetting before the routing flips on.

I lost a weekend to a GCP free-tier suspension last year. There’s a right way to multi-source, including the boring “don’t do this” rules nobody writes down. I’ll show you both halves.

— Paul Plawin · founder · paul@aimargin.dev · LinkedIn

FAQ

Twelve answers to the questions every CTO asks me first.

If your question isn't here, drop a note at the bottom of the page — I read every one within 24 hours.

Get your AI bill audited

Two ways in. Same person on the other end.

Book a 15-minute discovery call directly, or drop a note and I'll pick up the thread within 24 hours. Either way, you're talking to the operator who built it.

Book the discovery call

15 min, free, no commitment. We’ll talk about your stack, where the cost is, and whether the audit is the right fit.

Not ready to book? Send a quick note.

I’ll reply within 24 hours.

No drip sequences. No newsletter. I read it, I reply, that’s it.