Cut your AI bill 40–60%
in two weeks.
A fixed-fee audit of your LLM provider routing. Per-call-type cost breakdown, fallback-chain code, and a 30/60/90-day rollout plan — built by an operator who runs the same stack across 15+ providers in production.
Provider coverage
We route across 15+ AI providers in production.
The right one for each task type — text, vision, grounded search, brand voice, voice, video — is what drives the savings. Here’s the working set we benchmark and ship against.
- Anthropic: Brand-visible writing
- OpenAI: Vision · image gen
- Google: Vertex Gemini · grounded search
- Groq: Fast text classify
- Cerebras: Bulk text inference
- DeepSeek: Cheap text escalation
- Mistral: Free-tier fallback
- Together AI: Llama hosting
- SambaNova: Free Llama fallback
- Cohere: Reranking · embeddings
- Fireworks: Open-model hosting
- Replicate: Image · video models
- ElevenLabs: Voice TTS
- fal.ai: Veo · Hailuo · Runway
- Perplexity: Search-grounded answers
Plain text: Groq → Cerebras → DeepSeek → Anthropic. Vision: Vertex Gemini Flash. Grounded search: Vertex with Google Search citations. Brand-visible writing: Anthropic Claude Sonnet. Voice: ElevenLabs. That’s the kind of routing your audit ships with.
These are my own production numbers from Bailar, not benchmark theatre. The audit ports the same engineering to your stack.
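A minimal sketch of how the routing above could be written down as a typed table. The task-type keys, provider identifiers, and the `chainFor` helper are hypothetical names chosen for illustration; only the chain order comes from the text above.

```typescript
// Hypothetical task-type and provider identifiers; the order of each chain
// mirrors the routing described above.
type TaskType = "text" | "vision" | "groundedSearch" | "brandWriting" | "voice";

type Provider =
  | "groq"
  | "cerebras"
  | "deepseek"
  | "anthropic"
  | "vertex"
  | "elevenlabs";

// Each task type maps to an ordered chain: the first entry is tried first,
// later entries are fallbacks.
const ROUTES: Record<TaskType, readonly Provider[]> = {
  text: ["groq", "cerebras", "deepseek", "anthropic"],
  vision: ["vertex"],
  groundedSearch: ["vertex"],
  brandWriting: ["anthropic"],
  voice: ["elevenlabs"],
};

// Returns the ordered provider chain for a task type.
function chainFor(task: TaskType): readonly Provider[] {
  return ROUTES[task];
}
```

Keeping the chains in one typed table means a routing change is a one-line diff that the compiler checks against the list of known providers.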
Where the money leaks
Nine ways your AI bill is bigger than it has to be.
If three or more of these sound familiar, your stack has 30–55% of waste sitting on the table. The audit isolates which ones are yours and what swap fixes each.
- 01
Single-provider lock-in
You're paying premium for a model that's overkill for 80% of your traffic. There's a Groq, Cerebras, or DeepSeek call that does the same job for 1/10 the cost.
- 02
Cost opacity
Your invoice shows one number. You can't tell which feature, which prompt, or which user cohort is the budget hog — so you can't fix it.
- 03
Quality fear
You'd switch providers if you weren't terrified of waking up to a quality regression. Without shadow-test infrastructure, every swap is a coin flip.
- 04
Vision tax
Sending an image to GPT-4o costs 5–10× what the same prompt costs on Gemini Flash. Most teams don't realize vision pricing is a separate world.
- 05
Spec stuffing
You're running Claude Sonnet on a binary classification call when Cerebras Llama-3.3-70B does it for $0.0001 per call. The model isn't the bottleneck — the routing is.
- 06
Caching & batch lying dormant
Anthropic prompt caching cuts the cost of repeated context by up to 90%. OpenAI Batch API cuts cost 50% on async work. Both unused on most stacks I audit.
- 07
Retry storms
A bad rate-limit handler re-issues paid calls after the cheap fallback already returned. Two providers, both billed, no quality gain.
- 08
Vendor risk
One provider price-hike, suspension, or outage takes you offline. Multi-source routing is durability insurance, not just a savings play.
- 09
No observability
If you can't diff cost-per-task and quality-per-task between provider chains, you can't pick the right one. Most teams ship blind.
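Cost opacity (02) and missing observability (09) come down to not tagging calls. As a rough sketch, assuming each call is logged with a feature, a task type, a provider, and a computed cost, a per-dimension breakdown is a few lines. The `CallRecord` shape and both function names are illustrative assumptions, not part of any provider SDK.

```typescript
// One record per LLM call. Field names are illustrative assumptions.
interface CallRecord {
  feature: string; // product feature that issued the call
  taskType: string; // e.g. "classify", "vision"
  provider: string;
  costUsd: number; // cost of this call, computed from its token counts
}

// Sums cost per key so the single invoice number can be split by any dimension.
function costBy(
  records: CallRecord[],
  key: "feature" | "taskType" | "provider",
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r[key], (totals.get(r[key]) ?? 0) + r.costUsd);
  }
  return totals;
}

// Sorts a breakdown so the biggest spender comes first.
function topSpenders(totals: Map<string, number>): [string, number][] {
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1]);
}
```

Calling `topSpenders(costBy(records, "feature"))` answers the "which feature is the budget hog" question directly; swapping the key to `"provider"` gives the per-chain view.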
The flow
Same workload. Two routing chains. Sixty percent off.
Most stacks send every call to one provider. The audit returns a routing chain that pairs each task type with the cheapest model that holds quality. Here's the before-and-after I draw on the whiteboard.
One bill. One vendor. One unpriced model on every call.
- 80% of traffic is binary classification — massively over-spec’d model.
- Vision tax: every image costs 5–10× what cheaper alternatives charge.
- No fallback — one outage and the product goes down.
Same workload. Right model per task. Bill collapses.
- Groq: Plain text classify · $0.0001 / call
- Cerebras: Bulk text retry · $0.0001 / call
- Vertex Gemini: Vision · grounded · $0.0008 / img
- Anthropic Sonnet: Brand-visible writing · $0.015 / call
- Quality vetted via shadow tests; deltas inside 1% before any swap goes live.
- Automatic failover when a provider degrades or rate-limits.
- Per-task cost telemetry baked in — no more invoice guessing.
These are illustrative numbers. Your actual chain depends on your traffic mix, compliance constraints, and quality bar — that’s what week one of the audit produces.
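The "deltas inside 1%" gate can be sketched as a simple check over paired shadow-test scores. This assumes the delta is the relative drop in mean quality score between the current provider and the candidate on the same inputs; the real metric depends on your eval harness. `ShadowSample`, `qualityDelta`, and `safeToSwap` are hypothetical names.

```typescript
// One shadow-test sample: the same input scored on both chains.
interface ShadowSample {
  baselineScore: number; // quality score for the current provider's answer
  candidateScore: number; // quality score for the cheaper provider's answer
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Relative quality drop of candidate vs baseline (positive = candidate is worse).
function qualityDelta(samples: ShadowSample[]): number {
  const base = mean(samples.map((s) => s.baselineScore));
  const cand = mean(samples.map((s) => s.candidateScore));
  return (base - cand) / base;
}

// Gate: only allow the swap when there is data and the drop is within the
// threshold. The 0.01 default mirrors the 1% figure above.
function safeToSwap(samples: ShadowSample[], maxDelta = 0.01): boolean {
  return samples.length > 0 && qualityDelta(samples) <= maxDelta;
}
```

An empty sample set returns `false` on purpose: no shadow data means no swap.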
Quality vs cost
Pareto frontier — score per dollar, by call type
Live snapshot of arena.ai's blind side-by-side rankings, plotted against each model's blended $/M-token cost (70% input / 30% output). Models on the green frontier are not dominated by anything cheaper — spend above the frontier is leaving money on the table. Bailar's in-production picks per call type are pinned in red.
Cheapest model at every quality tier (Text)
| Model | Provider | Score | $/M in | $/M out | Blended | Context |
|---|---|---|---|---|---|---|
| claude-opus-4-7-thinking | Anthropic | 1503 | $5.00 | $25.00 | $11.00 | 1M |
| gemini-3.1-pro-preview | Google | 1493 | $2.00 | $12.00 | $5.00 | 1M |
| grok-4.20-beta-0309-reasoning | xAI | 1477 | $2.00 | $6.00 | $3.20 | 2M |
| gemini-3-flash | Google | 1474 | $0.50 | $3.00 | $1.25 | 1M |
| deepseek-v4-pro | DeepSeek | 1463 | $0.43 | $0.87 | $0.56 | 1M |
| gemma-4-31b | Google | 1451 | $0.14 | $0.40 | $0.22 | 262.1K |
| deepseek-v4-flash-thinking | DeepSeek | 1439 | $0.14 | $0.28 | $0.18 | 1M |
| mimo-v2-flash (non-thinking) | Xiaomi | 1392 | $0.09 | $0.29 | $0.15 | 262.1K |
| gemma-3-27b-it | Google | 1366 | $0.08 | $0.16 | $0.10 | 131.1K |
| gpt-oss-120b | OpenAI | 1353 | $0.04 | $0.18 | $0.08 | 131.1K |
| gemma-3-12b-it | Google | 1342 | $0.04 | $0.13 | $0.07 | 131.1K |
| gpt-oss-20b | OpenAI | 1317 | $0.03 | $0.14 | $0.06 | 131.1K |
Source: arena.ai/leaderboard/text · scraped 5/6/2026 · models, with public pricing. Refresh: npx tsx tools/aimargin/scrape-arena.ts
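The Blended column follows from the 70% input / 30% output weighting, and the frontier is the set of models that nothing cheaper matches or beats on score. A minimal sketch of both calculations, assuming rows shaped like the table above; `ModelRow`, `blendedCost`, and `paretoFrontier` are illustrative names and not the code behind `scrape-arena.ts`.

```typescript
interface ModelRow {
  model: string;
  score: number; // arena score, higher is better
  inPerM: number; // $ per million input tokens
  outPerM: number; // $ per million output tokens
}

// Blended $/M tokens using the 70% input / 30% output weighting.
function blendedCost(row: ModelRow): number {
  return 0.7 * row.inPerM + 0.3 * row.outPerM;
}

// Sweeps models from cheapest to most expensive and keeps a model only if it
// scores strictly higher than everything cheaper. On a cost tie the
// higher-scoring model is visited first, so the lower-scoring one is dropped.
function paretoFrontier(rows: ModelRow[]): ModelRow[] {
  const sorted = rows
    .slice()
    .sort((a, b) => blendedCost(a) - blendedCost(b) || b.score - a.score);
  const frontier: ModelRow[] = [];
  let bestScore = -Infinity;
  for (const row of sorted) {
    if (row.score > bestScore) {
      frontier.push(row);
      bestScore = row.score;
    }
  }
  return frontier;
}
```

For the first table row, `blendedCost` gives 0.7 × 5.00 + 0.3 × 25.00 = 11.00, matching the Blended column. A model priced above gemini-3-flash but scoring below it would be filtered out by `paretoFrontier`.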
What you get
Four artifacts that pay for the engagement on day one.
No deck-only deliverables. No 'recommendations' that take a quarter to ship. Code, plan, and walkthrough — all in seven days.
- Deliverable 01
12-section audit report
Per-call-type cost breakdown (text, vision, grounded search, brand voice, embeddings) with specific routing recommendations and dollar-savings projections per task type. Delivered as PDF + Markdown; yours forever.
- Deliverable 02
Production-ready TypeScript
A drop-in routeByTask() entry point + 5 typed provider clients (OpenAI, Anthropic, Vertex, Groq, Cerebras) with retry, jittered backoff, timeout, graceful fallback, and observability hooks. Pasted into your repo.
- Deliverable 03
30/60/90-day rollout plan
Shadow-mode plan, 10/50/100 traffic ramp, telemetry hooks for cost + latency + fallback-rate, and the daily-cost Slack digest you can wire up day one.
- Deliverable 04
Delivery walkthrough + week of Q&A
30-min call where I walk your team through the report, the code, and the open questions. After delivery, a full week of email Q&A is included at no extra charge.
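To make Deliverable 02 concrete, here is a rough sketch of the general pattern: a fallback chain with per-call timeout, retry with jittered backoff, and an observability hook. It is not the delivered code. `routeByTaskSketch`, `ProviderCall`, `RouteOptions`, and the event shape are hypothetical, and provider clients are modeled as plain async functions rather than real SDK wrappers.

```typescript
// A provider client modeled as an async function; a real client would wrap an SDK.
type ProviderCall = (prompt: string, signal: AbortSignal) => Promise<string>;

interface RouteOptions {
  maxAttemptsPerProvider: number;
  timeoutMs: number;
  baseDelayMs: number;
  // Observability hook, called once per attempt.
  onEvent?: (event: { provider: string; attempt: number; ok: boolean }) => void;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Full-jitter backoff: a random delay in [0, base * 2^attempt).
function jitteredDelay(baseMs: number, attempt: number): number {
  return Math.random() * baseMs * 2 ** attempt;
}

// Runs one call and aborts it if it exceeds the timeout.
async function withTimeout(
  call: ProviderCall,
  prompt: string,
  timeoutMs: number,
): Promise<string> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`timed out after ${timeoutMs} ms`));
    }, timeoutMs);
  });
  try {
    return await Promise.race([call(prompt, controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Tries providers in order. Attempts are strictly sequential, so a fallback is
// never issued while an earlier call is still in flight, and the first success
// returns immediately.
async function routeByTaskSketch(
  chain: { name: string; call: ProviderCall }[],
  prompt: string,
  opts: RouteOptions,
): Promise<{ provider: string; output: string }> {
  let lastError: unknown;
  for (const { name, call } of chain) {
    for (let attempt = 0; attempt < opts.maxAttemptsPerProvider; attempt++) {
      try {
        const output = await withTimeout(call, prompt, opts.timeoutMs);
        opts.onEvent?.({ provider: name, attempt, ok: true });
        return { provider: name, output };
      } catch (err) {
        lastError = err;
        opts.onEvent?.({ provider: name, attempt, ok: false });
        if (attempt + 1 < opts.maxAttemptsPerProvider) {
          await sleep(jitteredDelay(opts.baseDelayMs, attempt));
        }
      }
    }
  }
  throw new Error(`all providers failed: ${String(lastError)}`);
}
```

Passing the `AbortSignal` into each client lets a timed-out request be cancelled rather than left running, provided the client forwards the signal to its HTTP layer.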
How it works
Four steps. One week. Zero surprises.
The audit is structured because the analysis has to be — every routing decision needs its own quality check before flipping. Here's exactly how that breaks down.
- Step 01 · Day 0 · 15 min
Discovery call
We confirm fit on a short call. I share my audit checklist, you share where the money is going. If it doesn't make sense, you walk away — no commitment, no fee.
- Step 02 · Days 1–4 · async
Access + analysis
Read-only API access (or CSVs). I map your call patterns, group by task type, and benchmark each against the routing alternatives. You get a daily Slack note on what I'm finding.
- Step 03 · Days 5–6 · delivery
Recommendations + code
Audit report + drop-in TypeScript routing layer + 30/60/90 plan land in your repo. We do a 30-min walkthrough; I stay on email Q&A for the following week.
- Step 04 · Optional · 2 weeks
Implementation (if you want it)
If you'd rather I ship it: 2-week implementation engagement. Shadow tests, traffic ramp, observability counters wired up, 30 days of post-launch tuning, final eng-team walkthrough.
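One common way to run a 10/50/100 traffic ramp is deterministic bucketing: hash a stable ID into 0–99 and send the request down the new chain when its bucket falls below the current stage. The sketch below assumes a stable per-user or per-request ID is available; the hash is an FNV-1a-style function over UTF-16 code units, and `bucketOf` and `useNewChain` are hypothetical names.

```typescript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash >>> 0;
}

type RampStage = 10 | 50 | 100;

// Maps an ID to a stable bucket in [0, 100).
function bucketOf(id: string): number {
  return fnv1a(id) % 100;
}

// True when this ID should use the new routing chain at the given stage.
// Because the bucket is fixed per ID, anyone included at 10% stays included
// at 50% and 100%.
function useNewChain(id: string, stage: RampStage): boolean {
  return bucketOf(id) < stage;
}
```

Deterministic buckets keep a given user on one chain throughout a stage, which makes quality and cost comparisons between the old and new chains easier to read.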
Why aimargin
Three ways to fix this. Only one is fixed-fee, code-shipped, and refundable.
There's nothing magic here. The 1-week timeline plus production knowledge is what justifies the price, and the refund is what justifies the engagement.
Pricing
Two tiers. Both fixed-fee. Both refundable.
No retainer. No success fees. No 'we'll talk about price after the conversation.' If the audit doesn't make the case for the implementation, the implementation doesn't happen.
Most popular
Audit
fixed · 1 week · 100% refund within 7 days
The full report, the routing layer, and the rollout plan — built for one stack.
- 12-section audit report (PDF + Markdown)
- Drop-in routeByTask() implementation
- 5 typed provider clients (OpenAI, Anthropic, Vertex, Groq, Cerebras)
- 30/60/90-day rollout plan with shadow-mode + traffic ramp
- 30-min delivery walkthrough
- 1 week of email Q&A (no extra charge)
First 100 customers: $2,500 (50% off) with code FOUNDING50.
Optional add-on
Implementation
fixed · 2 weeks · post-audit only
If you'd rather I ship the rollout for you. Discussed at the end of the audit, never before.
- I land the routing in your repo
- Shadow mode + 10/50/100 traffic ramp
- Observability counters wired up (cost, latency, fallback rate)
- Daily-cost Slack digest configured
- 30 days of post-launch tuning
- Final eng-team walkthrough
Pay only if the audit makes the case.
A note from the founder
I learned this the hard way. You don’t have to.
I run a consumer app called Bailar that classifies ~40,000 events a month with text LLMs and runs ~10,000 images a month through vision models. I started on OpenAI-only. The bill was wild.
Routing the right call to the right provider — Groq for plain-text classify, Vertex Gemini Flash for vision, Anthropic Sonnet for brand-visible writing, Cerebras as failover — cut the LLM bill ~60% versus the OpenAI-only baseline. Quality stayed inside a 1% blind-eval delta. It wasn’t magic. It was discipline + a benchmark harness.
The same pattern reclaims 40–60% on most AI startups in the seed-to-Series-A range. The audit takes a week because the analysis is surgical — not because it’s hard, but because each call type needs its own quality vetting before the routing flips on.
I burned a weekend to a GCP free-tier suspension last year. There’s a right way to multi-source — including the boring “don’t do this” rules nobody writes down. I’ll show you both halves.
— Paul Plawin · founder · paul@aimargin.dev · LinkedIn
Read more
Want to dig deeper before you book?
Four pages of detail on the process, the pricing, the case study, and me. No gating, no email-walls.
The process, in detail
Step-by-step view of every audit deliverable, what we ship in week one, and how the implementation engagement is structured.
Full pricing breakdown
Every line item in the audit and implementation engagements, when add-ons make sense, and the founding-100 discount mechanics.
Bailar case study
How the same routing engineering cut my own production AI bill ~60% across 15+ providers without quality regression.
About Paul
Who I am, why I'm doing this, and what kinds of stacks I'm best (and worst) at auditing.
FAQ
Twelve answers to the questions every CTO asks me first.
If your question isn't here, drop a note at the bottom of the page — I read every one within 24 hours.
Get your AI bill audited
Two ways in. Same person on the other end.
Book a 15-minute discovery call directly, or drop a note and I'll pick up the thread within 24 hours. Either way, you're talking to the operator who built it.
Book the discovery call
15 min, free, no commitment. We’ll talk about your stack, where the cost is, and whether the audit is the right fit.