First 100 audits at $2,500 — founding pricing

Cut your AI bill 40–60%
in two weeks.

A fixed-fee audit of your LLM provider routing. Per-call-type cost breakdown, fallback-chain code, and a 30/60/90-day rollout plan — built by an operator who runs the same stack across 15+ providers in production.

Book a 15-min discovery call See what you get

$5,000 fixed-fee1 week turnaround100% refund within 7 daysMutual NDANo vendor kickbacks

Provider coverage

We route across 15+ AI providers in production.

The right one for each task type — text, vision, grounded search, brand voice, voice, video — is what drives the savings. Here’s the working set we benchmark and ship against.

AnthropicBrand-visible writing
OpenAIVision · image gen
GoogleVertex Gemini · grounded search
GroqFast text classify
CerebrasBulk text inference
DeepSeekCheap text escalation
MistralFree-tier fallback
Together AILlama hosting
SambaNovaFree Llama fallback
CohereReranking · embeddings
FireworksOpen-model hosting
ReplicateImage · video models
ElevenLabsVoice TTS
fal.aiVeo · Hailuo · Runway
PerplexitySearch-grounded answers

Plain text: Groq → Cerebras → DeepSeek → Anthropic. Vision: Vertex Gemini Flash. Grounded search: Vertex with Google Search citations. Brand-visible writing: Anthropic Claude Sonnet. Voice: ElevenLabs. That’s the kind of routing your audit ships with.

15+

AI providers in rotation

1.4M

LLM calls / month routed

60%

AI bill reduction (own stack)

<1%

Quality delta vs OpenAI baseline

These are my own production numbers from Bailar, not benchmark theatre. The audit ports the same engineering to your stack.

Where the money leaks

Nine ways your AI bill is bigger than it has to be.

If three or more of these sound familiar, your stack has 30–55% of waste sitting on the table. The audit isolates which ones are yours and what swap fixes each.

01
Single-provider lock-in
You're paying premium for a model that's overkill for 80% of your traffic. There's a Groq, Cerebras, or DeepSeek call that does the same job for 1/10 the cost.
02
Cost opacity
Your invoice shows one number. You can't tell which feature, which prompt, or which user cohort is the budget hog — so you can't fix it.
03
Quality fear
You'd switch providers if you weren't terrified of waking up to a quality regression. Without shadow-test infrastructure, every swap is a coin flip.
04
Vision tax
Sending an image to GPT-4o costs 5–10× what the same prompt costs on Gemini Flash. Most teams don't realize vision pricing is a separate world.
05
Spec stuffing
You're running Claude Sonnet on a binary classification call when Cerebras Llama-3.3-70B does it for $0.0001 per call. The model isn't the bottleneck — the routing is.
06
Caching & batch lying dormant
Anthropic prompt caching cuts cost 90% on repeated context. OpenAI Batch API cuts cost 50% on async work. Both unused on most stacks I audit.
07
Retry storms
A bad rate-limit handler re-issues paid calls after the cheap fallback already returned. Two providers, both billed, no quality gain.
08
Vendor risk
One provider price-hike, suspension, or outage takes you offline. Multi-source routing is durability insurance, not just a savings play.
09
No observability
If you can't diff cost-per-task and quality-per-task between provider chains, you can't pick the right one. Most teams ship blind.

The flow

Same workload. Two routing chains. Sixty percent off.

Most stacks send every call to one provider. The audit returns a routing chain that pairs each task type with the cheapest model that holds quality. Here's the before-and-after I draw on the whiteboard.

Today — single-provider stack

One bill. One vendor. One unpriced model on every call.

Your stack

40K events · 10K images / mo

One destination

OpenAI GPT-4o

Text · Vision · Grounded · Brand voice — everything

Monthly bill$42,000

80% of traffic is binary classification — massively over-spec’d model.
Vision tax: every image at 5–10× cheaper alternatives.
No fallback — one outage and the product goes down.

After audit — cost-aware routing

Same workload. Right model per task. Bill collapses.

Your stack

Unchanged

routeByTask(call)

TypeScript router — we ship this in your repo

Groq
Plain text classify
$0.0001 / call
Cerebras
Bulk text retry
$0.0001 / call
Vertex Gemini
Vision · grounded
$0.0008 / img
Anthropic Sonnet
Brand-visible writing
$0.015 / call

Monthly bill$17,000-60%

Quality vetted via shadow tests; deltas inside 1% before any swap goes live.
Automatic failover when a provider degrades or rate-limits.
Per-task cost telemetry baked in — no more invoice guessing.

These are illustrative numbers. Your actual chain depends on your traffic mix, compliance constraints, and quality bar — that’s what week one of the audit produces.

See how the audit produces this

Quality vs cost

Pareto frontier — score per dollar, by call type

Live snapshot of arena.ai's blind side-by-side rankings, plotted against each model's blended $ /M-token cost (70% input / 30% output). Models on the green frontier are not dominated by anything cheaper — spend above the frontier is leaving money on the table. Bailar's in-production picks per call type are pinned in red.

Cheapest model at every quality tier (Text)

Model	Provider	Score	$/M in	$/M out	Blended	Context
claude-opus-4-7-thinking	Anthropic	1503	$5.00	$25.00	$11.00	1M
gemini-3.1-pro-preview	Google	1493	$2.00	$12.00	$5.00	1M
grok-4.20-beta-0309-reasoning	xAI	1477	$2.00	$6.00	$3.20	2M
gemini-3-flash	Google	1474	$0.50	$3.00	$1.25	1M
deepseek-v4-pro	DeepSeek	1463	$0.43	$0.87	$0.56	1M
gemma-4-31b	Google	1451	$0.14	$0.40	$0.22	262.1K
deepseek-v4-flash-thinking	DeepSeek	1439	$0.14	$0.28	$0.18	1M
mimo-v2-flash (non-thinking)	Xiaomi	1392	$0.09	$0.29	$0.15	262.1K
gemma-3-27b-it	Google	1366	$0.08	$0.16	$0.10	131.1K
gpt-oss-120b	OpenAI	1353	$0.04	$0.18	$0.08	131.1K
gemma-3-12b-it	Google	1342	$0.04	$0.13	$0.07	131.1K
gpt-oss-20b	OpenAI	1317	$0.03	$0.14	$0.06	131.1K

Source: arena.ai/leaderboard/text · scraped 5/6/2026 · models, with public pricing. Refresh: npx tsx tools/aimargin/scrape-arena.ts

What you get

Four artifacts that pay for the engagement on day one.

No deck-only deliverables. No 'recommendations' that take a quarter to ship. Code, plan, and walkthrough — all in seven days.

Deliverable 01
12-section audit report
Per-call-type cost breakdown (text, vision, grounded search, brand voice, embeddings) with specific routing recommendations and dollar-savings projections per task type. Delivered as PDF + Markdown; yours forever.
Deliverable 02
Production-ready TypeScript
A drop-in routeByTask() entry point + 5 typed provider clients (OpenAI, Anthropic, Vertex, Groq, Cerebras) with retry, jittered backoff, timeout, graceful fallback, and observability hooks. Pasted into your repo.
Deliverable 03
30/60/90-day rollout plan
Shadow-mode plan, 10/50/100 traffic ramp, telemetry hooks for cost + latency + fallback-rate, and the daily-cost Slack digest you can wire up day one.
Deliverable 04
Delivery walkthrough + week of Q&A
30-min call where I walk your team through the report, the code, and the open questions. After delivery, a full week of email Q&A is included at no extra charge.

How it works

Four steps. One week. Zero surprises.

The audit is structured because the analysis has to be — every routing decision needs its own quality check before flipping. Here's exactly how that breaks down.

Step 01 · Day 0 · 15 min
Discovery call
We confirm fit on a short call. I share my audit checklist, you share where the money is going. If it doesn't make sense, you walk away — no commitment, no fee.
Step 02 · Days 1–4 · async
Access + analysis
Read-only API access (or CSVs). I map your call patterns, group by task type, and benchmark each against the routing alternatives. You get a daily Slack note on what I'm finding.
Step 03 · Day 5–6 · delivery
Recommendations + code
Audit report + drop-in TypeScript routing layer + 30/60/90 plan land in your repo. We do a 30-min walkthrough; I stay on email Q&A for the following week.
Step 04 · Optional · 2 weeks
Implementation (if you want it)
If you'd rather I ship it: 2-week implementation engagement. Shadow tests, traffic ramp, observability counters wired up, 30 days of post-launch tuning, final eng-team walkthrough.

Start with a discovery call Or send a quick note instead

Why aimargin

Three ways to fix this. Only one is fixed-fee, code-shipped, and refundable.

There's nothing magic here. The 1-week timeline + production knowledge is the only thing that justifies the price — and the refund is the only thing that justifies the engagement.

DIY

Generalist consultant

aimargin

Provider-specific knowledge

Self-taught

Broad, shallow

Production operator

Time investment

80+ hours of yours

4-week engagement

1 week

Code shipped

TS routing layer + 5 clients

Quality safeguards

Manual / hopeful

Manual

Shadow tests + diff harness

Refund

100% within 7 days

Vendor independence

Often biased

No kickbacks, ever

Price

(your time)

$20–50K

$5K · or $2.5K (founding 100)

Pricing

Two tiers. Both fixed-fee. Both refundable.

No retainer. No success fees. No 'we'll talk about price after the conversation.' If the audit doesn't make the case for the implementation, the implementation doesn't happen.

Audit

$5,000

fixed · 1 week · 100% refund within 7 days

The full report, the routing layer, and the rollout plan — built for one stack.

12-section audit report (PDF + Markdown)
Drop-in routeByTask() implementation
5 typed provider clients (OpenAI, Anthropic, Vertex, Groq, Cerebras)
30/60/90-day rollout plan with shadow-mode + traffic ramp
30-min delivery walkthrough
1 week of email Q&A (no extra charge)

Book discovery call

First 100 customers: $2,500 (50% off) with code FOUNDING50.

Optional add-on

Implementation

$15,000

fixed · 2 weeks · post-audit only

If you'd rather I ship the rollout for you. Discussed at the end of the audit, never before.

I land the routing in your repo
Shadow mode + 10/50/100 traffic ramp
Observability counters wired up (cost, latency, fallback rate)
Daily-cost Slack digest configured
30 days of post-launch tuning
Final eng-team walkthrough

Discuss after audit

Pay only if the audit makes the case.

A note from the founder

I learned this the hard way. You don’t have to.

I run a consumer app called Bailar that classifies ~40,000 events a month with text LLMs and runs ~10,000 images a month through vision models. I started on OpenAI-only. The bill was wild.

Routing the right call to the right provider — Groq for plain-text classify, Vertex Gemini Flash for vision, Anthropic Sonnet for brand-visible writing, Cerebras as failover — cut the LLM bill ~60% versus the OpenAI-only baseline. Quality stayed inside a 1% blind-eval delta. It wasn’t magic. It was discipline + a benchmark harness.

The same pattern reclaims 40–60% on most AI startups in the seed-to-Series-A range. The audit takes a week because the analysis is surgical — not because it’s hard, but because each call type needs its own quality vetting before the routing flips on.

I burned a weekend to a GCP free-tier suspension last year. There’s a right way to multi-source — including the boring “don’t do this” rules nobody writes down. I’ll show you both halves.

— Paul Plawin · founder · paul@aimargin.dev · LinkedIn

Want to dig deeper before you book?

Four pages of detail on the process, the pricing, the case study, and me. No gating, no email-walls.

FAQ

Twelve answers to the questions every CTO asks me first.

If your question isn't here, drop a note at the bottom of the page — I read every one within 24 hours.

Get your AI bill audited

Two ways in. Same person on the other end.

Book a 15-minute discovery call directly, or drop a note and I'll pick up the thread within 24 hours. Either way, you're talking to the operator who built it.

Book the discovery call

15 min, free, no commitment. We’ll talk about your stack, where the cost is, and whether the audit is the right fit.

Cut your AI bill 40–60%in two weeks.

We route across 15+ AI providers in production.

Nine ways your AI bill is bigger than it has to be.

Single-provider lock-in

Cost opacity

Quality fear

Vision tax

Spec stuffing

Caching & batch lying dormant

Retry storms

Vendor risk

No observability

Same workload. Two routing chains. Sixty percent off.

One bill. One vendor. One unpriced model on every call.

Same workload. Right model per task. Bill collapses.

Pareto frontier — score per dollar, by call type

Cheapest model at every quality tier (Text)

Four artifacts that pay for the engagement on day one.

12-section audit report

Production-ready TypeScript

30/60/90-day rollout plan

Delivery walkthrough + week of Q&A

Four steps. One week. Zero surprises.

Discovery call

Access + analysis

Recommendations + code

Implementation (if you want it)

Three ways to fix this. Only one is fixed-fee, code-shipped, and refundable.

Two tiers. Both fixed-fee. Both refundable.

Audit

Implementation

I learned this the hard way. You don’t have to.

Want to dig deeper before you book?

The process, in detail

Full pricing breakdown

Bailar case study

About Paul

Twelve answers to the questions every CTO asks me first.

Two ways in. Same person on the other end.

Book the discovery call

Not ready to book? Send a quick note.

Cut your AI bill 40–60%
in two weeks.