StartupXO

Dev Tools & Infra

Arena Became a $100M Company, But There's No Tool to Build Evals for Your Own Product

Published: 2026-06-30

Model Evals, LLM Evaluation, Leaderboards, LLM-as-Judge, Eval Infrastructure

The Problem

Public leaderboards rank models by average performance across generic prompts. Founders need to know which model and prompt work best for the questions their users actually ask inside their product, and the two rankings often disagree: the Arena's #1 can be your domain's #3. With no tool to build product-fit evals, most teams paste a few outputs into a spreadsheet and pick by gut, then redo the work from scratch every time they swap models.

Why Now

Model releases have accelerated to a weekly cadence, and pressure to run the same task on a smaller, cheaper model keeps rising, so more teams must answer 'which one?' with data, not vibes. The public-leaderboard business raised $100M, but the product-level eval layer beneath it is empty. Tooling that bundles eval-set construction, LLM-judge calibration, and regression tracking sits where demand is driven by both cost pressure and regulation (verification requirements for high-risk AI).

Recommended Talent

A data engineer who can mine production logs into eval sets, an ML engineer who calibrates an LLM judge against human labels, and a product engineer who builds the regression dashboard and diff UX. Add an applied scientist to make the methodology trustworthy and a developer-facing seller who can land teams that just shipped a model to production.

The Problem

LMArena began as a site that put two model outputs side by side and let people vote for the better one. That site became a company that raised $100M, and its leaderboard is the first scoreboard OpenAI, Google, and Anthropic check whenever they ship a new model. The catch: that ranking reflects “the average user asking average questions.” Which model answers best inside your product, on your data, for the questions your users actually send: none of that is on the board. So product teams paste a handful of outputs into a spreadsheet and compare by eye every time they swap models. A $100M business sells the ranking; the space beneath it, where a tool for building an eval that fits your own product should be, sits empty.

Why Now

Model churn has gone from quarterly to weekly. GPT, Claude, and Gemini ship new versions weeks apart, and price and latency move with them. A prompt that worked yesterday quietly breaks on the new model, and without a regression test to catch it, you find out from user complaints after shipping. The pressure runs the other way too: teams want to move the same job onto a smaller, cheaper model to survive the compute bill, which means proving the swap holds. Models got cheap and plentiful; answering “which one fits my job?” still depends on manual, human effort, and that cost has not come down. Evaluation is the bottleneck now.

How to Build It

Three pieces.

First, build the eval set from real product traffic. Sample representative cases from production logs, strip sensitive data, attach rubrics and ground truth, and freeze a golden set. Real user questions form the spine, not made-up examples.
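The first step can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the log schema (`category` and `question` fields), the function names `redact` and `build_golden_set`, and the two regexes are all assumptions made for the example. Real redaction needs far more than an email and phone pattern, and rubrics and ground truth are left as empty slots for human reviewers to fill.

```python
import hashlib
import json
import random
import re
from collections import defaultdict

# Crude placeholder patterns (assumption): real PII stripping needs much more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def build_golden_set(logs, per_category, seed=0):
    """Sample up to `per_category` questions per category, redact, and freeze.

    `logs` is a list of dicts with hypothetical keys "category" and "question".
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for row in logs:
        by_category[row["category"]].append(row)

    cases = []
    for category in sorted(by_category):
        rows = by_category[category]
        for row in rng.sample(rows, min(per_category, len(rows))):
            cases.append({
                "category": category,
                "question": redact(row["question"]),
                "rubric": None,     # to be written by a human reviewer
                "reference": None,  # ground-truth answer, also added by hand
            })

    # A content hash acts as the frozen version: any edit changes the id.
    payload = json.dumps(cases, sort_keys=True).encode("utf-8")
    return {"version": hashlib.sha256(payload).hexdigest()[:12], "cases": cases}
```

Sampling per category rather than uniformly keeps rare but important question types from vanishing, and the version hash makes it visible when two eval runs did not use the same frozen set.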

Second, automate scoring but calibrate it. Score responses with an LLM judge, then tune the judge against a small set of human labels until the two agree, and build in the rule that any score on which judge and humans diverge is treated as untrustworthy.
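One way to make "until they agree" measurable is a chance-corrected agreement score between judge and human labels. The sketch below uses Cohen's kappa, which is a choice made here for illustration and not a method the article prescribes. The function names and the 0.8 threshold are assumptions, and the judge labels are taken as already produced by whatever LLM call a team uses.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(case_ids, human, judge):
    """Case ids where the judge and the human label differ."""
    return [cid for cid, h, j in zip(case_ids, human, judge) if h != j]


def judge_is_trusted(kappa, threshold=0.8):
    """Gate on agreement; the 0.8 default is an illustrative assumption."""
    return kappa >= threshold
```

The list returned by `disagreements` is the useful part in practice: those cases are where the judge prompt or rubric needs another pass, and where scores should not yet drive a decision.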

Third, make regression continuous. Every time you touch model, prompt, or temperature, run the same golden set and surface the score delta. You see what improves and what breaks before you ship.
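A regression check of this kind might look like the sketch below. It assumes each run has already been scored into a dict mapping case id to a numeric score; the function name `score_delta` and the `tolerance` parameter are hypothetical.

```python
def score_delta(baseline, candidate, tolerance=0.0):
    """Compare two runs over the same golden set.

    `baseline` and `candidate` map case id -> score (higher is better).
    Returns the mean score change plus the cases that moved either way.
    """
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty golden set")

    improved, regressed = [], []
    for case_id in sorted(baseline):
        diff = candidate[case_id] - baseline[case_id]
        if diff > tolerance:
            improved.append(case_id)
        elif diff < -tolerance:
            regressed.append(case_id)

    mean_delta = (sum(candidate.values()) - sum(baseline.values())) / len(baseline)
    return {"mean_delta": mean_delta, "improved": improved, "regressed": regressed}
```

Reporting per-case movement alongside the mean matters because an unchanged average can hide one case getting better while another breaks, which is exactly the kind of regression that otherwise surfaces as a user complaint.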

quadrantChart
  title Eval tooling landscape
  x-axis Generic --> Product-specific
  y-axis Manual --> Automated
  quadrant-1 Open gap
  quadrant-2 Public leaderboards
  quadrant-3 One-off benchmarks
  quadrant-4 Spreadsheet checks
  Arena: [0.2, 0.85]
  Internal spreadsheets: [0.8, 0.2]
  Your product eval: [0.85, 0.85]

The wedge is teams that just put a model into production, especially ones that swap often. Sell “connect once, and every model after this gets compared automatically” to teams already worn out by spreadsheet comparisons. Pricing scales with eval-run volume, with regression monitoring sold as an add-on.

Success Criteria

Three things decide it. First, judge trust: the verdict “model A is better for our domain” has to match human judgment, or teams won’t base decisions on the number, and a judge that hands out junk scores gets the whole tool thrown out. Second, integration weight: connecting one log source should be enough to stand up an eval set; any more friction and adoption stalls. Third, domain depth: what separates this from a generic benchmark is your product’s context. The better it captures the detail a public leaderboard can’t (legal, medical, a specific language), the more it becomes the first place teams go for a product-fit eval.
