Arena Became a $100M Company, But There's No Tool to Build Evals for Your Own Product

The Problem

LMArena began as a site that put two model outputs side by side and let people vote for the better one. That site became a company that raised $100M, and its leaderboard is the first scoreboard OpenAI, Google, and Anthropic check whenever they ship a new model. The catch: that ranking reflects “the average user asking average questions.” It does not tell you which model answers best inside your product, on your data, for the questions your users actually send. So product teams paste a handful of outputs into a spreadsheet and compare by eye every time they swap models. A $100M business sells the ranking; the space underneath it, a tool for building an eval that fits your own product, sits empty.

Why Now

Model churn has gone from quarterly to weekly. GPT, Claude, and Gemini ship new versions weeks apart, and price and latency move with them. A prompt that worked yesterday quietly breaks on the new model, and without a regression test to catch it, you find out from user complaints after shipping. The pressure runs the other way too: teams want to move the same job onto a smaller, cheaper model to survive the compute bill, which means proving the swap holds. Models got cheap and plentiful, but answering “which one fits my job?” still depends on people reviewing outputs by hand. Evaluation is the bottleneck now.

How to Build It

Three pieces.

First, build the eval set from real product traffic. Sample representative cases from production logs, strip sensitive data, attach rubrics and ground truth, and freeze a golden set. Real user questions form the spine, not made-up examples.
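
As a rough illustration of that first piece, here is a minimal Python sketch. It assumes logs arrive as a list of dicts with a `prompt` key, and the names `redact` and `build_golden_set` are hypothetical. The two regexes are only a stand-in for real sensitive-data scrubbing, which would need far more than email and phone patterns.

```python
import hashlib
import json
import random
import re

# Illustrative patterns only; real PII scrubbing needs much broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask emails and phone-like digit runs."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def build_golden_set(logs, k, seed=0):
    """Sample up to k distinct prompts, redact them, and freeze the result.

    `logs` is assumed to be a list of dicts with a "prompt" key.
    """
    unique = sorted({entry["prompt"] for entry in logs})  # dedupe, stable order
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = rng.sample(unique, min(k, len(unique)))
    cases = [
        # rubric and ground_truth are left empty for human reviewers to fill in
        {"id": i, "prompt": redact(p), "rubric": None, "ground_truth": None}
        for i, p in enumerate(sampled)
    ]
    # A content hash acts as the version of the frozen set.
    blob = json.dumps(cases, sort_keys=True).encode("utf-8")
    return {"version": hashlib.sha256(blob).hexdigest()[:12], "cases": cases}
```

The content hash is one possible way to "freeze" the set: if any case changes, the version string changes, so two eval runs can be compared only when their versions match.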

Second, automate scoring but calibrate it. Score responses with an LLM judge, then tune the judge against a small set of human labels until the two agree, and make it a rule that any score on which judge and humans diverge is treated as untrustworthy.
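
One way to make "agree" measurable is a chance-corrected agreement statistic such as Cohen's kappa, computed over the cases that have both a human label and a judge label. The sketch below is an assumption about how the check could look, not a prescribed method: `calibration_report` is a hypothetical helper, and the `min_kappa=0.6` cutoff is an arbitrary illustrative threshold.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def calibration_report(human, judge, min_kappa=0.6):
    """Summarize judge-vs-human agreement; min_kappa is an assumed cutoff."""
    kappa = cohen_kappa(human, judge)
    # Indices where the judge and the human label differ: scores to distrust.
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    return {
        "kappa": kappa,
        "disagreements": disagreements,
        "trusted": kappa >= min_kappa,
    }
```

For example, with human labels `["pass", "pass", "fail", "fail"]` and judge labels `["pass", "pass", "fail", "pass"]`, observed agreement is 0.75 and expected agreement is 0.5, so kappa is 0.5. The report flags case 3 as a disagreement and marks the judge as not yet trusted under the assumed cutoff.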

Third, make regression continuous. Every time you touch model, prompt, or temperature, run the same golden set and surface the score delta. You see what improves and what breaks before you ship.
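
The score delta can be as simple as a per-case comparison between two runs on the same golden set. The following is a minimal sketch under stated assumptions: each run is a dict mapping case ID to a score in [0, 1], `score_delta` is a hypothetical name, and the `tolerance` of 0.05 is an illustrative noise band rather than a recommended value.

```python
from statistics import mean


def score_delta(baseline, candidate, tolerance=0.05):
    """Compare two runs over the same golden set.

    Both arguments are assumed to map case_id -> score in [0, 1].
    """
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("runs must cover the same non-empty golden set")
    deltas = {cid: candidate[cid] - baseline[cid] for cid in baseline}
    return {
        "mean_delta": mean(deltas.values()),
        # Cases that moved by more than the tolerance in either direction.
        "regressed": sorted(c for c, d in deltas.items() if d < -tolerance),
        "improved": sorted(c for c, d in deltas.items() if d > tolerance),
    }
```

Reporting regressed cases separately from the mean matters because an unchanged average can hide a case that broke: a run where one case gains 0.5 and another loses 0.5 has a mean delta of zero but still contains a regression worth reviewing before shipping.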

quadrantChart
  title Eval tooling landscape
  x-axis Generic --> Product-specific
  y-axis Manual --> Automated
  quadrant-1 Open gap
  quadrant-2 Public leaderboards
  quadrant-3 One-off benchmarks
  quadrant-4 Spreadsheet checks
  Arena: [0.2, 0.85]
  Internal spreadsheets: [0.8, 0.2]
  Your product eval: [0.85, 0.85]

The wedge is teams that just put a model into production, especially ones that swap often. Sell “connect once, and every model after this gets compared automatically” to teams already worn out by spreadsheet comparisons. Pricing scales with eval-run volume, with regression monitoring layered on top.

Success Criteria

Three things decide it. First, judge trust: the verdict “model A is better for our domain” has to match human judgment, or teams won’t base decisions on the number, and a judge that hands out junk scores gets the whole tool thrown out. Second, integration weight: connecting one log source should be enough to stand up an eval set; any more friction and adoption stalls. Third, domain depth: what separates this from a generic benchmark is your product’s context. The better it captures the grain a public leaderboard can’t (legal, medical, a specific language), the more it becomes the first place teams go for a product-fit eval.