How the leaderboard works

The methodology

1. Same prompt for every model

Each task in the matrix is a single, fixed prompt, such as "build a raycaster maze you can walk through", "make a fluid simulation", or "ship a top-down RPG". Every model gets the exact same string. We don't help them. We don't iterate. We don't pick the best of N tries. One shot: what comes out is what's on the board.

2. One self-contained HTML file

Every result has to be a single, self-contained .html file that runs in a browser. No build steps, no servers, no external dependencies beyond CDN-loaded libraries. If you can't drop it into your browser and have it work, it doesn't count.
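As a rough illustration of the single-file rule, here is a minimal Python sketch that flags resources a page loads from relative or local paths, which would break when the file is dropped into a browser on its own. This is not the bench's actual validation; the function name, the set of tags checked, and the idea that any absolute URL counts as "CDN-loaded" are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags whose src/href actually loads a resource (assumed list, not exhaustive).
RESOURCE_TAGS = {"script", "link", "img", "audio", "video", "source", "iframe"}

# Prefixes treated as acceptable: absolute URLs (assumed to be CDNs) and inline data.
ALLOWED_PREFIXES = ("https://", "http://", "//", "data:", "blob:", "#")


class ResourceCollector(HTMLParser):
    """Collect every URL referenced by a resource-loading tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def local_dependencies(html_text):
    """Return referenced URLs that point at local files rather than a CDN."""
    collector = ResourceCollector()
    collector.feed(html_text)
    return [url for url in collector.urls if not url.startswith(ALLOWED_PREFIXES)]


# Hypothetical page: one CDN script (fine) and one local image (a problem).
page = '<script src="https://cdn.example.com/lib.js"></script><img src="sprite.png">'
print(local_dependencies(page))  # ['sprite.png']
```

An empty list means the file has no local dependencies under these assumptions; it does not prove the demo actually runs.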

3. Where the scores come from — data provenance

Every score on this leaderboard is Julian's own 0–10 number from his head-to-head guides. Nothing is fabricated, nothing is inferred. The extractor parses the actual scoreboard blocks out of these source guides:

  • GLM-5.2 vs Kimi K2.7 vs Opus 4.8 — 14 scored tasks with Julian's "honest scoring" out of 10, averaged. Overall: Opus 8.4, GLM 8.3, Kimi 7.3.
  • GLM-5.2 vs Qwen 3.7 vs Opus 4.8 — 5 scored tasks adding Qwen 3.7 to the bench.
  • Kimi K2.7 modes head-to-head — Julian explicitly says "no scores from me" for the three Kimi modes (fast/no-think/quality), so those modes are unranked on this bench. The demos are still here; the medals aren't.
  • Three Dragons™ — qualitative writeup, no per-task 0–10 scores, so it informs the model taglines but not the medals.

45 of 88 cells have curated verdicts from these sources. Cells without a verdict are still on the bench — Julian hasn't scored them yet. You can play any demo and form your own opinion.

4. Medals — derived from Julian's scores

Per task, the highest 0–10 score gets 🥇 gold, second-best 🥈 silver, third 🥉 bronze. Ties at the top share gold (matching Julian's own "tie · top" labels in glm-vs-qwen-vs-opus).
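The medal rule can be sketched in a few lines of Python. The text only states that ties at the top share gold; this sketch assumes that ties lower down share a medal too, and that the next distinct score takes the next medal (so two golds are followed by a silver, not a bronze). The function name and the example scores are made up for illustration and are not taken from the leaderboard.

```python
MEDALS = ["gold", "silver", "bronze"]


def award_medals(task_scores):
    """Map each scored model to a medal name, or None, for a single task.

    task_scores: dict of model name -> 0-10 score for that task.
    """
    # Distinct scores, highest first; models with equal scores share a medal.
    distinct = sorted(set(task_scores.values()), reverse=True)
    # zip stops after three entries, so only the top three scores get medals.
    medal_for_score = dict(zip(distinct, MEDALS))
    return {model: medal_for_score.get(score) for model, score in task_scores.items()}


# Hypothetical scores: the top two tie and share gold.
print(award_medals({"model_a": 9, "model_b": 9, "model_c": 7}))
# {'model_a': 'gold', 'model_b': 'gold', 'model_c': 'silver'}
```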

Leaderboard ranking = average 0–10 score across all tasks the model was scored on. Models with zero scored tasks are listed as unranked (Grok, the three Kimi modes) — their demos count for coverage but not for ranking.
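To make the ranking rule concrete, here is a small Python sketch: average each model's scores over the tasks it was actually scored on, sort by that average, and set aside models with no scores as unranked. The function name, the data layout, and the example numbers are illustrative assumptions, not the leaderboard's real code or data.

```python
def rank_models(scores_by_model):
    """Split models into a ranked list (by average score) and an unranked list.

    scores_by_model: dict of model name -> list of 0-10 scores, one per scored task.
    """
    ranked, unranked = [], []
    for model, scores in scores_by_model.items():
        if scores:
            # Average only over tasks that have a score; unscored tasks are skipped.
            ranked.append((model, sum(scores) / len(scores)))
        else:
            # Zero scored tasks: the model appears on the bench but is not ranked.
            unranked.append(model)
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked, unranked


# Hypothetical scores for three placeholder models.
ranked, unranked = rank_models({"model_a": [7, 8], "model_b": [8, 9], "model_c": []})
print(ranked)    # [('model_b', 8.5), ('model_a', 7.5)]
print(unranked)  # ['model_c']
```

Because each average covers only the tasks a model was scored on, two models can be ranked against each other on different task subsets.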

5. The Agent OS context

Every demo was built inside Julian's Agent OS — one prompt, one shot, single HTML file out. The bench is a slice of his daily work, not a synthetic benchmark. When he tests a new model, he runs the same fixed prompts through Agent OS and saves the resulting HTML. Those files become the cells you see on this page.

6. Why this isn't HumanEval / MMLU / SWE-bench

Standard benchmarks measure narrow things: "can you write code that passes a test", "can you answer a multiple-choice question". They're useful for vendors. They're not how you actually use these models.

Goldie Bench measures what happens when you ask a frontier model to ship a thing (a playable game, a working simulation, a deployable page) in one prompt. That's the test most buyers care about. It's also the one most leaderboards skip.

7. New models, new tasks

The data set grows. New models get the same prompts, retroactively. New tasks join when someone in the AIPB community asks "can the AI do X in one shot?" and the answer is interesting either way.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$100k+/mo community MRR