Every AI model. Every task.
Ranked.
The Goldie AI Model Leaderboard ranks every model on every task, with a demo for each. Click into any cell for the live, playable demo: every model gets the same prompt, and every result is on the same page.
The leaderboard
Ranking = Julian's actual 0–10 scores, averaged across every task the model was scored on. Medals = per-task rank (highest score = 🥇 gold, second = 🥈 silver, third = 🥉 bronze). Models without scored verdicts are unranked: their demos are still on the bench, but Julian hasn't put a number on them yet.
Scores are extracted from two of Julian's guides: "GLM-5.2 vs Kimi K2.7 vs Opus 4.8" and "GLM-5.2 vs Qwen 3.7 vs Opus 4.8". See the methodology page for data provenance.
| Rank | Model | Tasks | Medals | Avg score / 10 |
|---|---|---|---|---|
| #1 | Opus 4.8 (Anthropic) | 17 tasks, 13/17 scored | 🥇 8 🥈 5 | 8.46 |
| #2 | GLM-5.2 (Zhipu / Z.ai) | 21 tasks, 13/21 scored | 🥇 6 🥈 4 🥉 3 | 8.23 |
| #3 | Qwen 3.7 (Alibaba) | 5 tasks, 5/5 scored | 🥈 3 🥉 2 | 7.50 |
| #4 | Kimi K2.7 (Moonshot AI) | 23 tasks, 14/23 scored | 🥇 3 🥈 2 🥉 9 | 7.25 |
| — | Grok (xAI) | 13 tasks, 0/13 scored | — | unranked |
| — | Kimi K2.7 · Fast (Moonshot AI) | 3 tasks, 0/3 scored | — | unranked |
| — | Kimi K2.7 · No-Think (Moonshot AI) | 3 tasks, 0/3 scored | — | unranked |
| — | Kimi K2.7 · Quality (Moonshot AI) | 3 tasks, 0/3 scored | — | unranked |
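The ranking rule stated above the table (average of a model's 0–10 scores over the tasks it was scored on, medals by per-task rank, unscored models left unranked) can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the function name `rank_models`, the task and model names, and the scores in the usage example are all made up, and the tie handling (input order wins) is an assumption, since the page does not say how ties are resolved.

```python
from collections import defaultdict
from statistics import mean

MEDALS = ["🥇", "🥈", "🥉"]  # awarded to the top three scores on each task


def rank_models(scores):
    """Rank models from a {task: {model: score or None}} mapping.

    None marks a demo that exists but has no score yet.
    Returns (ranked, unranked, medals).
    """
    per_model = defaultdict(list)                   # model -> its scores
    medals = defaultdict(lambda: defaultdict(int))  # model -> medal -> count
    seen = []                                       # models in first-seen order

    for results in scores.values():
        for model in results:
            if model not in seen:
                seen.append(model)
        scored = [(m, s) for m, s in results.items() if s is not None]
        for model, score in scored:
            per_model[model].append(score)
        # Assumption: ties keep input order (sorted() is stable).
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
        for (model, _), medal in zip(ordered, MEDALS):
            medals[model][medal] += 1

    ranked = sorted(
        ((m, round(mean(per_model[m]), 2)) for m in seen if per_model[m]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    unranked = [m for m in seen if not per_model[m]]
    return ranked, unranked, medals


# Hypothetical scores, for illustration only (not taken from the bench).
demo_scores = {
    "raycaster-maze": {"model-a": 9, "model-b": 8, "model-c": None},
    "fluid-sim": {"model-a": 7, "model-b": 8.5, "model-c": None},
}
ranked, unranked, medals = rank_models(demo_scores)
```

Here `model-b` ranks first on average even though each scored model holds one gold and one silver, which is why the leaderboard shows the average and the medal counts as separate columns.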
The matrix — tasks × models
Each cell is a live, playable demo. Click any thumbnail to run that model's attempt at that task. Empty cells mean we haven't tested that combo yet.
Showing the 20 most-tested tasks. See all 32 tasks →
The models on the bench
Eight AI models tested so far — every frontier release that ships with the kind of capabilities you'd actually wire into an agent stack. Click any card for the full model review, every demo, and direct head-to-head comparisons.
What is Goldie Bench?
Goldie Bench is a one-shot AI model leaderboard. Every model gets the same fixed prompt and returns a single HTML file, which goes live on the page. No iteration. No "best of N." What came out on the first run is what's on the matrix, and I score each result 0–10 on whether it ran, how closely it hit the brief, and how good it looked.
I'm Julian Goldie. I run the AI Profit Boardroom (3,600+ founders) and build the Agent Operating System — a Mac-native dashboard where my AI crew dispatches frontier models from one shared kanban. Every demo on this page was built inside that stack. The bench is a slice of my daily work, not a synthetic benchmark.
Why I built this AI model leaderboard
Standard AI benchmarks — MMLU, HumanEval, SWE-bench, LiveCodeBench — measure narrow things. Did the model output text that matches an expected answer? Did the code compile? Did the patch pass the test? These benchmarks are useful for vendors who want to make a number go up. They have almost nothing to do with whether a frontier model can ship a thing on the first prompt.
The buyer audience for these models — founders, builders, indie devs, agencies, anyone running AI agents day to day — needs a different question answered: can you ask the model for a playable game, a working simulation, a deployable page, and have it ship in one prompt? That's the test that decides whether you'd wire the model into your stack.
Goldie Bench is that test. Same fixed prompt for every model. Single HTML out. Live on the page. Real 0–10 score from me, posted publicly with the reasoning. No vendor pays me. No score gets buried.
How I test — the methodology in 60 seconds
- I pick a creative coding prompt a frontier AI model should be able to ship in one shot — raycaster maze, fluid sim, neon city flythrough, top-down RPG, landing page.
- I dispatch the exact same prompt to each model from the kanban inside the Agent Operating System. No system-prompt cheats, no few-shot tricks.
- I save whatever .html file the model produces on the first run. No iteration. No coaching.
- I score each result 0–10 on three axes: did it run, did it hit the brief, did it look good.
- I publish the scores in the source comparison guides on agentos.guide, and that's the data behind every scored cell on this page.
Full data provenance on the methodology page.
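The dispatch-and-save steps in the list above can be sketched as a small harness. This is an illustration under stated assumptions, not the Agent Operating System's code: `run_one_shot` and the `generate` callables are hypothetical stand-ins for whatever actually calls each model, and the stub models in the usage example just return placeholder HTML.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def run_one_shot(
    prompt: str,
    models: Dict[str, Callable[[str], str]],
    out_dir: Path,
) -> Dict[str, Path]:
    """Send the identical prompt to every model once and save each HTML reply."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for name, generate in models.items():
        html = generate(prompt)          # exactly one call: no retry, no coaching
        path = out_dir / f"{name}.html"  # one file per model, first output only
        path.write_text(html, encoding="utf-8")
        saved[name] = path
    return saved


# Usage with stub models (hypothetical; real ones would call a model API).
received = []


def make_stub(label: str) -> Callable[[str], str]:
    def generate(prompt: str) -> str:
        received.append((label, prompt))  # record what each model was sent
        return f"<!doctype html><title>{label}</title>"
    return generate


stubs = {"model-a": make_stub("model-a"), "model-b": make_stub("model-b")}
out = Path(tempfile.mkdtemp())
files = run_one_shot("Build a raycaster maze in one HTML file.", stubs, out)
```

Keeping the prompt as a single argument shared by every call is what enforces the "same string for every model" rule; scoring stays a separate, manual step.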
What makes Goldie Bench different from other AI model leaderboards
- Live demos, not numbers in a table. Click any cell on the matrix and the model's actual one-shot HTML opens in a new tab. You don't have to trust my score — you can play the result and form your own opinion.
- Real 0–10 scores from a working operator, not auto-evaluated against a reference. I run the models inside the same Agent OS my 3,600+ community uses daily, and I score on what would actually ship.
- Same prompt across every model — no per-vendor handicapping. Every model sees the exact same string.
- Publicly sourced data. Every score traces back to a published comparison guide on agentos.guide. Models without any scored results are honestly marked as unranked.
- Updated daily-ish. When a new model ships, I run the same prompts through it on camera (you can watch on YouTube, 400k+ subscribers) and the bench updates.
FAQ — about the bench
What's the best AI model right now?
Across 45 scored cells from my source guides, Opus 4.8 tops the leaderboard at 8.46/10 average — most consistent across game-feel, raycasters, and one-shot reasoning. GLM-5.2 is the open-weights challenger at 8.23/10 with the prettiest visual builds. Pick by use case — see Opus 4.8 vs GLM-5.2 for the head-to-head.
Is Goldie Bench the same as the LMArena, Hugging Face, or Chatbot Arena leaderboard?
No. Those rank models on human-preference Elo votes or quiz-style benchmarks. Goldie Bench ranks on a single fixed creative-coding task per cell — same prompt, one shot, scored 0–10 by me. Different methodology, different answer.
How often do new models get added?
Whenever a new frontier release lands. I usually wire it into the Agent OS the same week, run the fixed prompt set through it on camera, and the bench updates. Recent additions: GLM-5.2, Kimi K2.7, Qwen 3.7.
Can I add a model or task?
Ideas land best inside the AI Profit Boardroom community. Members propose tasks, I test, scores get added to the bench.
Are these scores affiliate-driven or vendor-sponsored?
No. No vendor pays me. No score is gated. The Agent OS that runs these tests is built and maintained by me; the community that powers the workflow lives inside AIPB. That's the only commercial layer on this site.
Run this stack yourself.
Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.