An open model leaderboard

Every AI model. Every task.
Ranked.

The Goldie AI Model Leaderboard — every model, every task, every demo, ranked. Click into any cell for the live, playable demo — every model gets the same prompt, every result is on the same page.

88 Live demos
32 Tasks
8 Models
45 Curated verdicts

The leaderboard

Ranking = Julian's actual 0–10 scores, averaged across every task the model has a scored verdict for. Medals = per-task rank (highest = 🥇 gold, second = 🥈 silver, third = 🥉 bronze). Models without scored verdicts are unranked: their demos are still on the bench, but Julian hasn't put a number on them yet.

Scores extracted from Julian's "GLM-5.2 vs Kimi K2.7 vs Opus 4.8" and "GLM-5.2 vs Qwen 3.7 vs Opus 4.8" guides. See methodology for data provenance.
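The ranking and medal rules above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the model names and scores are made up, and the tie handling (tied models share a medal, and the next medal is skipped) is an assumption inferred from matrix rows where two golds are followed by a bronze.

```python
from collections import defaultdict

# Hypothetical scores for illustration only; these are NOT the real verdicts.
scores = {
    "Raycaster": {"Model A": 8.0, "Model B": 7.0, "Model C": 9.0},
    "Landing": {"Model A": 9.0, "Model B": 9.0, "Model C": 7.5},
    "Fluid": {"Model A": 8.5, "Model B": 9.0},  # Model C has no verdict here
}

MEDALS = {1: "🥇", 2: "🥈", 3: "🥉"}


def average_scores(scores):
    """Average each model's 0-10 scores over the tasks it was scored on."""
    per_model = defaultdict(list)
    for task_scores in scores.values():
        for model, score in task_scores.items():
            per_model[model].append(score)
    return {model: sum(vals) / len(vals) for model, vals in per_model.items()}


def task_medals(task_scores):
    """Per-task medals; rank = 1 + number of strictly higher scores (assumed)."""
    medals = {}
    for model, score in task_scores.items():
        rank = 1 + sum(1 for other in task_scores.values() if other > score)
        if rank in MEDALS:
            medals[model] = MEDALS[rank]
    return medals


# Leaderboard order: highest average first.
leaderboard = sorted(average_scores(scores).items(), key=lambda kv: kv[1], reverse=True)

for position, (model, avg) in enumerate(leaderboard, start=1):
    print(f"#{position} {model} {avg:.2f}")
for task, task_scores in scores.items():
    print(task, task_medals(task_scores))
```

Note that a model scored on fewer tasks is averaged over fewer numbers, so averages for models with very different task counts are not strictly like-for-like.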

| Rank | Model | Tasks | Medals | Avg score / 10 |
|---|---|---|---|---|
| #1 | Opus 4.8 (Anthropic) | 17 tasks, 13/17 scored | 🥇 8 🥈 5 | 8.46 |
| #2 | GLM-5.2 (Zhipu / Z.ai) | 21 tasks, 13/21 scored | 🥇 6 🥈 4 🥉 3 | 8.23 |
| #3 | Qwen 3.7 (Alibaba) | 5 tasks, 5/5 scored | 🥈 3 🥉 2 | 7.50 |
| #4 | Kimi K2.7 (Moonshot AI) | 23 tasks, 14/23 scored | 🥇 3 🥈 2 🥉 9 | 7.25 |
| | Grok (xAI) | 13 tasks, 0/13 scored | | unranked |
| | Kimi K2.7 · Fast (Moonshot AI) | 3 tasks, 0/3 scored | | unranked |
| | Kimi K2.7 · No-Think (Moonshot AI) | 3 tasks, 0/3 scored | | unranked |
| | Kimi K2.7 · Quality (Moonshot AI) | 3 tasks, 0/3 scored | | unranked |
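The headline counters appear to be column totals of the leaderboard rows. The page does not state this, so treat the following as a quick consistency check under that assumption; the per-model numbers are copied from the rows above.

```python
# (tasks with a demo, tasks with a scored verdict), copied from the leaderboard rows.
rows = {
    "Opus 4.8": (17, 13),
    "GLM-5.2": (21, 13),
    "Qwen 3.7": (5, 5),
    "Kimi K2.7": (23, 14),
    "Grok": (13, 0),
    "Kimi K2.7 · Fast": (3, 0),
    "Kimi K2.7 · No-Think": (3, 0),
    "Kimi K2.7 · Quality": (3, 0),
}

total_demos = sum(demos for demos, _ in rows.values())
total_verdicts = sum(scored for _, scored in rows.values())

print(len(rows), "models")        # 8 models
print(total_demos, "demos")       # 88 demos
print(total_verdicts, "verdicts") # 45 verdicts
```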

The matrix — tasks × models

Each cell is a live, playable demo. Click any thumbnail to run that model's attempt at that task. Empty cells mean we haven't tested that combo yet.

Medal = scored cell; "demo" = live demo with no medal shown.

| Task ↓ · Model → | Type | Opus 4.8 | GLM-5.2 | Qwen 3.7 | Kimi K2.7 | Grok | Kimi K2.7 · Fast | Kimi K2.7 · No-Think | Kimi K2.7 · Quality |
|---|---|---|---|---|---|---|---|---|---|
| Galaxy | Sim | 🥇 | 🥈 | | 🥈 | demo | demo | demo | demo |
| Solar | Sim | 🥇 | 🥇 | | 🥉 | | demo | demo | demo |
| Arcade | Game | 🥇 | 🥈 | 🥈 | 🥈 | demo | | | |
| Fluid | Sim | 🥈 | 🥇 | | 🥉 | demo | | | |
| Landing | Page | 🥇 | 🥇 | 🥉 | 🥉 | demo | | | |
| Orbit | Sim | 🥇 | 🥈 | 🥈 | 🥉 | demo | | | |
| Voxel | Visual | 🥈 | 🥇 | 🥉 | 🥉 | demo | | | |
| Blackhole | Sim | 🥇 | 🥈 | | 🥉 | demo | | | |
| Fractal | Sim | 🥈 | 🥉 | | 🥇 | demo | | | |
| Doom | Game | 🥇 | 🥉 | | 🥇 | | | | |
| Game | Game | | | | | | demo | demo | demo |
| Neoncity | Game | 🥈 | 🥇 | | 🥉 | | | | |
| Outrun | Game | 🥇 | 🥇 | | 🥉 | | | | |
| Pathtracer | | demo | | | | | | | |
| Raycaster | Game | 🥈 | 🥉 | | 🥇 | | | | |
| Terrain | Visual | demo | demo | | demo | | | | |
| Boids | Sim | | | | demo | demo | | | |
| Cloth | Sim | demo | | | | | | | |
| Lavalamp | Visual | | | | demo | demo | | | |
| Reactiondiff | | demo | | | | | | | |

Showing the 20 most-tested tasks. See all 32 tasks →

What is Goldie Bench?

Goldie Bench is a one-shot AI model leaderboard. Every model gets the same fixed prompt, single HTML file out, live on the page. No iteration. No "best of N." What came out on the first run is what's on the matrix, and I score each result 0–10 on whether it ran, how closely it hit the brief, and how good it looked.

I'm Julian Goldie. I run the AI Profit Boardroom (3,600+ founders) and build the Agent Operating System — a Mac-native dashboard where my AI crew dispatches frontier models from one shared kanban. Every demo on this page was built inside that stack. The bench is a slice of my daily work, not a synthetic benchmark.

Why I built this AI model leaderboard

Standard AI benchmarks — MMLU, HumanEval, SWE-bench, LiveCodeBench — measure narrow things. Did the model output text that matches an expected answer? Did the code compile? Did the patch pass the test? These benchmarks are useful for vendors who want to make a number go up. They have almost nothing to do with whether a frontier model can ship a thing on the first prompt.

The buyer audience for these models — founders, builders, indie devs, agencies, anyone running AI agents day to day — needs a different question answered: can you ask the model for a playable game, a working simulation, a deployable page, and have it ship in one prompt? That's the test that decides whether you'd wire the model into your stack.

Goldie Bench is that test. Same fixed prompt for every model. Single HTML out. Live on the page. Real 0–10 score from me, posted publicly with the reasoning. No vendor pays me. No score gets buried.

How I test — the methodology in 60 seconds

  1. I pick a creative coding prompt a frontier AI model should be able to ship in one shot — raycaster maze, fluid sim, neon city flythrough, top-down RPG, landing page.
  2. I dispatch the exact same prompt to each model from the kanban inside the Agent Operating System. No system-prompt cheats, no few-shot tricks.
  3. I save whatever .html file the model produces on the first run. No iteration. No coaching.
  4. I score each result 0–10 on three axes: did it run, did it hit the brief, did it look good.
  5. I publish the scores publicly in the source comparison guides on agentos.guide — and that's the data behind every cell on this page.
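The one-shot protocol in steps 2 and 3 can be sketched as a small harness. This is an illustration under assumptions, not the Agent Operating System's actual code: `run_one_shot`, the `generate` callables, and the output layout are all hypothetical names, and the fake models below stand in for real API calls.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def run_one_shot(
    task: str,
    prompt: str,
    models: Dict[str, Callable[[str], str]],  # hypothetical: model name -> text generator
    out_dir: Path,
) -> Dict[str, Path]:
    """Send the identical prompt to every model once and save the first output."""
    task_dir = out_dir / task
    task_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for name, generate in models.items():
        html = generate(prompt)  # exactly one call: no retries, no best-of-N
        path = task_dir / f"{name}.html"
        path.write_text(html, encoding="utf-8")
        saved[name] = path
    return saved


# Demo with fake models that record the prompt they receive.
seen_prompts = []


def fake_model(tag: str) -> Callable[[str], str]:
    def generate(prompt: str) -> str:
        seen_prompts.append(prompt)
        return f"<html><body>{tag}</body></html>"
    return generate


with tempfile.TemporaryDirectory() as tmp:
    saved = run_one_shot(
        "raycaster",
        "Build a raycaster maze in a single HTML file.",
        {"model-a": fake_model("a"), "model-b": fake_model("b")},
        Path(tmp),
    )
    outputs = {name: path.read_text(encoding="utf-8") for name, path in saved.items()}
    filenames = {name: path.name for name, path in saved.items()}
```

Keeping the prompt as a single string passed unchanged to every callable is what makes the comparison like-for-like; scoring stays a separate manual step.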

Full data provenance on the methodology page.

What makes Goldie Bench different from other AI model leaderboards

  • Live demos, not numbers in a table. Click any cell on the matrix and the model's actual one-shot HTML opens in a new tab. You don't have to trust my score — you can play the result and form your own opinion.
  • Real 0–10 scores from a working operator, not auto-evaluated against a reference. I run the models inside the same Agent OS my 3,600+ community uses daily, and I score on what would actually ship.
  • Same prompt across every model — no per-vendor handicapping. Every model sees the exact same string.
  • Publicly sourced data. Every score traces back to a published comparison guide on agentos.guide. Cells without a score are honestly marked as unranked.
  • Updated daily-ish. When a new model ships, I run the same prompts through it on camera (you can watch on YouTube, 400k+ subscribers) and the bench updates.

FAQ — about the bench

What's the best AI model right now?

Across 45 scored cells from my source guides, Opus 4.8 tops the leaderboard at 8.46/10 average — most consistent across game-feel, raycasters, and one-shot reasoning. GLM-5.2 is the open-weights challenger at 8.23/10 with the prettiest visual builds. Pick by use case — see Opus 4.8 vs GLM-5.2 for the head-to-head.

Is Goldie Bench the same as the LMArena, Hugging Face, or Chatbot Arena leaderboard?

No. Those rank models on human-preference Elo votes or quiz-style benchmarks. Goldie Bench ranks on a single fixed creative-coding task per cell — same prompt, one shot, scored 0–10 by me. Different methodology, different answer.

How often do new models get added?

Whenever a new frontier release lands. I usually wire it into the Agent OS the same week, run the fixed prompt set through it on camera, and the bench updates. Recent additions: GLM-5.2, Kimi K2.7, Qwen 3.7.

Can I add a model or task?

Ideas land best inside the AI Profit Boardroom community. Members propose tasks, I test, scores get added to the bench.

Are these scores affiliate-driven or vendor-sponsored?

No. No vendor pays me. No score is gated. The Agent OS that runs these tests is built and maintained by me; the community that powers the workflow lives inside AIPB. That's the only commercial layer on this site.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$100k+/mo community MRR