Category

AI Game benchmarks

12 game tasks where every frontier AI model gets the same one-shot prompt. Live, playable demos. Real 0–10 scores from Julian Goldie.

Top model in this category
Opus 4.8

What I'm testing in the Game category

Game tasks are the most useful AI model test on the bench, in my opinion. Building a playable game in one shot exposes whether the model can do three things at once: handle geometry, physics, and input properly; structure a coherent game loop; and avoid failing silently by shipping a page that "works but isn't a game". The gap between gold and bronze on game tasks is enormous: the gold demos genuinely play, while the bronze demos render but you can't do anything in them.

How I score Game tasks

Same three axes as the rest of the bench: runs (does the .html open to a working page), hits the brief (is the thing I asked for what came back), looks good (visual polish, motion, attention to detail). Each axis is scored 0–10 and the three are averaged into one score per task. The highest score on each task earns gold, the second silver, the third bronze. Models without a 0–10 verdict are listed as unranked on the leaderboard.
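To make the arithmetic concrete, here is a minimal Python sketch of the scoring described above: three 0–10 axes averaged into one score, then medals for the top three on a task. The names (`Verdict`, `rank_task`, the `model-a` style labels) are hypothetical and not taken from the bench itself. Two behaviours are assumptions of mine, since the scoring notes don't cover them: ties are broken by input order, and scored models below third place are labelled "no medal".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MEDALS = ["gold", "silver", "bronze"]


@dataclass
class Verdict:
    """One model's result on one task (hypothetical structure)."""
    model: str
    runs: Optional[float] = None        # does the .html open to a working page
    hits_brief: Optional[float] = None  # is it what was asked for
    looks_good: Optional[float] = None  # visual polish, motion, detail

    def score(self) -> Optional[float]:
        """Average of the three axes, or None if there is no full verdict."""
        axes = (self.runs, self.hits_brief, self.looks_good)
        if any(a is None for a in axes):
            return None
        return sum(axes) / len(axes)


def rank_task(verdicts: List[Verdict]) -> Dict[str, str]:
    """Map each model to gold, silver, bronze, 'no medal', or 'unranked'."""
    scored = [v for v in verdicts if v.score() is not None]
    # sorted() is stable, so tied scores keep their input order (an assumption).
    scored = sorted(scored, key=lambda v: v.score(), reverse=True)

    result: Dict[str, str] = {}
    for place, v in enumerate(scored):
        result[v.model] = MEDALS[place] if place < len(MEDALS) else "no medal"
    for v in verdicts:
        if v.score() is None:
            result[v.model] = "unranked"  # no 0-10 verdict recorded
    return result


if __name__ == "__main__":
    # Made-up scores purely for illustration.
    task = [
        Verdict("model-a", 10, 9, 8),
        Verdict("model-b", 7, 7, 7),
        Verdict("model-c", 9, 9, 10),
        Verdict("model-d"),  # no verdict yet
    ]
    for model, medal in rank_task(task).items():
        print(model, medal)
```

Averaging weights the three axes equally, so under this sketch a demo that opens fine and looks good but misses the brief can still land mid-table.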

Source guides for the Game category: see the methodology page for full data provenance.

Related

Other categories: Page, Sim, Visual · all tasks · all models

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$100k+/mo community MRR