AI Game benchmarks

12 game tasks where every frontier AI model gets the same one-shot prompt. Live, playable demos. Real 0–10 scores from Julian Goldie.

Top model in this category

Opus 4.8

What I'm testing in the Game category

Game tasks are, in my opinion, the most useful AI model test on the bench. Building a playable game in one shot exposes whether the model can do three things at once: handle geometry, physics and input properly; structure a coherent game loop; and avoid failing silently by shipping a page that 'works but isn't a game'. The gap between gold and bronze on game tasks is enormous: the gold demos genuinely play, while the bronze demos render but you can't do anything in them.

Every Game task on the bench

12 tasks, 26 total demos across all models. Click any task to see how every AI model handled the same prompt — side by side, live and playable.

Game

Arcade

Arcade — classic arcade-style game (pick: tetris, breakout, snake).

5 models

Game

Crypt

Crypt — torch-lit dungeon crawler.

1 model

Game

Dogfight

Dogfight — air-combat shooter.

1 model

Game

Doom

Doom — put monsters in the raycaster maze and let them chase you.

3 models

Game

Game — generic 'make a game' open prompt.

3 models

Game

Neoncity

Neon City — cyberpunk neon-lit city you drive through.

3 models

Game

Outrun

Outrun — synthwave horizon driving game with pseudo-3D road.

3 models

Game

Pool

Pool — physically simulated billiards game.

1 model

Game

Racing

3D Racer — third-person racing game with a track and obstacles.

1 model

Game

Raycaster

Raycaster Maze — build a Wolfenstein-style 3D maze you can walk through.

3 models

Game

Rpg

RPG — top-down RPG with sprites, combat, inventory.

1 model

Game

Skyrim

Skyrim-lite — first-person open-world fantasy explorer.

1 model

How I score Game tasks

Same three axes as the rest of the bench: runs (does the .html open to a working page), hits the brief (is the thing I asked for what came back), looks good (visual polish, motion, attention to detail). 0–10 each, averaged. Highest score on each task earns gold; second silver; third bronze. Models without a 0–10 verdict are listed as unranked on the leaderboard.
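
As a rough illustration of the scoring arithmetic described above, here is a minimal Python sketch. It is not the bench's actual code. The names `Demo`, `rank_task` and the `"no medal"` label are hypothetical, the model names and scores are made up for the example, and the tie-breaking rule (input order) is an assumption, since the page does not say how ties are handled.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

MEDALS = ("gold", "silver", "bronze")


@dataclass
class Demo:
    """One model's attempt at one task (hypothetical structure)."""
    model: str
    runs: Optional[float] = None    # does the .html open to a working page
    brief: Optional[float] = None   # is it the thing that was asked for
    looks: Optional[float] = None   # visual polish, motion, attention to detail

    def score(self) -> Optional[float]:
        """Average of the three 0-10 axes, or None if there is no verdict."""
        axes = (self.runs, self.brief, self.looks)
        if any(a is None for a in axes):
            return None
        return mean(axes)


def rank_task(demos: List[Demo]) -> Dict[str, str]:
    """Map each model to gold, silver, bronze, 'no medal' or 'unranked'."""
    # Demos without a verdict stay 'unranked', as the page describes.
    result = {d.model: "unranked" for d in demos}
    # sorted() is stable, so ties keep input order (an assumption, see above).
    scored = sorted(
        (d for d in demos if d.score() is not None),
        key=lambda d: d.score(),
        reverse=True,
    )
    for position, demo in enumerate(scored):
        # 'no medal' is an illustrative label for scored demos below third place.
        result[demo.model] = MEDALS[position] if position < len(MEDALS) else "no medal"
    return result


# Invented example data for a single task.
example = [
    Demo("model-a", runs=10.0, brief=9.0, looks=8.0),
    Demo("model-b", runs=7.0, brief=7.0, looks=7.0),
    Demo("model-c"),  # no 0-10 verdict yet
    Demo("model-d", runs=3.0, brief=4.0, looks=5.0),
    Demo("model-e", runs=1.0, brief=1.0, looks=1.0),
]
medals = rank_task(example)
```

With this invented data, `model-a` averages 9.0 and takes gold, `model-b` (7.0) takes silver, `model-d` (4.0) takes bronze, and `model-c` stays unranked because it has no verdict.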

Source guides for the Game category: see the methodology page for full data provenance.


The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders

258 documented wins

38 countries

$100k+/mo community MRR

