Every benchmark task

All AI benchmark tasks

50 one-shot creative coding tasks across 5 categories — every task is a fixed prompt I send to every frontier model on the bench. Same input, single HTML file out, scored 0–10 on three axes. Click any task to see how each model handled the same prompt, side by side.

Game

Arcade

Arcade — classic arcade-style game (pick: Tetris, Breakout, Snake).

31 models

Game

Crypt

Crypt — torch-lit dungeon crawler.

31 models

Game

Dogfight

Dogfight — air-combat shooter.

31 models

Game

Doom

Doom — put monsters in the raycaster maze and let them chase you.

32 models

Game

Dragonflight

Dragon Flight — fly a dragon through neon rings at speed, full HUD, fire-breath, fury meter.

31 models

Game

Dragonrealm

The Dragon Realm — Skyrim-style frozen open world, walk-into-the-snow, draw your sword. Julian's flagship deep-build prompt.

33 models

Game

Flightsim

Flight Simulator — take off, fly over terrain, full flight HUD, land on the runway.

23 models

Game

Game — generic 'make a game' open prompt.

31 models

Game

Gtadrive

GTA Drive — open-city driving sandbox: steal cars, outrun cops, traffic, wanted level, minimap.

23 models

Game

Gtafoot

GTA On-Foot — third-person city streets on foot: weapons, shooting, pedestrians, cover, wanted stars.

23 models

Game

Neonblaster

Neon Blaster — juicy arcade space shooter, waves, bosses, power-ups, screen-shake, synth music.

30 models

Game

Neoncity

Neon City — cyberpunk neon-lit city you drive through.

30 models

Game

Neonracer

Neon Racer — fullscreen neon racer with vapor-trail particle effects.

31 models

Game

Nordiccrypt

Nordic Crypt — torch-lit Nordic dungeon crawler, ancient ruin to explore, first-person.

31 models

Game

Outrun

Outrun — synthwave horizon driving game with pseudo-3D road.

31 models

Game

Parachute

Parachute Drop — jump from a plane, freefall, pull the chute, steer to land in a jungle clearing.

22 models

Game

Pool

Pool — physically simulated billiards game.

29 models

Game

Racing

3D Racer — third-person racing game with a track and obstacles.

31 models

Game

Raycaster

Raycaster Maze — build a Wolfenstein-style 3D maze you can walk through.

30 models

Game

Rpg

RPG — top-down RPG with sprites, combat, inventory.

31 models

Game

Skyrim

Skyrim-lite — first-person open-world fantasy explorer.

32 models

Game

Twilightvale

Twilight Vale — 3D open-world RPG with combat, terrain, weather.

30 models

Game

Voxelcraft

Voxel Craft — Minecraft-style sandbox, place + break blocks, day/night cycle.

29 models

Other

Matrixrain

Matrixrain — auto-discovered task.

10 models

Other

Mlx Speedtest

Mlx Speedtest — auto-discovered task.

10 models

Other

Neonsnake

Neonsnake — auto-discovered task.

10 models

Page

Aipbpromo

AIPB Promo — Remotion-style cinematic motion-graphics video advert for the AI Profit Boardroom, auto-playing scenes, animated stats, end CTA.

23 models

Page

Landing

Landing Page — modern marketing landing page (one-shot).

30 models

Page

Webos

Web-OS Desktop — a working desktop with windows, dock, Notes / Paint / Terminal apps.

30 models

Sim

Blackhole

Black Hole — gravitational lensing visualisation.

30 models

Sim

Boids

Boids — flocking-birds emergent behaviour simulation.

30 models

Sim

Cloth

Cloth — physical cloth simulation draping over an object.

30 models

Sim

Fluid

Fluid — WebGL fluid simulation with swirling particles.

30 models

Sim

Fractal

Fractal — interactive fractal explorer (Mandelbrot or Julia).

30 models

Sim

Galaxy

Galaxy — particle galaxy you can swirl with your mouse.

30 models

Sim

Orbit

Orbit — N-body gravitational simulation.

30 models

Sim

Particleforge

Particle Forge — sculpt swirling particle systems with mouse gravity.

30 models

Sim

Pathtracer

Path Tracer — physically correct ray-traced renderer.

29 models

Sim

Reactiondiff

Reaction-Diffusion — Turing pattern generator.

30 models

Sim

Solar

Solar — accurate planetary solar system.

30 models

Sim

Wormhole

Wormhole — 3D wormhole tunnel flythrough.

30 models

Visual

Aurora

Aurora — northern lights animation.

31 models

Visual

Fireworks

Fireworks — interactive fireworks display.

30 models

Visual

Lavalamp

Lava Lamp — slow blob morph animation.

30 models

Visual

Matrix

Matrix — Matrix-rain falling-glyphs animation.

30 models

Visual

Plasma

Plasma — hypnotic full-screen plasma effect with palette switcher, click ripples.

30 models

Visual

Synthwave

Synthwave — sunset-grid synthwave loop.

30 models

Visual

Terrain

Terrain — procedural 3D terrain explorer.

30 models

Visual

Voxel

Voxel — voxel-art landscape (Minecraft-style).

30 models

Visual

Waves

Waves — animated ocean wave simulation.

30 models

Why these tasks?

Every task on this page is a creative coding prompt that frontier AI models should be able to ship in one shot. The categories were picked to stress different capabilities: Game tasks test geometry + game-loop + input + state; Sim tasks test math + visual taste; Visual tasks test pure aesthetic judgment; Page tasks test product instinct.

If a model can ship the Game tasks cleanly, it's safe to wire into an agent loop. If it can ship Visual cleanly, it has design taste — useful for content workflows. The benches you read about online (HumanEval, MMLU, SWE-bench) measure a different thing entirely; this bench measures what shows up when you ask a frontier model for a working artifact.

How I add a new task

A new task lands when one of three things happens: (1) a new frontier model ships and an existing task needs an updated baseline; (2) someone inside the AI Profit Boardroom proposes one and we agree it's a useful test; (3) I'm building something with the Agent OS and the prompt itself is novel enough to deserve a permanent cell.

The bar is: would a working operator try the prompt themselves? If yes — it's a task. If it's a synthetic stress test of one capability, it's probably better measured by HumanEval-style benchmarks; that's not what Goldie Bench is for.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 4,000+ founders shipping with it every day all live inside the AI Profit Boardroom.

4,000+ founders

258 documented wins

38 countries

$59/mo monthly
