All AI benchmark tasks
32 one-shot creative coding tasks across 4 categories — every task is a fixed prompt I send to every frontier model on the bench. Same input, single HTML file out, scored 0–10 on three axes. Click any task to see how each model handled the same prompt, side by side.
Game
Page
Sim
Visual
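The setup described above (a fixed prompt per task, four categories, a single HTML file out, three 0–10 scores) can be sketched as a small data model. This is only an illustration, not how the bench is actually stored: the class names `Task` and `Result`, the fields `slug`, `model`, and `html`, and the example values are all hypothetical, and the three axes are left unnamed because the page does not name them.

```python
from __future__ import annotations

from dataclasses import dataclass

# The four categories listed on this page.
CATEGORIES = ("Game", "Page", "Sim", "Visual")


@dataclass(frozen=True)
class Task:
    slug: str      # hypothetical identifier for the task
    category: str  # one of CATEGORIES
    prompt: str    # fixed prompt, sent unchanged to every model

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


@dataclass(frozen=True)
class Result:
    task: Task
    model: str     # hypothetical model label
    html: str      # the single HTML file the model returned
    scores: tuple[float, float, float]  # three axes, each 0-10 (axis names assumed unknown)

    def __post_init__(self) -> None:
        if len(self.scores) != 3:
            raise ValueError("expected exactly three axis scores")
        if any(not 0 <= s <= 10 for s in self.scores):
            raise ValueError("each axis score must be between 0 and 10")


# Example values are placeholders, not real bench data.
task = Task(slug="example-game", category="Game", prompt="(fixed prompt text)")
result = Result(task=task, model="example-model", html="<!doctype html>", scores=(7, 8, 6))
```

Keeping the prompt on the task rather than on the result mirrors the "same input" rule: every model's result points at the same unmodified prompt.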
Why these tasks?
Every task on this page is a creative coding prompt that frontier AI models should be able to ship in one shot. The categories were picked to stress different capabilities: Game tasks test geometry + game-loop + input + state; Sim tasks test math + visual taste; Visual tasks test pure aesthetic judgment; Page tasks test product instinct.
If a model can ship the Game tasks cleanly, it's safe to wire into an agent loop. If it can ship the Visual tasks cleanly, it has design taste, which is useful for content workflows. The benchmarks you read about online (HumanEval, MMLU, SWE-bench) measure something else entirely; this bench measures what shows up when you ask a frontier model for a working artifact.
How I add a new task
A new task lands when one of three things happens: (1) a new frontier model ships and an existing task needs an updated baseline; (2) someone inside the AI Profit Boardroom proposes one and we agree it's a useful test; (3) I'm building something with the Agent OS and the prompt itself is novel enough to deserve a permanent cell.
The bar is: would a working operator try the prompt themselves? If yes, it's a task. If the prompt is only a synthetic stress test of a single capability, HumanEval-style benchmarks probably measure it better; that's not what Goldie Bench is for.
Run this stack yourself.
Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.