AI Other benchmarks
3 tasks in the Other category, where every frontier AI model gets the same one-shot prompt. Live, playable demos. Real 0–10 scores from Julian Goldie.
What I'm testing in the Other category
The Other category covers 3 one-shot creative coding tasks where the model has to ship a complete, working build from a single fixed prompt.
Every Other task on the bench
3 tasks, 3 total demos across all models. Click any task to see how every AI model handled the same prompt — side by side, live and playable.
How I score Other tasks
Same three axes as the rest of the bench: runs (does the .html open to a working page), hits the brief (is the thing I asked for what came back), looks good (visual polish, motion, attention to detail). Each axis is scored 0–10, and the three are averaged into one score per demo. The highest score on each task earns gold, the second silver, and the third bronze. Models without a 0–10 verdict are listed as unranked on the leaderboard.
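The scoring described above could be sketched roughly like this in Python. It's an illustration, not the bench's actual code: the `Verdict` class, the `rank_task` function, the field names, and the "no medal" label for fourth place and below are all hypothetical, and the tie-breaking (input order) is an assumption, since nothing here says how ties are handled.

```python
from dataclasses import dataclass
from typing import Optional

MEDALS = ["gold", "silver", "bronze"]


@dataclass
class Verdict:
    """One model's result on one task. Axis names are illustrative."""
    model: str
    runs: Optional[float] = None        # does the .html open to a working page
    hits_brief: Optional[float] = None  # is it what was asked for
    looks_good: Optional[float] = None  # visual polish, motion, detail

    def score(self) -> Optional[float]:
        """Average of the three 0-10 axes, or None if any axis is unscored."""
        axes = (self.runs, self.hits_brief, self.looks_good)
        if any(a is None for a in axes):
            return None
        return sum(axes) / len(axes)


def rank_task(verdicts: list[Verdict]) -> dict[str, str]:
    """Map each model to gold/silver/bronze, 'no medal', or 'unranked'."""
    scored = [(v.score(), v.model) for v in verdicts if v.score() is not None]
    # sort() is stable, so tied scores keep their input order (an assumption).
    scored.sort(key=lambda pair: pair[0], reverse=True)

    placements: dict[str, str] = {}
    for position, (_, model) in enumerate(scored):
        placements[model] = MEDALS[position] if position < len(MEDALS) else "no medal"

    # Anything without a full 0-10 verdict is listed as unranked.
    for v in verdicts:
        placements.setdefault(v.model, "unranked")
    return placements


if __name__ == "__main__":
    results = [
        Verdict("model-a", runs=9, hits_brief=8, looks_good=7),
        Verdict("model-b", runs=10, hits_brief=10, looks_good=10),
        Verdict("model-c"),  # no verdict yet
    ]
    print(rank_task(results))
```

The model names in the usage block are placeholders, not real entries from the leaderboard.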
Source guides for the Other category: see the methodology page for full data provenance.
Run this stack yourself.
Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.