AI Visual benchmarks
8 visual tasks where every frontier AI model gets the same one-shot prompt. Live, playable demos. Real 0–10 scores from Julian Goldie.
What I'm testing in the Visual category
Visual tasks (aurora, fireworks, lava lamp, voxel landscape, synthwave) are pure aesthetic tests. There's no game loop to debug, just whether the model ships a polished, on-brand visual on the first try. GLM-5.2 tends to dominate here. Opus is consistent. Kimi comes out plainer on these than on game/sim tasks, and the Visual column is what drags its overall average down to bronze.
Every Visual task on the bench
8 tasks, 16 total demos across all models. Click any task to see how every AI model handled the same prompt — side by side, live and playable.
How I score Visual tasks
Same three axes as the rest of the bench: runs (does the .html open to a working page), hits the brief (is the thing I asked for what came back), looks good (visual polish, motion, attention to detail). 0–10 each, averaged. Highest score on each task earns gold; second silver; third bronze. Models without a 0–10 verdict are listed as unranked on the leaderboard.
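Here's a rough sketch of that scoring in Python, in case the arithmetic is easier to follow as code. It is an illustration only, not the bench's actual tooling: the function names, the placeholder model names, the "no medal" label for fourth place and below, and the tie handling (ties keep input order) are all assumptions of mine.

```python
from statistics import mean
from typing import Optional

MEDALS = ["gold", "silver", "bronze"]


def task_score(runs: float, hits_brief: float, looks_good: float) -> float:
    """Average the three 0-10 axis scores into one task score."""
    for value in (runs, hits_brief, looks_good):
        if not 0 <= value <= 10:
            raise ValueError(f"axis score out of range: {value}")
    return mean([runs, hits_brief, looks_good])


def award_medals(
    verdicts: dict[str, Optional[tuple[float, float, float]]]
) -> dict[str, str]:
    """Map each model to gold/silver/bronze, "no medal", or "unranked".

    A value of None means the model has no 0-10 verdict for this task.
    """
    # Only models with a verdict get a score and a place in the ranking.
    scored = {
        model: task_score(*axes)
        for model, axes in verdicts.items()
        if axes is not None
    }
    # sorted() is stable, so tied models keep their input order (assumption).
    ranked = sorted(scored, key=scored.get, reverse=True)

    result: dict[str, str] = {}
    for place, model in enumerate(ranked):
        # "no medal" for fourth place onward is a label assumed for this sketch.
        result[model] = MEDALS[place] if place < len(MEDALS) else "no medal"
    for model, axes in verdicts.items():
        if axes is None:
            result[model] = "unranked"
    return result


# Made-up axis scores for placeholder models, purely to exercise the functions.
example = award_medals({
    "model_a": (9, 8, 10),
    "model_b": (7, 7, 8),
    "model_c": None,
})
```

The one design choice worth noting: models without a verdict are left out of the ranking entirely rather than scored as zero, so they cannot push a scored model off the podium.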
Source guides for the Visual category: see the methodology page for full data provenance.
Run this stack yourself.
Every demo on this bench was built inside the Agent Operating System: one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates and the weekly walkthroughs all live inside the AI Profit Boardroom, alongside 3,600+ founders shipping with it every day.