Category

AI Visual benchmarks

8 visual tasks where every frontier AI model gets the same one-shot prompt. Live, playable demos. Real 0–10 scores from Julian Goldie.

Top model in this category
GLM-5.2 →

What I'm testing in the Visual category

Visual tasks (aurora, fireworks, lava lamp, voxel landscape, synthwave) are pure aesthetic tests. There's no game loop to debug, just whether the model ships a polished, on-brand visual on the first try. GLM-5.2 tends to dominate here. Opus is consistent. Kimi comes out plainer on these than on game/sim tasks; the Visual column is what drags its average down to bronze.

How I score Visual tasks

Same three axes as the rest of the bench: runs (does the .html open to a working page), hits the brief (did I get back what I asked for), looks good (visual polish, motion, attention to detail). Each axis is scored 0–10 and the three are averaged into the task score. The highest score on each task earns gold, the second silver, the third bronze. Models without a 0–10 verdict are listed as unranked on the leaderboard.
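For anyone who wants to reproduce the arithmetic, here is a minimal sketch of that scoring in Python. It is not the bench's actual code: the `Verdict` class, the `award_medals` helper, and the "no medal" label are hypothetical names, the model names and numbers are made up for illustration, and it assumes a plain unweighted mean with no special tie-breaking.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    """One model's verdict on one task (hypothetical structure)."""
    model: str
    runs: Optional[float] = None        # does the .html open to a working page
    hits_brief: Optional[float] = None  # is the result what was asked for
    looks_good: Optional[float] = None  # polish, motion, attention to detail

    def score(self) -> Optional[float]:
        """Unweighted mean of the three 0-10 axes, or None if any is missing."""
        axes = (self.runs, self.hits_brief, self.looks_good)
        if any(a is None for a in axes):
            return None
        return sum(axes) / len(axes)


def award_medals(verdicts: list[Verdict]) -> dict[str, str]:
    """Map each model to gold/silver/bronze, 'no medal', or 'unranked'."""
    medals = {
        v.model: "unranked" if v.score() is None else "no medal"
        for v in verdicts
    }
    # Assumption: ties keep their input order (sorted() is stable).
    ranked = sorted(
        (v for v in verdicts if v.score() is not None),
        key=lambda v: v.score(),
        reverse=True,
    )
    for medal, v in zip(("gold", "silver", "bronze"), ranked):
        medals[v.model] = medal
    return medals


# Made-up verdicts for a single task, for illustration only.
example = [
    Verdict("model-a", runs=10, hits_brief=9, looks_good=8),
    Verdict("model-b", runs=7, hits_brief=7, looks_good=7),
    Verdict("model-c", runs=10, hits_brief=10, looks_good=9),
    Verdict("model-d", runs=5, hits_brief=6, looks_good=7),
    Verdict("model-e"),  # no 0-10 verdict yet
]
print(award_medals(example))
```

Here model-c averages highest and takes gold, model-a silver, model-b bronze; model-d has a score but no medal, and model-e stays unranked because it has no verdict.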

Source guides for the Visual category: see the methodology page for full data provenance.

Related

Other categories: Game, Page, Sim · all tasks · all models

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$100k+/mo community MRR