AI Sim benchmarks

11 sim tasks where every frontier AI model gets the same one-shot prompt. Live, playable demos. Real 0–10 scores from Julian Goldie.

Top model in this category

Opus 4.8 →

What I'm testing in the Sim category

Simulation tasks are where the cinematic gap shows. A great fluid sim, particle galaxy, or N-body orbit isn't about whether the model writes correct math — every frontier model can do that. It's about whether the model picks the right colour palette, the right damping, the right camera angle on the first try. The Sim category usually surfaces which models 'have taste' versus which models just execute the prompt literally.

Every Sim task on the bench

11 tasks, 41 total demos across all models. Click any task to see how every AI model handled the same prompt — side by side, live and playable.

Sim

Blackhole

Black Hole — gravitational lensing visualisation.

4 models

Sim

Boids

Boids — flocking-birds emergent behaviour simulation.

2 models

Sim

Cloth

Cloth — physical cloth simulation draping over an object.

2 models

Sim

Fluid

Fluid — WebGL fluid simulation with swirling particles.

5 models

Sim

Fractal

Fractal — interactive fractal explorer (Mandelbrot or Julia).

4 models

Sim

Galaxy

Galaxy — particle galaxy you can swirl with your mouse.

7 models

Sim

Orbit

Orbit — N-body gravitational simulation.

5 models

Sim

Pathtracer

Path Tracer — physically correct ray-traced renderer.

3 models

Sim

Reactiondiff

Reaction-Diffusion — Turing pattern generator.

2 models

Sim

Solar

Solar — accurate planetary solar system.

6 models

Sim

Wormhole

Wormhole — 3D wormhole tunnel flythrough.

1 model

How I score Sim tasks

Same three axes as the rest of the bench: runs (does the .html open to a working page), hits the brief (is the thing I asked for what came back), looks good (visual polish, motion, attention to detail). 0–10 each, averaged. Highest score on each task earns gold; second silver; third bronze. Models without a 0–10 verdict are listed as unranked on the leaderboard.
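The arithmetic above can be sketched in a few lines of Python. This is an illustration only, not the bench's actual scoring code: the function names, the "no medal" label, and the model names are hypothetical, and the source does not say how ties or rounding are handled, so ties here simply keep their input order.

```python
from statistics import mean

MEDALS = ["gold", "silver", "bronze"]


def task_score(runs, hits_brief, looks_good):
    """Average the three 0-10 axis scores into one task score."""
    axes = (runs, hits_brief, looks_good)
    if any(not 0 <= a <= 10 for a in axes):
        raise ValueError("each axis score must be between 0 and 10")
    return mean(axes)


def rank_task(verdicts):
    """Map each model to a medal, "no medal", or "unranked".

    verdicts: {model_name: (runs, hits_brief, looks_good) or None}
    None means the model has no 0-10 verdict for this task.
    Assumption: ties keep input order (the source does not specify).
    """
    scored = {m: task_score(*v) for m, v in verdicts.items() if v is not None}
    # Highest average first; sorted() is stable, so equal scores keep input order.
    ordered = sorted(scored, key=scored.get, reverse=True)

    result = {}
    for place, model in enumerate(ordered):
        result[model] = MEDALS[place] if place < len(MEDALS) else "no medal"
    for model, verdict in verdicts.items():
        if verdict is None:
            result[model] = "unranked"
    return result


# Hypothetical models and scores, for illustration only.
example = {
    "model_a": (10, 8, 9),
    "model_b": (10, 6, 5),
    "model_c": (8, 8, 8),
    "model_d": None,
    "model_e": (4, 5, 3),
}
print(rank_task(example))
```

Keeping the per-axis scores separate until the final average makes it easy to see why a demo that runs perfectly can still lose to one that looks better.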

Source guides for the Sim category: see the methodology page for full data provenance.

Other categories: Game, Page, Visual · all tasks · all models

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders

258 documented wins

38 countries

$100k+/mo community MRR

Join AIPB · $59/mo → Read the Agent OS guides →
