Best AI model for Blackhole

Black Hole — gravitational lensing visualisation.

🥇 Best model for Blackhole: Opus 4.8
"Opus nailed it — a pure-black event horizon, a bright photon ring, and the disk bent up and over the top exactly like the film's lensing. GLM came in strong with a clean ring and a starfield warping past the hole. Kimi's disk is fine, but the background is a soft grey blur instead of stars. This one's Opus's."

The prompt — what I asked each model

Every model on this page got the same fixed prompt inside the Agent Operating System: Black Hole — gravitational lensing visualisation.

Single HTML file out. No iteration. No "best of N." No examples in the system prompt. Whatever each model produced on the first run is what's on this page. Four models have attempted it so far: Opus 4.8, GLM-5.2, Kimi K2.7 and Grok.

What counts as winning here

This is the sim category on Goldie Bench. The question isn't "did the model write code that compiles" — the question is "did the model ship a thing you'd actually use." For Blackhole that means three things, in order:

  1. Does it run? Drop the .html file in a browser. If it opens to a broken page, it scores zero on the first axis.
  2. Did it hit the brief? The prompt asks for a specific thing. A model that ships a different thing — however polished — gets docked on the brief axis.
  3. Does it look good? Visual polish, motion, interactivity, attention to detail. This is where the difference between gold and silver usually lives.

Each model's final score is the average of my honest 0–10 scores on the three axes. Across the 3 models I've scored on this task so far, the average final score is 7.67/10.
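
Here is a minimal sketch of that arithmetic in Python, assuming the final score is a plain unweighted mean of the three axis scores. The names `AxisScores` and `task_average` are hypothetical, and the numbers are placeholders chosen only to show the calculation; they are not the scores I gave any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    """One model's 0-10 scores on the three axes (hypothetical structure)."""
    runs: float   # axis 1: does it run?
    brief: float  # axis 2: did it hit the brief?
    looks: float  # axis 3: does it look good?

    def __post_init__(self):
        # Reject anything outside the 0-10 range used on the bench.
        for name in ("runs", "brief", "looks"):
            value = getattr(self, name)
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be between 0 and 10, got {value}")

    def final(self) -> float:
        # Assumption: an unweighted mean of the three axes.
        return (self.runs + self.brief + self.looks) / 3


def task_average(scores):
    """Mean of the final scores across all scored models, to two decimals."""
    finals = [s.final() for s in scores]
    return round(sum(finals) / len(finals), 2)


# Placeholder scores for three unnamed models, for illustration only.
placeholders = [AxisScores(10, 9, 8), AxisScores(8, 8, 8), AxisScores(7, 6, 5)]
print(task_average(placeholders))
```

Any three final scores that sum to 23 give a task average of 7.67, so the placeholder values above are just one of many combinations that would produce that figure.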

How I tested this — the methodology in 60 seconds

Every comparison on Goldie Bench follows the same recipe:

  1. I pick a creative coding prompt that a frontier model should be able to ship in one shot.
  2. I dispatch the exact same prompt to each model from the kanban inside the Agent Operating System.
  3. I save whatever .html file the model produced on the first run. No iteration. No coaching.
  4. I score each result 0–10 on my three axes (runs / hits the brief / looks good).
  5. I publish the scores publicly in the source comparison guides on agentos.guide — and that's what feeds this page.
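
Steps 2 and 3 of that recipe can be sketched as a small one-shot loop. This is an illustration only, not how the Agent Operating System works internally: `run_once` and the `generate` callable are hypothetical stand-ins for whatever actually dispatches the prompt, and the model names in the usage note are placeholders.

```python
from pathlib import Path
from typing import Callable, Dict, List

# The fixed prompt, copied verbatim from this page.
PROMPT = "Black Hole — gravitational lensing visualisation."


def run_once(
    models: List[str],
    generate: Callable[[str, str], str],  # hypothetical: (model, prompt) -> HTML text
    out_dir: Path,
) -> Dict[str, Path]:
    """Send the same prompt to every model exactly once and save the result."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: Dict[str, Path] = {}
    for model in models:
        # One call per model: no retries, no best-of-N, no coaching.
        html = generate(model, PROMPT)
        path = out_dir / f"{model}.html"
        path.write_text(html, encoding="utf-8")
        saved[model] = path
    return saved
```

Calling `run_once(["model-a", "model-b"], my_generate, Path("out"))` would leave one `.html` file per model in `out/`, ready to open in a browser and score by hand.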

See the methodology page for full data provenance, including which source guides each cell's score came from.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders · 258 documented wins · 38 countries · $100k+/mo community MRR