Best AI model for Doom
Doom — put monsters in the raycaster maze and let them chase you.
The prompt — what I asked each model
Every model on this page got the same fixed prompt inside the Agent Operating System: Doom — put monsters in the raycaster maze and let them chase you.
Single HTML file out. No iteration. No "best of N." No examples in the system prompt. Whatever each model produced on the first run is what's on this page. Three models have attempted it so far: Kimi K2.7, Opus 4.8, and GLM-5.2.
What counts as winning here
This is the game category on Goldie Bench. The question isn't "did the model write code that compiles" — the question is "did the model ship a thing you'd actually use." For Doom that means three things, in order:
- Does it run? Drop the .html file in a browser. If it opens to a broken page, it scores zero on the first axis.
- Did it hit the brief? The prompt asks for a specific thing. A model that ships a different thing — however polished — gets docked on the brief axis.
- Does it look good? Visual polish, motion, interactivity, attention to detail. This is where the difference between gold and silver usually lives.
The final score is the average of my honest 0–10 ratings across the three axes. Across the 3 models I've scored on this task so far, the average score is 8.33/10.
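To make the arithmetic concrete, here is a minimal sketch of how a final score could be computed. It assumes the final score is a plain unweighted mean of the three axis ratings, which is my reading of the sentence above rather than a published formula. The axis keys (`runs`, `hits_brief`, `looks_good`), the function names, and every number in the example are hypothetical, not real results from this page.

```python
from statistics import mean

# Hypothetical keys for the three axes described above.
AXES = ("runs", "hits_brief", "looks_good")


def final_score(axis_scores: dict[str, float]) -> float:
    """Average the three 0-10 axis ratings into one 0-10 score.

    Assumption: an unweighted mean. The page does not state any weighting.
    """
    for axis in AXES:
        if axis not in axis_scores:
            raise ValueError(f"missing axis: {axis}")
        if not 0 <= axis_scores[axis] <= 10:
            raise ValueError(f"{axis} must be between 0 and 10")
    return mean(axis_scores[axis] for axis in AXES)


def task_average(final_scores: list[float]) -> float:
    """Average the per-model final scores for one task, rounded to 2 places."""
    return round(mean(final_scores), 2)


if __name__ == "__main__":
    # Made-up ratings for one made-up model.
    print(final_score({"runs": 10, "hits_brief": 8, "looks_good": 9}))
    # Three made-up final scores that happen to sum to 25.
    print(task_average([9, 8, 8]))  # 8.33
```

An average of 8.33 over three models means the three final scores sum to 25, so a spread such as 9, 8, 8 would produce it. That spread is only an illustration; the actual per-model scores live on the tiles.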
Every model's attempt — ranked by my 0–10 score
Models ranked by medal (highest score = 🥇 gold, second = 🥈 silver, third = 🥉 bronze). Click any tile to play that model's actual one-shot HTML.
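The medal rule above boils down to a sort by score, highest first. A rough sketch follows, with hypothetical model names and scores. How ties are broken is my assumption (original order is kept), since the page doesn't say.

```python
# Medal labels in rank order, as described above.
MEDALS = ["gold", "silver", "bronze"]


def rank_by_medal(scores: dict[str, float]) -> list[tuple[str, str, float]]:
    """Return (medal, model, score) tuples, highest score first.

    Assumption: ties keep their original insertion order, because
    Python's sort is stable even with reverse=True. Models ranked
    below third place get an empty medal string.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (MEDALS[i] if i < len(MEDALS) else "", name, score)
        for i, (name, score) in enumerate(ordered)
    ]


if __name__ == "__main__":
    # Placeholder names and numbers, not the real leaderboard.
    for medal, name, score in rank_by_medal(
        {"model_a": 8, "model_b": 9, "model_c": 8}
    ):
        print(f"{medal:>6}  {name}  {score}/10")
```

With only three models on this task, every entry gets a medal; the empty-string fallback only matters once a fourth model is scored.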
How I tested this — the methodology in 60 seconds
Every comparison on Goldie Bench follows the same recipe:
- I pick a creative coding prompt that a frontier model should be able to ship in one shot.
- I dispatch the exact same prompt to each model from the kanban inside the Agent Operating System.
- I save whatever .html file the model produced on the first run. No iteration. No coaching.
- I score each result 0–10 on my three axes (runs / hits the brief / looks good).
- I publish the scores in the source comparison guides on agentos.guide, and that's what feeds this page.
See the methodology page for full data provenance, including which source guides each cell's score came from.
Run this stack yourself.
Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.