All AI models
Every frontier AI model I've put through the Goldie Bench task set — same fixed prompts, one shot each, scored 0–10 by me. Click any card for the full model review, every demo, and direct head-to-head comparisons with the rest of the field.
How I pick which AI models to add to the bench
The criteria are simple. It has to be a frontier model (current generation, claimed as competitive with the others on this page), and it has to be runnable on Day Zero: if a vendor ships a model and I can dispatch the same fixed prompt set through it inside the Agent Operating System that same week, the bench gets a new column.
Recent additions: GLM-5.2 (Zhipu), Kimi K2.7 (Moonshot AI), Qwen 3.7 (Alibaba), Opus 4.8 (Anthropic). Currently unranked but on the bench: Grok.
What's not on this page
I don't bench every model that ships — I'm deliberately narrow. No GPT-3.5 or other generation-behind models. No tiny coder-tuned 7Bs unless they claim frontier coding capability. No vendor sandbox models that won't let me run my fixed prompts. The goal is a useful comparison across the 6–10 models a working operator would actually consider for an agent stack — not a leaderboard with 50 cells where most of them are decorative.
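As a rough sketch, the inclusion and exclusion rules above can be read as a single yes/no check. The field names and the second example entry are hypothetical, made up for illustration; this is not the bench's actual code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record describing a model being considered for the bench."""
    name: str
    current_generation: bool       # not a generation-behind model
    claims_frontier: bool          # claimed as competitive with the rest of the field
    runs_fixed_prompts: bool       # not a vendor sandbox that blocks the fixed prompt set
    runnable_in_launch_week: bool  # the "Day Zero" test: dispatchable the week it ships


def qualifies(c: Candidate) -> bool:
    """Return True only if every criterion described in the text is met."""
    return (
        c.current_generation
        and c.claims_frontier
        and c.runs_fixed_prompts
        and c.runnable_in_launch_week
    )


# GPT-3.5 is named in the text as a generation-behind model, so it is excluded.
old_model = Candidate("GPT-3.5", False, False, True, True)

# A made-up entry that passes every check (hypothetical name).
new_model = Candidate("example-frontier-model", True, True, True, True)

shortlist = [c.name for c in (old_model, new_model) if qualifies(c)]
```

Here `shortlist` ends up containing only the made-up frontier entry. A small coder-tuned model would fail on `claims_frontier` unless it claims frontier coding capability, which matches the exception in the text.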
If you think I should add a model, propose it inside the AI Profit Boardroom.
Run this stack yourself.
Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.