How I pick which AI models to add to the bench

There are two criteria. The model has to be a frontier model: current generation, and claimed as competitive with the others on this page. And it has to be runnable on Day Zero: if a vendor ships a model and I can dispatch the same fixed prompt set through it inside the Agent Operating System the same week, the bench gets a new column.

Recent additions: GLM-5.2 (Zhipu), Kimi K2.7 (Moonshot AI), Qwen 3.7 (Alibaba), Opus 4.8 (Anthropic). Currently unranked but on the bench: Grok.

What's not on this page

I don't bench every model that ships — I'm deliberately narrow. No GPT-3.5 or other generation-behind models. No tiny coder-tuned 7Bs unless they claim frontier coding capability. No vendor sandbox models that won't let me run my fixed prompts. The goal is a useful comparison across the 6–10 models a working operator would actually consider for an agent stack — not a leaderboard with 50 cells where most of them are decorative.
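As a rough sketch, the inclusion rules above can be written as a single check. This is an illustration only, not the actual tooling behind the bench: the `Candidate` fields, the `qualifies` function, and the 7-day window (my reading of "the same week") are all assumptions, and the example models are hypothetical.

```python
from dataclasses import dataclass

# Assumption: "the same week" is read as a 7-day window after release.
RELEASE_WINDOW_DAYS = 7


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record describing a model proposed for the bench."""
    name: str
    current_generation: bool   # False for generation-behind models
    claims_frontier: bool      # vendor claims it competes with the frontier
    runs_fixed_prompts: bool   # False for sandboxes that block the fixed prompt set
    days_to_first_run: int     # days from release until the prompt set could be dispatched


def qualifies(c: Candidate) -> bool:
    """Return True only if the candidate meets every inclusion rule."""
    return (
        c.current_generation
        and c.claims_frontier
        and c.runs_fixed_prompts
        and c.days_to_first_run <= RELEASE_WINDOW_DAYS
    )


# Hypothetical candidates, one per rule.
candidates = [
    Candidate("new-frontier-model", True, True, True, 2),
    Candidate("generation-behind-model", False, True, True, 0),
    Candidate("small-coder-7b", True, False, True, 1),
    Candidate("sandbox-only-model", True, True, False, 0),
]

bench = [c.name for c in candidates if qualifies(c)]
print(bench)
```

A small coder-tuned model passes the same check once it claims frontier coding capability, which is why that rule is a flag rather than a size cutoff.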

If you think I should add a model, propose it inside the AI Profit Boardroom.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$100k+/mo community MRR