Every AI model on the bench

All AI models

Every frontier AI model I've put through the Goldie Bench task set — same fixed prompts, one shot each, scored 0–10 by me. Click any card for the full model review, every demo, and direct head-to-head comparisons with the rest of the field.
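For anyone who wants to reproduce the card stats from raw per-task scores, here is a minimal Python sketch. It rests on assumptions the page does not spell out: that `avg` is the plain mean of a model's 0–10 task scores, that 🥇 means the top score on a task and 🥈 the second-highest distinct score, and that ties share the medal. The `summarize` function, the data shape, and the sample scores are all hypothetical.

```python
from collections import defaultdict


def summarize(scores):
    """Turn {model: {task: score}} into per-model card stats.

    Assumed rules (not confirmed by the bench): avg is the plain mean,
    gold = highest score on a task, silver = second-highest distinct
    score on a task, and tied models share the medal.
    """
    summary = {
        model: {
            "avg": round(sum(tasks.values()) / len(tasks), 2),
            "tasks": len(tasks),
            "gold": 0,
            "silver": 0,
        }
        for model, tasks in scores.items()
        if tasks  # skip models with no scored tasks
    }

    # Regroup the scores by task so each task can be ranked on its own.
    by_task = defaultdict(list)
    for model, tasks in scores.items():
        for task, score in tasks.items():
            by_task[task].append((score, model))

    for entries in by_task.values():
        distinct = sorted({score for score, _ in entries}, reverse=True)
        for score, model in entries:
            if score == distinct[0]:
                summary[model]["gold"] += 1
            elif len(distinct) > 1 and score == distinct[1]:
                summary[model]["silver"] += 1
    return summary


# Made-up scores for three placeholder models, for illustration only.
sample = {
    "model-a": {"task-1": 9, "task-2": 7},
    "model-b": {"task-1": 8, "task-2": 7},
    "model-c": {"task-1": 6},
}
stats = summarize(sample)
```

Under these assumed rules a model that skipped a task is neither penalized nor credited for it, which would explain why cards with different task counts can still be compared on `avg`.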

Fusion OpenRouter

Multi-model panel — Fable 5 + GPT-5.5, ensembled. Beats Fable 5 at half the price.

8.59 avg · 47 tasks · 21 🥇 · 3 🥈

Claude Opus 5 Anthropic

The new Anthropic flagship — benched on all 45 one-shot builds the day it landed.

8.27 avg · 50 tasks · 13 🥇 · 7 🥈

Hermes MoA Hermes · Mixture of Agents

A panel of frontier models, merged by a chair. The model doesn't matter — the system does.

8.17 avg

OpenAI's flagship — the Sun of the 5.6 lineup.

8.16 avg · 50 tasks · 3 🥇 · 8 🥈

Claude Fable 5 Anthropic

The newest Anthropic model — first Mythos-class made generally available.

8.10 avg

Alibaba's 2.4T flagship — benched through Qoder.

8.10 avg

Snappy + real-time — the X-native model.

8.09 avg

1M-context frontier model at $0.30/M tokens — cheapest big-context model on the bench.

7.97 avg

Sakana's multi-agent answer to Fusion — frontier ensemble without single-vendor risk.

7.94 avg

Moonshot's 2.8T flagship — 1M context, tuned for long-horizon agent work.

7.89 avg

The never-forgets agent — 1M context, open weights.

7.77 avg

Fugu's fast mini variant — single model, no panel, ~3 min per build.

7.75 avg

The reasoning king — deepest thinking, premium price.

7.51 avg · 47 tasks · 3 🥇 · 1 🥈

Kimi K2.7 Moonshot AI

The heavy lifter — frontier coder at flat-rate.

7.46 avg · 47 tasks · 1 🥇 · 2 🥈

Gemini 3.6 Flash Google

Google's launch-day Flash — faster, cheaper, fewer tokens.

7.08 avg · 50 tasks · 1 🥇 · 2 🥈

Claude Sonnet 5 Anthropic

The agentic SWE frontier — 82% SWE-bench Verified, Dev Team mode.

7.01 avg

Multilingual open-weights — strong on Chinese reasoning.

7.00 avg · 47 tasks · 0 🥇 · 0 🥈

Fugu Ultra 1.1 Sakana AI

Sakana's multi-agent orchestrator, v1.1 — routes experts per request.

6.94 avg · 24 tasks · 0 🥇 · 1 🥈

Inkling Thinking Machines

A 975B open-weights frontier model — yours to own and run.

6.07 avg

The open 1.6T MoE that builds — a frontier coder trained on non-Nvidia ASIC superpods.

8.12 avg

Tencent's open-weights coder — Apache-2.0, cheap, beats GLM-5.1 on frontend in Tencent's blind eval.

6.76 avg · 7 tasks · 0 🥇 · 0 🥈

DeepSeek V4 Flash DeepSeek

DeepSeek's cheap tier, retrained for agents — same size, sharper loops.

DeepSeek V4 Pro DeepSeek

DeepSeek's flagship tier — benched head-to-head against its own cheap Flash.

Kimi K2.7 · Fast Moonshot AI

Fast mode — top speed, minimal thinking.

Kimi K2.7 · No-Think Moonshot AI

Pure execution mode — no chain of thought.

Kimi K2.7 · Quality Moonshot AI

Quality mode — deepest thinking, best output.

Claude Mythos 5 Anthropic

Restricted-access flagship — vetted partners only.

Fable 5-class intelligence for ~59% less. The split-the-cost play.

How I pick which AI models to add to the bench

The criteria are simple: it has to be a frontier model — current generation, claimed as competitive with the others on this page — and it has to be runnable on Day Zero. If a vendor ships a model and I can dispatch the same fixed prompt set through it inside the Agent Operating System the same week, the bench gets a new column.

Recent additions: GLM-5.2 (Zhipu), Kimi K2.7 (Moonshot AI), Qwen 3.7 (Alibaba), Opus 4.8 (Anthropic). Currently unranked but on the bench: Grok.

What's not on this page

I don't bench every model that ships — I'm deliberately narrow. No GPT-3.5 or other generation-behind models. No tiny coder-tuned 7Bs unless they claim frontier coding capability. No vendor sandbox models that won't let me run my fixed prompts. The goal is a useful comparison across the 6–10 models a working operator would actually consider for an agent stack — not a leaderboard with 50 cells where most of them are decorative.

If you think I should add one, propose it inside the AI Profit Boardroom.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 4,000+ founders shipping with it every day all live inside the AI Profit Boardroom.

4,000+ founders · 258 documented wins · 38 countries · $59/mo monthly

Join AIPB · $59/mo →

Read the Agent OS guides →