Every head-to-head

AI model comparisons

Every pair of models on Goldie Bench compared head-to-head. 120 comparisons total · built from the same fixed one-shot prompts inside the Agent Operating System · scored 0–10 by me where I've published a head-to-head guide on agentos.guide. Click any pair to see the full side-by-side.

Filter by model

Click a model to show only comparisons it appears in. Click again to clear.

All 120 comparisons

Scored head-to-heads are listed first. Each card links to the full comparison page, with a side-by-side matrix and per-task verdicts.

Fusion vs MiniMax M3
42 shared tasks · 42 scored
Verdict: Fusion leads 26–8 with 8 ties
Grok vs Fusion
42 shared tasks · 38 scored
Verdict: Fusion leads 22–8 with 8 ties
Grok vs MiniMax M3
42 shared tasks · 38 scored
Verdict: Grok leads 19–6 with 13 ties
Fusion vs Kimi K2.7
42 shared tasks · 20 scored
Verdict: Fusion leads 14–3 with 3 ties
Grok vs Kimi K2.7
42 shared tasks · 20 scored
Verdict: Grok leads 11–2 with 7 ties
MiniMax M3 vs Kimi K2.7
42 shared tasks · 20 scored
Verdict: MiniMax M3 leads 10–7 with 3 ties
GLM-5.2 vs Grok
31 shared tasks · 13 scored
Verdict: GLM-5.2 leads 6–3 with 4 ties
GLM-5.2 vs Fusion
31 shared tasks · 13 scored
Verdict: Fusion leads 8–3 with 2 ties
GLM-5.2 vs MiniMax M3
31 shared tasks · 13 scored
Verdict: GLM-5.2 leads 8–3 with 2 ties
GLM-5.2 vs Kimi K2.7
31 shared tasks · 13 scored
Verdict: GLM-5.2 leads 8–3 with 2 ties
Opus 4.8 vs GLM-5.2
17 shared tasks · 13 scored
Verdict: Opus 4.8 leads 7–3 with 3 ties
Opus 4.8 vs Grok
17 shared tasks · 13 scored
Verdict: Opus 4.8 leads 8–1 with 4 ties
Opus 4.8 vs Fusion
17 shared tasks · 13 scored
Verdict: Fusion leads 4–2 with 7 ties
Opus 4.8 vs MiniMax M3
17 shared tasks · 13 scored
Verdict: Opus 4.8 leads 11–1 with 1 tie
Opus 4.8 vs Kimi K2.7
17 shared tasks · 13 scored
Verdict: Opus 4.8 leads 10–2 with 1 tie
Fusion vs Gemma-4 12B Coder
6 shared tasks · 6 scored
Verdict: Fusion leads 6–0
Grok vs Gemma-4 12B Coder
6 shared tasks · 6 scored
Verdict: Grok leads 6–0
MiniMax M3 vs Gemma-4 12B Coder
6 shared tasks · 6 scored
Verdict: MiniMax M3 leads 6–0
Kimi K2.7 vs Gemma-4 12B Coder
6 shared tasks · 5 scored
Verdict: Kimi K2.7 leads 4–1
Fugu Ultra vs Kimi K2.7
5 shared tasks · 5 scored
Verdict: Fugu Ultra leads 3–1 with 1 tie
Fusion vs Fugu Ultra
5 shared tasks · 5 scored
Verdict: tied 1–1 with 3 ties
Fusion vs Qwen 3.7
5 shared tasks · 5 scored
Verdict: Fusion leads 5–0
GLM-5.2 vs Fugu Ultra
5 shared tasks · 5 scored
Verdict: Fugu Ultra leads 3–1 with 1 tie
GLM-5.2 vs Qwen 3.7
5 shared tasks · 5 scored
Verdict: GLM-5.2 leads 3–0 with 2 ties
Grok vs Fugu Ultra
5 shared tasks · 5 scored
Verdict: Fugu Ultra leads 2–1 with 2 ties
Grok vs Qwen 3.7
5 shared tasks · 5 scored
Verdict: Grok leads 4–0 with 1 tie
MiniMax M3 vs Fugu Ultra
5 shared tasks · 5 scored
Verdict: Fugu Ultra leads 3–1 with 1 tie
MiniMax M3 vs Qwen 3.7
5 shared tasks · 5 scored
Verdict: MiniMax M3 leads 3–0 with 2 ties
Opus 4.8 vs Fugu Ultra
5 shared tasks · 5 scored
Verdict: Opus 4.8 leads 2–1 with 2 ties
Opus 4.8 vs Qwen 3.7
5 shared tasks · 5 scored
Verdict: Opus 4.8 leads 4–0 with 1 tie
Qwen 3.7 vs Kimi K2.7
5 shared tasks · 5 scored
Verdict: Qwen 3.7 leads 4–0 with 1 tie
GLM-5.2 vs Gemma-4 12B Coder
5 shared tasks · 4 scored
Verdict: GLM-5.2 leads 4–0
Opus 4.8 vs Gemma-4 12B Coder
4 shared tasks · 4 scored
Verdict: Opus 4.8 leads 4–0
Fugu Ultra vs Qwen 3.7
3 shared tasks · 3 scored
Verdict: Fugu Ultra leads 2–1
Fusion vs Fugu Mini
10 shared tasks · 2 scored
Verdict: Fusion leads 2–0
GLM-5.2 vs Fugu Mini
10 shared tasks · 2 scored
Verdict: GLM-5.2 leads 2–0
Grok vs Fugu Mini
10 shared tasks · 2 scored
Verdict: Grok leads 1–0 with 1 tie
Kimi K2.7 vs Fugu Mini
10 shared tasks · 2 scored
Verdict: Kimi K2.7 leads 1–0 with 1 tie
MiniMax M3 vs Fugu Mini
10 shared tasks · 2 scored
Verdict: tied 1–1
Opus 4.8 vs Fugu Mini
10 shared tasks · 2 scored
Verdict: Opus 4.8 leads 2–0
Fugu Ultra vs Gemma-4 12B Coder
2 shared tasks · 2 scored
Verdict: Fugu Ultra leads 2–0
Qwen 3.7 vs Gemma-4 12B Coder
2 shared tasks · 2 scored
Verdict: Qwen 3.7 leads 2–0
Qwen 3.7 vs Fugu Mini
5 shared tasks · 1 scored
Verdict: Qwen 3.7 leads 1–0
Fugu Mini vs Gemma-4 12B Coder
4 shared tasks · 1 scored
Verdict: Fugu Mini leads 1–0
Fugu Ultra vs Fugu Mini
4 shared tasks · 1 scored
Verdict: Fugu Ultra leads 1–0
Fusion vs Kimi K2.7 · Fast
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Fusion vs Kimi K2.7 · No-Think
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Fusion vs Kimi K2.7 · Quality
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Grok vs Kimi K2.7 · Fast
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Grok vs Kimi K2.7 · No-Think
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Grok vs Kimi K2.7 · Quality
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Kimi K2.7 vs Kimi K2.7 · Fast
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Kimi K2.7 vs Kimi K2.7 · No-Think
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Kimi K2.7 vs Kimi K2.7 · Quality
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Kimi K2.7 · Fast vs Kimi K2.7 · No-Think
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Kimi K2.7 · Fast vs Kimi K2.7 · Quality
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Kimi K2.7 · No-Think vs Kimi K2.7 · Quality
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
MiniMax M3 vs Kimi K2.7 · Fast
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
MiniMax M3 vs Kimi K2.7 · No-Think
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
MiniMax M3 vs Kimi K2.7 · Quality
3 shared tasks · 0 scored
3 shared tasks · no curated head-to-head yet
Fugu Mini vs Kimi K2.7 · Fast
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Fugu Mini vs Kimi K2.7 · No-Think
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Fugu Mini vs Kimi K2.7 · Quality
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
GLM-5.2 vs Kimi K2.7 · Fast
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
GLM-5.2 vs Kimi K2.7 · No-Think
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
GLM-5.2 vs Kimi K2.7 · Quality
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Gemma-4 12B Coder vs Kimi K2.7 · Fast
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Gemma-4 12B Coder vs Kimi K2.7 · No-Think
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Gemma-4 12B Coder vs Kimi K2.7 · Quality
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Opus 4.8 vs Kimi K2.7 · Fast
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Opus 4.8 vs Kimi K2.7 · No-Think
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Opus 4.8 vs Kimi K2.7 · Quality
2 shared tasks · 0 scored
2 shared tasks · no curated head-to-head yet
Fugu Ultra vs Kimi K2.7 · Fast
1 shared task · 0 scored
1 shared task · no curated head-to-head yet
Fugu Ultra vs Kimi K2.7 · No-Think
1 shared task · 0 scored
1 shared task · no curated head-to-head yet
Fugu Ultra vs Kimi K2.7 · Quality
1 shared task · 0 scored
1 shared task · no curated head-to-head yet
Claude Fable 5 vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Claude Fable 5 vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Claude Mythos 5 vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fugu Mini vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fugu Mini vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fugu Mini vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fugu Ultra vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fugu Ultra vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fugu Ultra vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fusion vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fusion vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Fusion vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
GLM-5.2 vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
GLM-5.2 vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
GLM-5.2 vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Gemma-4 12B Coder vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Gemma-4 12B Coder vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Gemma-4 12B Coder vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Grok vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Grok vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Grok vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · Fast vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · Fast vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · Fast vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · No-Think vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · No-Think vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · No-Think vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · Quality vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · Quality vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Kimi K2.7 · Quality vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
MiniMax M3 vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
MiniMax M3 vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
MiniMax M3 vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Opus 4.8 vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Opus 4.8 vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Opus 4.8 vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Qwen 3.7 vs Kimi K2.7 · Fast
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Qwen 3.7 vs Kimi K2.7 · No-Think
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Qwen 3.7 vs Kimi K2.7 · Quality
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Qwen 3.7 vs Claude Fable 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Qwen 3.7 vs Claude Mythos 5
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs
Qwen 3.7 vs Kilo Code
0 shared tasks · 0 scored
Reference-only comparison — see model pages for specs

How to use this page

  • Looking for a specific model? Click its filter pill — the grid will narrow to just its comparisons.
  • Looking for an honest verdict? Pairs marked "Verdict: X leads N–M" have published scores. The rest are on the bench but not yet scored.
  • Expecting a new comparison? When I publish a new head-to-head guide on agentos.guide, that pair will jump to the top of this grid with its verdict.
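
For readers who want to reproduce a verdict line from per-task scores, here is a minimal sketch. It assumes a task counts as a win for whichever model has the higher 0–10 score and as a tie when the scores are equal; the page does not spell out that rule, so treat it as an assumption. The function name `verdict` and the `scored_tasks` format are hypothetical, not part of Goldie Bench.

```python
def verdict(name_a, name_b, scored_tasks):
    """Build a verdict line from per-task scores.

    scored_tasks is a list of (score_a, score_b) pairs, each score 0-10.
    ASSUMPTION: higher score wins the task, equal scores are a tie.
    """
    if not scored_tasks:
        return "no curated head-to-head yet"

    wins_a = sum(1 for a, b in scored_tasks if a > b)
    wins_b = sum(1 for a, b in scored_tasks if b > a)
    ties = len(scored_tasks) - wins_a - wins_b

    # Only mention ties when there are any, matching the cards on this page.
    tie_part = ""
    if ties:
        tie_part = f" with {ties} tie{'s' if ties != 1 else ''}"

    if wins_a == wins_b:
        return f"Verdict: tied {wins_a}–{wins_b}{tie_part}"

    # Put the leader's name and win count first.
    if wins_a > wins_b:
        leader, high, low = name_a, wins_a, wins_b
    else:
        leader, high, low = name_b, wins_b, wins_a
    return f"Verdict: {leader} leads {high}–{low}{tie_part}"


# Made-up scores, for illustration only.
print(verdict("Model A", "Model B", [(8, 6), (7, 7), (5, 9), (9, 4)]))
```

The wins and ties always add up to the number of scored tasks, which is a quick sanity check for any card on this page.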
The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$100k+/mo community MRR