AI model comparisons
Every pair of models on Goldie Bench compared head-to-head. 120 comparisons total · built from the same fixed one-shot prompts inside the Agent Operating System · scored 0–10 by me where I've published a head-to-head guide on agentos.guide. Click any pair to see the full side-by-side.
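The 120 total follows from the pair count: n models give n(n-1)/2 unordered pairs, so 120 comparisons corresponds to 16 models, assuming every pair really is included. Here is a minimal Python sketch of that arithmetic. The model names are placeholders, since the actual model list is not given in this text.

```python
from itertools import combinations
from math import comb

# Placeholder names (assumption): the real models on the bench are not listed here.
models = [f"model-{i}" for i in range(1, 17)]

# Every unordered pair of distinct models, i.e. one head-to-head per pair.
pairs = list(combinations(models, 2))

# 16 models -> 16 * 15 / 2 = 120 head-to-head comparisons.
assert len(pairs) == comb(len(models), 2) == 120
```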
Filter by model
Click a model to show only comparisons it appears in. Click again to clear.
All 120 comparisons
Scored head-to-heads are listed first. Each card links to the full comparison page with a side-by-side matrix and per-task verdicts.
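The filter and ordering described above can be sketched roughly as follows. This is an illustration under assumptions, not the page's actual code: the `Comparison` type, the `visible_cards` and `toggle` helpers, the single-selection behavior, and the sample cards are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comparison:
    model_a: str
    model_b: str
    verdict: Optional[str] = None  # e.g. "X leads N-M"; None while unscored


def toggle(selected: Optional[str], clicked: str) -> Optional[str]:
    """Clicking the active filter clears it; clicking another model selects it."""
    return None if selected == clicked else clicked


def visible_cards(comparisons, selected=None):
    """Keep only cards featuring the selected model, scored cards first."""
    if selected is not None:
        comparisons = [c for c in comparisons if selected in (c.model_a, c.model_b)]
    # sorted() is stable, so cards keep their relative order within each group.
    return sorted(comparisons, key=lambda c: c.verdict is None)


# Invented sample data, for illustration only.
cards = [
    Comparison("alpha", "beta"),
    Comparison("alpha", "gamma", verdict="alpha leads 7-5"),
    Comparison("beta", "gamma"),
]
```

With no filter active, `visible_cards(cards)` puts the scored alpha/gamma card first. Selecting `"beta"` narrows the grid to the two cards that include it, and clicking `"beta"` again returns `None` from `toggle`, which restores the full grid.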
How to use this page
- Looking for a specific model? Click its filter pill — the grid will narrow to just its comparisons.
- Looking for an honest verdict? Pairs marked with "Verdict: X leads N–M" have published scores. The others are on the bench but not yet scored.
- Think a new comparison should be here? When I publish a new head-to-head guide on agentos.guide, that pair will jump to the top of this grid with its verdict.
Run this stack yourself.
Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.