3 models · strictly on-device · nothing leaves your Mac

Local AI models.

The local end of the bench: strictly on-device AI coding models. Every model on this page runs 100% on your own hardware via Ollama, with no API key, no per-token bill, no internet round-trip, and nothing leaving the machine. Openly licensed: MIT, Apache-2.0, or open weights. Same fixed prompt set as the cloud frontier bench. Honest 0–10 scores. Real live demos.

Strictly on-device only — free models hosted via cloud APIs (e.g. North Mini Code via OpenRouter) live on the main bench, not here. The line is whether inference happens on your machine.

The local leaderboard

Ranked by Julian's actual 0–10 average across all tasks scored. Models still benching show pending.

| Rank | Model | Runner | Size | License | Tasks | Avg / 10 | Medals |
|------|-------|--------|------|---------|-------|----------|--------|
| #1 | Gemma-4 12B Coder · Google (Gemma · open-weights, runs local) | Ollama | 7.4 GB | Apache-2.0 | 6 | 4.25/10 | 🥇0 🥈0 🥉0 |
| #2 | Ornith 1.0 · DeepReinforce (MIT-licensed agentic coder) | Ollama | 9.5 GB | MIT | 15 | pending | 🥇0 🥈0 🥉0 |
| #3 | Qwythos 9B · Richard Young · DeepNeuro (abliterated build of empero-ai's Qwythos, Qwen3.5 base) | Ollama | 5.6 GB | Open weights (Qwen3.5 base) | 0 | pending | 🥇0 🥈0 🥉0 |
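
For readers who want to reproduce the ranking rule on their own runs, here is a minimal sketch of "average of the 0–10 scores, pending if nothing is scored yet". Everything in it is an assumption for illustration: the `Entry` class, the `rank` and `render` helpers, the placeholder model names and the sample scores are hypothetical, and placing pending models last is a guess at the tie-break, not the page's actual implementation.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Entry:
    """One leaderboard row (hypothetical structure)."""
    model: str
    scores: List[float] = field(default_factory=list)  # one 0-10 score per scored task

    @property
    def average(self) -> Optional[float]:
        # No scored tasks yet means the model is still "pending".
        return round(mean(self.scores), 2) if self.scores else None


def rank(entries: List[Entry]) -> List[Entry]:
    # Scored models first, highest average on top; pending models go last.
    # sorted() is stable, so pending models keep their input order.
    return sorted(entries, key=lambda e: (e.average is None, -(e.average or 0.0)))


def render(entries: List[Entry]) -> List[str]:
    rows = []
    for position, entry in enumerate(rank(entries), start=1):
        avg = "pending" if entry.average is None else f"{entry.average:.2f}/10"
        rows.append(f"#{position} {entry.model}: {avg}")
    return rows


# Illustrative data only; these are not real bench scores.
sample = [
    Entry("model-a", [4.0, 4.5]),
    Entry("model-b"),
    Entry("model-c", [6.0, 7.0]),
]
for row in render(sample):
    print(row)
```

Keeping the average as `None` rather than `0` for unscored models avoids a pending model looking like it scored zero.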

Why a separate local leaderboard?

Frontier-cloud models (Opus, GLM, Kimi, Fusion, MiniMax) compete on raw capability: pay-per-token, hosted on someone else's GPU. The main bench shows the absolute ceiling.

Local models compete on a different axis: what's the best build I can ship if my whole stack is free, offline, and private? Different question. Different answer. Different shortlist.

A 9B model running on a consumer Mac will rarely beat a frontier 100B+ cloud model on a one-shot ambitious build. But a small, openly licensed coder you can run forever for $0 wins a different fight: the daily-driver fight. This page is the honest scoreboard for that fight.

How I run them — the stack

  • Ollama on macOS for the on-device models (Ornith 1.0, Gemma-4 12B Coder, Qwythos). Q4_K_M or Q8_0 GGUF builds, 5–10 GB each.
  • Hermes profiles wire each model into the Agent OS so it's dispatchable from the kanban with the same workflow as the cloud models — ~/.hermes/profiles/ornith/, ~/.hermes/profiles/north-mini/.
  • Same fixed prompt set as the frontier-cloud bench — solar system, raycaster, Skyrim open world, Web-OS desktop, the lot. No special-casing for "weaker" models.
  • Same scoring rubric: a Claude judge scores the output of the same one-shot prompts every other model ran. No grading on a curve.
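
To make the one-prompt, one-shot flow concrete, here is a rough sketch of sending a single prompt to a locally running Ollama server over its HTTP API and getting the full reply back in one piece. It assumes Ollama is listening on its default port (11434). The helper names `build_request` and `run_one_shot`, the model tag `my-local-coder` and the output filename are hypothetical, and this is not the Hermes or Agent OS wiring, which the page does not show.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> dict:
    # stream=False asks Ollama for one complete JSON reply instead of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def run_one_shot(model: str, prompt: str, url: str = OLLAMA_URL, timeout: int = 600) -> str:
    """Send one prompt to the local server and return the generated text."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["response"]


# Example usage (needs a running Ollama server and a pulled model;
# "my-local-coder" is a placeholder tag, not a real model name):
#
#   html = run_one_shot("my-local-coder", "Build a solar system in a single HTML file.")
#   with open("demo.html", "w", encoding="utf-8") as f:
#       f.write(html)
```

The generous timeout is deliberate: a 5–10 GB quantized model on a laptop can take minutes to produce a long single-file build.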

The full setup lives in the Agent Operating System inside the AI Profit Boardroom.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$59/mo (monthly)