3 models · strictly on-device · nothing leaves your Mac

Local AI models.

The local end of the bench: strictly on-device AI coding models. Every model on this page runs 100% on your own hardware via Ollama, with no API key, no per-token bill, no internet round-trip, and nothing leaving the machine. Weights are released under MIT, Apache-2.0, or open-weights terms. Same fixed prompt set as the cloud frontier bench. Honest 0–10 scores. Real live demos.

Strictly on-device only — free models hosted via cloud APIs (e.g. North Mini Code via OpenRouter) live on the main bench, not here. The line is whether inference happens on your machine.

The local leaderboard

Ranked by Julian's actual 0–10 average across all tasks scored. Models still benching show pending.

Rank	Model	Vendor / notes	Runner	Size	License	Tasks	Avg / 10	Medals
#1	Gemma-4 12B Coder	Google (Gemma · open-weights, runs local)	Ollama	7.4 GB	Apache-2.0	6	4.25/10	🥇0 🥈0 🥉0
#2	Ornith 1.0	DeepReinforce (MIT-licensed agentic coder)	Ollama	9.5 GB	MIT	15	pending	🥇0 🥈0 🥉0
#3	Qwythos 9B	Richard Young · DeepNeuro (abliterated build of empero-ai's Qwythos, Qwen3.5 base)	Ollama	5.6 GB	Open weights (Qwen3.5 base)	0	pending	🥇0 🥈0 🥉0
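
The ranking rule described above can be sketched in a few lines of Python. This is an illustration only, not the bench's actual code: the `Entry` type and `rank` function are hypothetical names, and ordering pending models by tasks run so far is an assumption inferred from the table, not a stated rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    model: str
    tasks: int
    avg: Optional[float]  # None stands for "pending" (no published average yet)


def rank(entries: List[Entry]) -> List[Entry]:
    # Scored models come first, highest average first.
    # Pending models follow, ordered by tasks run so far (assumed tie-break).
    return sorted(entries, key=lambda e: (e.avg is None, -(e.avg or 0.0), -e.tasks))


# Values copied from the leaderboard table.
board = [
    Entry("Qwythos 9B", 0, None),
    Entry("Gemma-4 12B Coder", 6, 4.25),
    Entry("Ornith 1.0", 15, None),
]

for position, entry in enumerate(rank(board), start=1):
    score = "pending" if entry.avg is None else f"{entry.avg:.2f}/10"
    print(f"#{position} {entry.model}: {score}")
```

With the three rows above, this reproduces the order shown in the table: the one scored model first, then the two pending models.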

Why a separate local leaderboard?

Frontier-cloud models (Opus, GLM, Kimi, Fusion, MiniMax) compete on raw capability: pay-per-token, hosted on someone else's GPU. The main bench shows the absolute ceiling.

Local models compete on a different axis: what's the best build I can ship if my whole stack is free, offline, and private? Different question. Different answer. Different shortlist.

A 9B model running on a consumer Mac will rarely beat a frontier 100B+ cloud model on a one-shot ambitious build. But a 9B Apache-licensed coder you can run forever for $0 — that wins a different fight: the daily-driver fight. This page is the honest scoreboard for that fight.

Every local model on the bench

Click any card for the full model review, every demo, the vendor link, and head-to-head comparisons.

Gemma-4 12B Coder Google (Gemma · open-weights, runs local)

The free, offline coder — trained only on code that passed its tests.

4.25 avg · 6 tasks · 0 🥇 · 0 🥈

Ornith 1.0 DeepReinforce (MIT-licensed agentic coder)

Local agentic coder that learned to write its own task harness — the small one for daily work.

Qwythos 9B Richard Young · DeepNeuro (abliterated build of empero-ai's Qwythos, Qwen3.5 base)

A Claude-style creative & reasoning 9B with a full 1M-token context — the local writer & thinker.

How I run them — the stack

Ollama on macOS for the on-device models (Ornith 1.0, Gemma-4 12B Coder, Qwythos). Q4_K_M or Q8_0 GGUF builds, 5–10 GB each.
Hermes profiles wire each model into the Agent OS so it's dispatchable from the kanban with the same workflow as the cloud models — ~/.hermes/profiles/ornith/, ~/.hermes/profiles/north-mini/.
Same fixed prompt set as the frontier-cloud bench — solar system, raycaster, Skyrim open world, Web-OS desktop, the lot. No special-casing for "weaker" models.
Same scoring rubric — Claude judge against the same one-shot prompts every other model ran. No grading on a curve.
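
For readers who want to try the one-shot flow against a local model, here is a minimal sketch that sends a single prompt to Ollama's local HTTP endpoint (`/api/generate` on the default port 11434). The function names and the model tag in the usage comment are hypothetical; the real tags depend on how each model was pulled or imported, and this is not the Hermes or Agent OS wiring itself.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model_tag: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally served model."""
    payload = {"model": model_tag, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_one_shot(model_tag: str, prompt: str, timeout: float = 600.0) -> str:
    """Send one prompt and return the model's full text response."""
    request = build_request(model_tag, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example usage (requires a running Ollama server; the tag is a placeholder):
# html = run_one_shot("ornith:latest", "Build a solar system demo in one HTML file.")
# open("demo.html", "w", encoding="utf-8").write(html)
```

Keeping request construction separate from the network call makes it easy to swap the model tag per run while holding the prompt fixed, which is the point of a same-prompt bench.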

The full setup lives in the Agent Operating System inside the AI Profit Boardroom.

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System — one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs and 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders · 258 documented wins · 38 countries · $59/mo (monthly)

Join AIPB · $59/mo → Read the Agent OS guides →