Q: What's the prompt for the Voxel test?

Voxel — voxel-art landscape (Minecraft-style). Every model receives this exact prompt, one shot, single HTML file out.

Q: What's the best AI model for Voxel?

Fugu Ultra — Ultra v2 — Temple-Run voxel runner. Smoke-test PASS with 32.7% pixel diff — the single most reactive build. Replaces the earlier truncated voxel-fugu that was deleted.

Q: How many AI models attempted Voxel?

23 models on Goldie Bench have attempted Voxel: Claude Fable 5, Fugu Ultra, Fugu Mini, Fusion, Gemini 3.6 Flash, GLM-5.2, GPT-5.6 Sol, Grok, Inkling, Kimi K2.7, Kimi K3, MiniMax M3, Hermes MoA, Opus 4.8, Claude Opus 5, Qwen 3.8, Qwen 3.7, Claude Sonnet 5, Kimi K2.7 · Fast, Kimi K2.7 · No-Think, Kimi K2.7 · Quality, DeepSeek V4 Pro, DeepSeek V4 Flash.

Voxel — voxel-art landscape (Minecraft-style).

Category: Visual

Models tested: 23

Scored: 18/23

Avg score: 7.71/10

Winner: Fugu Ultra

What I asked each model — the Voxel prompt

Every model on this page got this exact prompt inside the Agent Operating System: Voxel — voxel-art landscape (Minecraft-style).

Single HTML file out. No iteration. No examples in the system prompt. Whatever each model produced on the first run is what's on this page. 23 frontier models have attempted it so far: Claude Fable 5, Fugu Ultra, Fugu Mini, Fusion, Gemini 3.6 Flash, GLM-5.2, GPT-5.6 Sol, Grok, Inkling, Kimi K2.7, Kimi K3, MiniMax M3, Hermes MoA, Opus 4.8, Claude Opus 5, Qwen 3.8, Qwen 3.7, Claude Sonnet 5, Kimi K2.7 · Fast, Kimi K2.7 · No-Think, Kimi K2.7 · Quality, DeepSeek V4 Pro, DeepSeek V4 Flash.

Why this task matters. Voxel is a textbook test of visual-class capability, the kind of build that exposes whether a model is pattern-matching or actually reasoning. Shipping it cleanly is the floor I expect from a frontier model, and every model on the leaderboard should at least attempt it.

How each model handled Voxel

Each entry carries my 0–10 score from the source comparison guides on agentos.guide. Click any entry to play the actual one-shot HTML the model produced.

Claude Fable 5 Anthropic

• 8.5/10

What I saw: Iterated rebuild holds its excellent standard: a gorgeous floating voxel island with grass/stone/snow biomes, trees, ponds, drifting clouds and colorful flying birds, a day/night toggle and a bird counter. Orbit/zoom/glide controls respond (verified). Terrain reads a touch flat, keeping it just shy of the very top.

▶ Play Claude Fable 5's attempt →

Fugu Ultra Sakana AI

🥇 9.0/10 · winner · voxel runner

What I saw: Ultra v2 — Temple-Run voxel runner. Smoke-test PASS with 32.7% pixel diff — the single most reactive build. Replaces the earlier truncated voxel-fugu that was deleted.

▶ Play Fugu Ultra's attempt →

Fugu Mini Sakana AI

• 8.5/10

What I saw: Mini gap-fill (round 2) — Temple-Run voxel runner. Smoke-test PASS with 17.0% pixel diff — works this time (the earlier Mini voxel was STATIC and got deleted).

▶ Play Fugu Mini's attempt →
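
The "pixel diff" figures quoted for the two Fugu builds (32.7% and 17.0%) suggest a smoke test that compares a frame captured before input with one captured after, then checks how much of the screen changed. The page does not describe the actual harness, so what follows is only a minimal sketch under that assumption. The function names, the `tolerance` parameter, and the `min_diff_percent` threshold are all hypothetical.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel 0-255


def pixel_diff_percent(
    before: Sequence[Pixel], after: Sequence[Pixel], tolerance: int = 0
) -> float:
    """Return the percentage of pixels that differ between two frames.

    A pixel counts as changed when any channel differs by more than
    `tolerance` (an assumed knob, useful for ignoring rendering noise).
    """
    if len(before) != len(after):
        raise ValueError("frames must have the same number of pixels")
    if not before:
        return 0.0
    changed = sum(
        1
        for a, b in zip(before, after)
        if any(abs(ca - cb) > tolerance for ca, cb in zip(a, b))
    )
    return 100.0 * changed / len(before)


def smoke_test(
    before: Sequence[Pixel], after: Sequence[Pixel], min_diff_percent: float = 1.0
) -> bool:
    """PASS (True) when enough of the frame changed; the threshold is assumed."""
    return pixel_diff_percent(before, after) >= min_diff_percent


# Tiny 2x2 example: one of four pixels changes after input.
frame_a = [(0, 0, 0)] * 4
frame_b = [(0, 0, 0), (255, 0, 0), (0, 0, 0), (0, 0, 0)]
print(pixel_diff_percent(frame_a, frame_b))  # 25.0
print(smoke_test(frame_a, frame_a))          # False: a static build fails
```

Under this reading, a static build such as the deleted earlier Mini attempt would score 0% and fail, while a build that redraws in response to input clears the threshold.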

Fusion OpenRouter

🥇 9.0/10 · best in class · full game

What I saw: Closest thing to a real Temple Run any model has shipped: 3-lane runner with chunk streaming, jump + slide mechanics, coins, hurdles, gates, increasing speed, score/coins/speed/best HUD pills, touch-swipe support, gradient-text overlay card. Other voxel attempts were visuals only — this one's playable front to back.

▶ Play Fusion's attempt →
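
Chunk streaming, as credited to the Fusion build above, usually means keeping only a short window of track segments alive around the player. Here is a minimal sketch of that idea, not Fusion's actual code. `ChunkStream`, `chunk_length`, and `chunks_ahead` are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Deque, List


class ChunkStream:
    """Keeps a sliding window of track chunk indices around the player."""

    def __init__(self, chunk_length: float = 20.0, chunks_ahead: int = 4) -> None:
        self.chunk_length = chunk_length  # world units per chunk (assumed)
        self.chunks_ahead = chunks_ahead  # how far ahead to pre-generate
        self.active: Deque[int] = deque()

    def update(self, player_z: float) -> List[int]:
        """Drop chunks behind the player, add chunks ahead, return the window."""
        current = int(player_z // self.chunk_length)
        # Recycle chunks the player has already passed.
        while self.active and self.active[0] < current:
            self.active.popleft()
        # Extend the window up to `chunks_ahead` chunks past the player.
        next_index = self.active[-1] + 1 if self.active else current
        while next_index <= current + self.chunks_ahead:
            self.active.append(next_index)
            next_index += 1
        return list(self.active)


stream = ChunkStream()
print(stream.update(0.0))   # [0, 1, 2, 3, 4]
print(stream.update(45.0))  # [2, 3, 4, 5, 6]
```

In a real runner each index would map to a generated segment of obstacles and coins; the window keeps memory and draw cost flat no matter how far the player runs.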

Gemini 3.6 Flash Google

• 4.5/10

What I saw: Strong polished UI and clean voxel intent, but the render is broken: the landscape is a flat cyan-water slab with no visible terrain, grass, or trees — only floating white cloud/snow blocks, so it reads as a washed-out empty plane rather than a Minecraft-style landscape.

▶ Play Gemini 3.6 Flash's attempt →

GLM-5.2 Zhipu / Z.ai

🥇 9.0/10 · winner · flair

What I saw: GLM built the densest, most detailed city — windowed skyscrapers, a speed + coins HUD. Opus ran the furthest with the cleanest motion (Score 303). Kimi's runner plays fine but is unforgiving — it crashes within seconds.

▶ Play GLM-5.2's attempt →

GPT-5.6 Sol OpenAI

• 3.5/10

What I saw: The HUD, water plane, and clouds render cleanly with polished branding, but the entire voxel island terrain is rendered as an unlit black silhouette — likely a lighting/vertexColor/instanceColor failure — which guts the core Minecraft-style landscape brief.

▶ Play GPT-5.6 Sol's attempt →

Grok xAI

• 8.5/10

What I saw: A colourful 3D voxel city with a score and coins HUD and a polished game-over card. Like every runner it ends fast — but the build is excellent.

▶ Play Grok's attempt →

Inkling Thinking Machines

• 7.2/10

What I saw: Renders a clean 3D voxel terrain with decent lighting, shadows, and polished UI overlay, but the random per-cube color scattering reads as noisy rather than coherent Minecraft-style biomes (grass/dirt/stone layers), and the drag-rotate control is disconnected from the actual auto-orbiting camera—solid but generic.

▶ Play Inkling's attempt →
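
The biome critique above contrasts random per-cube colors with layers chosen by position. Below is a minimal sketch of the layered approach, assuming a simple height-based rule. The block names and the `snow_line` and `dirt_depth` values are illustrative and not taken from any model's output.

```python
from typing import List


def block_type(y: int, surface_y: int, snow_line: int = 12, dirt_depth: int = 3) -> str:
    """Pick a block for height `y` in a column whose top block sits at `surface_y`."""
    if y > surface_y:
        return "air"
    if y == surface_y:
        # High columns get a snow cap, everything else gets grass.
        return "snow" if surface_y >= snow_line else "grass"
    if y >= surface_y - dirt_depth:
        return "dirt"  # a few dirt layers directly under the surface
    return "stone"     # everything deeper is stone


def build_column(surface_y: int, snow_line: int = 12, dirt_depth: int = 3) -> List[str]:
    """Return the blocks of one terrain column from y=0 up to the surface."""
    return [block_type(y, surface_y, snow_line, dirt_depth) for y in range(surface_y + 1)]


print(build_column(5))
# ['stone', 'stone', 'dirt', 'dirt', 'dirt', 'grass']
```

Because the block depends only on height relative to the surface, neighboring columns share the same banding, which is what makes the terrain read as biomes rather than noise.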

Kimi K2.7 Moonshot AI

• 6.0/10

▶ Play Kimi K2.7's attempt →

The winner on Voxel

Fugu Ultra took gold on this task (winner · voxel runner).

See Fugu Ultra's full model card: /models/fugu.

Every attempt — live, playable

Side by side. Click any tile to run that model's actual one-shot HTML in a new tab.

How I scored Voxel — methodology

Related

Run this stack yourself.