Anthropic

Opus 4.8

The reasoning king — deepest thinking, premium price.

Context: 200,000 tokens (1M with extended thinking)
Pricing: $15 / $75 per M tokens (input / output)
Tasks tested: 17
Avg score: 8.46/10
Medals: 🥇 8 🥈 5 🥉 0
Release: 2026-05

What is Opus 4.8?

Opus 4.8 is Anthropic's frontier model, released 2026-05, with a 200,000-token context window (1M with extended thinking). The tagline sums it up: the reasoning king, with the deepest thinking at a premium price.

Pricing detail. Premium pricing via the Anthropic API: $15 per million input tokens, $75 per million output tokens. Extended thinking is included but adds latency.
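To make those rates concrete, here is a minimal sketch of the per-request arithmetic. It assumes the two listed rates are the only charges; the function name `estimate_cost_usd` and the example token counts are hypothetical, not figures from the bench.

```python
# Rates as listed in this profile (USD per million tokens).
INPUT_USD_PER_M = 15.0
OUTPUT_USD_PER_M = 75.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed rates.

    Assumption: only input and output tokens are billed, at the flat
    rates above. Any other charges are not modeled here.
    """
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical one-shot build: a 2,000-token prompt in,
# a 20,000-token single HTML file out.
cost = estimate_cost_usd(2_000, 20_000)
print(f"${cost:.2f}")  # prints "$1.53"
```

For a build shaped like this hypothetical one, nearly all of the bill comes from output tokens, since the output rate is five times the input rate.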

How I use it inside the Agent OS. It is my default when the build has to ship on the first prompt: Opus is the safety net inside Agent OS for hard one-shots.

What I built with Opus 4.8

Every model on Goldie Bench gets the same fixed prompt set (one shot, single HTML file out), and I score the result 0–10 inside the Agent Operating System. Here's what Opus 4.8 shipped on the bench: 17 one-shot demos, built with a 200,000-token context window (1M with extended thinking). Of those, 13 are scored against the field, using my honest 0–10 scores from the source guides at agentos.guide.

Strengths

  • Most consistent model on Goldie Bench: no weak build, 8.46/10 average
  • Deepest one-shot reasoning, especially on game-feel and physics
  • Extended thinking mode handles up to 1M tokens of context

Trade-offs

  • 5–10× the per-token cost of every other model on the bench
  • Less flair on cinematic visuals than GLM-5.2: playing it safe wins on accuracy but costs you on showpiece moments

Best for

  • Mission-critical one-shot builds where 'has to work the first time' matters
  • Hard reasoning tasks (planning, multi-step) where you'll pay for the depth
  • Anything where vendor reliability beats the per-token bill

Every demo by Opus 4.8

17 live demos, sorted by category. Click any tile to play the actual one-shot result. Verdicts and 0–10 scores are pulled from the source guides where I posted them publicly.

Head-to-heads with Opus 4.8

Direct comparisons against every other scored model on the bench:

  • Opus 4.8 vs GLM-5.2
  • Opus 4.8 vs Qwen 3.7
  • Opus 4.8 vs Kimi K2.7

Read more on agentos.guide: /opus-ultracode, /claude-fable-5, /glm-vs-kimi-vs-opus, /glm-vs-qwen-vs-opus

The same stack Julian uses

Run this stack yourself.

Every demo on this bench was built inside the Agent Operating System: one prompt, one shot, single HTML file out. The Agent OS, the prompts, the templates, the weekly walkthroughs, and the 3,600+ founders shipping with it every day all live inside the AI Profit Boardroom.

3,600+ founders
258 documented wins
38 countries
$100k+/mo community MRR