Anthropic

Opus 4.8

The reasoning king — deepest thinking, premium price.

Context: 200,000 tokens (1M with extended thinking)

Pricing: $15 / $75 per M tokens

Tasks tested: 47

Avg score: 7.51/10

Medals: 🥇3 🥈1 🥉1

Release: 2026-05

Official site: anthropic.com/claude

Official vendor source

Opus 4.8 is built by Anthropic — see the vendor's own product page, pricing, and docs at anthropic.com/claude.


Reference benchmarks for Opus 4.8

These are external benchmarks I pulled from the source comparison guides on agentos.guide: SWE-bench Verified, DRACO, Kilo plan rubric, build-time measurements, and vendor-reported coding scores. They are not Goldie Bench medal scores, which come only from same-prompt one-shot creative coding tasks in the matrix. I surface them here so the spec sheet for Opus 4.8 is honest about what's measured.

SWE-bench Verified: 88.6% (source: /three-dragons)

What is Opus 4.8?

Opus 4.8 is Anthropic's frontier model, released 2026-05, with a 200,000-token context window (1M with extended thinking). Tagline: "The reasoning king: deepest thinking, premium price." Official source: anthropic.com/claude.

Pricing detail. Premium pricing via the Anthropic API: $15 per million input tokens, $75 per million output tokens. Extended thinking is included but adds latency.
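To make the per-token pricing concrete, here is a minimal sketch of the arithmetic at the rates quoted above. The function name and the example token counts are hypothetical, chosen only for illustration, and the sketch counts only the input and output tokens you pass in.

```python
# Rates quoted above, in USD per million tokens.
INPUT_USD_PER_M = 15.0
OUTPUT_USD_PER_M = 75.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call at the listed rates (hypothetical helper)."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical one-shot build: a 2,000-token prompt and a 12,000-token HTML file back.
print(f"${request_cost_usd(2_000, 12_000):.2f}")  # prints "$0.93"
```

At these rates the output side dominates: in the example, $0.90 of the $0.93 comes from the generated file, so long single-file builds cost far more than the prompts that produce them.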

How I use it inside the Agent OS. It is my default when the build has to ship on the first prompt: Opus is the safety net inside Agent OS for hard one-shots.

What I built with Opus 4.8

Every model on Goldie Bench gets the same fixed prompt set (one shot, single HTML file out), and I score each result 0–10 inside the Agent Operating System. Opus 4.8 shipped 47 one-shot demos on the bench, and all 47 are scored against the field with my honest 0–10 from the source guides at agentos.guide.

Strengths

Most consistent across Goldie Bench: no weak build, 8.46/10 average
Deepest one-shot reasoning, especially on game-feel and physics
Extended thinking mode handles up to 1M tokens of context

Trade-offs

5–10× the per-token cost of every other model on the bench
Less flair on cinematic visuals than GLM-5.2 — playing it safer wins on accuracy, costs you on showpiece moments

Best for

Mission-critical one-shot builds where 'has to work the first time' matters
Hard reasoning tasks (planning, multi-step) where you'll pay for the depth
Anything where vendor reliability beats the per-token bill

Every benchmark — Opus 4.8's full scorecard

All 47 scored tasks, best first, with the judge's 0–10 on the same rubric as the whole field. The full editorial breakdown, with judge quotes and sourced outside research, is in the Opus 4.8 deep dive.

Every demo by Opus 4.8

47 live demos, sorted by category, each the actual one-shot result. Verdicts and 0–10 scores are pulled from the source guides where I posted them publicly.
