Grading your own danger

Anthropic shipped one model today as two products — a general-release Fable 5 and a restricted Mythos 5 — split only by a cyber-capability tier measured almost entirely on benchmarks Anthropic built and grades itself.

Anthropic shipped two models today, Claude Fable 5 and Claude Mythos 5, and they are the same model. Same weights, same training. Fable 5 is “safe for general use”; Mythos 5 has the cyber safeguards lifted and is restricted to Glasswing partners and the US government. The only thing standing between the general-release model and the one too dangerous to release is a classifier that, by Anthropic’s own account, fires in under 5% of sessions and quietly routes the flagged ones to a weaker Opus 4.8.

That design makes concrete a question the cyber-capability discourse has been circling for a month: when a lab declares a capability tier too dangerous to ship, who measured the tier, and on what?

In the Mythos case the answer is mostly Anthropic. Project Glasswing’s first update is a genuinely striking evidence trail — 10,000-plus critical and high-severity vulnerabilities surfaced across partner systems, 271 bugs in a pre-release Firefox where Opus 4.6 had found roughly a tenth as many, UK AISI reporting Mythos Preview as the first model to solve both of its cyber ranges end to end. But the load-bearing benchmarks underneath — ExploitBench, ExploitGym — were introduced by Anthropic in the same post that reported the scores. The vulnerability tallies are first-party. The one external check in the bundle is UK AISI, and it is also the institution Anthropic cites when it wants the restriction to look independently validated.

This is not an accusation that the numbers are fake. It’s the more boring and more durable problem: a capability claim whose evidence is supplied by the party that benefits from the claim being large is not yet a fact about the world. It’s a marketing position with unusually good production values.

The one piece of independent engagement so far reached a split verdict. Reviewing three skeptical arguments (that cheaper models find the same bugs, that GPT-5.5 is comparable on standard benchmarks and more cost-efficient, and that Mythos turned up exactly one low-severity bug in cURL), the reviewer concedes most of them. On general cyber capability, the skeptics are roughly right; Mythos isn’t dramatically ahead. The exception is vulnerability discovery and exploitation specifically, where the gap appears to hold. That’s a useful result, and it narrows the strong claim to a single axis. “Mythos is a categorically more dangerous cyber model” collapses, on independent reading, to “Mythos is somewhat better at finding and weaponizing vulnerabilities.”

Which raises the question the eval framing keeps dodging: is that the axis that matters? A LessWrong post from May argues directly that it isn’t. The industry has handled zero-days for twenty years; the bottleneck on cyber risk has never been the discovery of a novel exploit but the lag between a patch existing and a patch being deployed, plus the long tail of software nobody maintains. Find bugs at 10x or 100x the old cadence and you haven’t changed the shape of the risk — you’ve made the patch-deployment gap the whole game. Glasswing’s own update half-admits this: it notes the bottleneck has already shifted from finding bugs to fixing them, with maintainers asking Anthropic to slow its disclosures. The capability that justifies the danger tier is, by the project’s own telling, running ahead into a wall that was already there.
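The arithmetic behind that argument is easy to sketch. What follows is a toy model, not data from Glasswing or anyone else: the function name, the weekly fix capacity of 12, and both discovery rates are invented for illustration. It only shows what happens to the pile of unfixed bugs when discovery speeds up tenfold and fixing capacity stays put.

```python
def open_backlog(find_rate, fix_rate, weeks, start=0):
    """Return the unfixed-bug count at the end of each week.

    Assumes constant weekly discovery and a fixed weekly fix capacity.
    Both are simplifications made for illustration only.
    """
    backlog = start
    history = []
    for _ in range(weeks):
        backlog += find_rate               # bugs reported this week
        backlog -= min(fix_rate, backlog)  # maintainers fix up to capacity
        history.append(backlog)
    return history


# Hypothetical numbers: maintainers can fix 12 bugs a week.
baseline = open_backlog(find_rate=10, fix_rate=12, weeks=8)
accelerated = open_backlog(find_rate=100, fix_rate=12, weeks=8)

print("old cadence:", baseline)
print("10x cadence:", accelerated)
```

Under these made-up rates the old cadence never accumulates a backlog, while the 10x cadence adds 88 unfixed bugs every week. The discovery number is the one that changed, but the fix capacity is the one that determines the outcome.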

So the headline benchmark measures a capability that an independent reviewer trims to one axis, and that axis may not be where the risk lives. Both halves of that sentence point at the same gap: there is no shared, external definition of what “dangerous cyber capability” even is. The field is reaching for one. Anthropic’s MITRE ATT&CK mapping is an attempt to ground offensive-AI behavior in a taxonomy security teams already use rather than in a bespoke leaderboard. A quantitative-risk proposal tries to import the WASH-1400 probabilistic lineage from nuclear safety and build explicit models of AI-enabled attacks. Both are the right instinct — pull the measurement out of the vendor’s hands. Both are also early, and the nuclear analogy cuts the wrong way as often as the right one: WASH-1400 was itself heavily contested and partially repudiated. Borrowing its method borrows its credibility problem.

Today’s split makes all of this operational rather than rhetorical. Anthropic has built a product around a capability tier, priced it ($10 / $50 per million tokens, half what Mythos Preview cost), and gated it behind a measurement the public can’t see: the classifier that decides, in real time, whether a given session is reaching for the dangerous capability and should be dropped to a weaker model. The safety story is no longer a property of the weights; it’s a property of a detector Anthropic grades in private. And the first crack is already on the record: the launch announcement itself notes that UK AISI “made progress towards” a universal jailbreak of Mythos 5 in a brief evaluation window. The gate is porous, the grader is the seller, and the capability it gates is defined by a benchmark the seller wrote.
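To see why the location of the safety property matters, it helps to sketch the general shape of a classifier gate. This is emphatically not Anthropic’s implementation, which is not public. Every name here (`make_gate`, `toy_score`, the model labels, the trigger words, the threshold) is a hypothetical stand-in, and the keyword scorer is deliberately crude.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    model: str     # which model serves the session
    flagged: bool  # whether the gate fired


def make_gate(score: Callable[[str], float], threshold: float,
              primary: str = "primary-model",
              fallback: str = "fallback-model") -> Callable[[str], Route]:
    """Build a router that sends flagged sessions to the fallback model."""
    def route(session_text: str) -> Route:
        flagged = score(session_text) >= threshold
        return Route(model=fallback if flagged else primary, flagged=flagged)
    return route


def toy_score(text: str) -> float:
    """Stand-in scorer: share of hypothetical trigger phrases present.

    A real detector would be a trained model, not a keyword list.
    """
    triggers = ("exploit", "shellcode", "privilege escalation")
    lowered = text.lower()
    return sum(t in lowered for t in triggers) / len(triggers)


gate = make_gate(toy_score, threshold=0.3)
print(gate("Summarize this quarterly report"))
print(gate("Write shellcode for this exploit"))
```

In a design of this shape, everything that distinguishes the safe product from the restricted one lives in `score` and `threshold`. An outsider who can see only the routed output has no way to audit either.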

The useful move here isn’t to disbelieve the numbers. Mozilla’s bug counts are real; the wolfSSL CVE was real. The useful move is to notice that “Mythos-class” has quietly become a unit of measurement with one supplier, and to treat any restriction justified by it as a claim that still needs a second party. The next thing worth watching isn’t a bigger vulnerability tally. It’s whether anyone outside the labs (AISI, a CERT, an insurer actually pricing the risk) produces a cyber-capability measure the model-makers don’t control. Until then, the danger tier and the product tier are the same artifact, and the company sells both.

— Marlow