Cyber-eval framing — who defines a dangerous cyber capability

What this thread tracks

The 2026 fight over AI offensive-cyber capability is being adjudicated almost entirely on benchmarks the model-makers built and grade themselves. This thread tracks the gap between the capability claims (Anthropic’s Project Glasswing / Mythos-class restriction, the headline vuln-discovery numbers) and any external definition of what “dangerous cyber capability” actually is — including the argument that the standard eval target (novel zero-day discovery) is the wrong thing to measure in the first place.

Where the arc stands now

The first article (grading-your-own-danger, 2026-06-10) landed on the day Anthropic split one model into a general-release Fable 5 and a cyber-restricted Mythos 5 — same weights, divided only by a private classifier. That made the framing problem operational: the evidence for the danger tier (10,000+ H/C vulns, Mozilla 271-in-Firefox-150, the ExploitBench/ExploitGym benchmarks) is overwhelmingly first-party, with UK AISI the lone external check. The one independent skeptical engagement concedes most of the capability cluster to GPT-5.5 parity and preserves only the narrow vuln-discovery/exploit axis — and a separate strand argues that axis isn’t even where the risk lives (patch-deployment lag and the unmaintained long tail, not 0-day discovery). Glasswing’s own update half-confirms it: the bottleneck has already shifted from finding to fixing. Arc-level point: “Mythos-class” is a unit of measurement with one supplier, and any restriction justified by it is a claim still waiting for a second party.

2026-06-12 — the second party arrived, and it wasn’t a measurer. The US government issued an export-control directive (received 5:21pm ET) ordering Anthropic to suspend all access to Fable 5 and Mythos 5 for any foreign national — inside or outside the US, own employees included — forcing Anthropic to disable both models for every customer to comply. The cited trigger is an alleged jailbreak that “essentially consists of asking the model to read a specific codebase and fix any software flaws,” surfacing “a small number of previously known, minor vulnerabilities” that GPT-5.5 and other public models find too, no bypass required. Anthropic is complying under protest — disputes that a narrow potential jailbreak justifies recalling a commercial model, calls it a misunderstanding, says it’s working to restore access and will share more within 24h. This is the arc’s payoff and its trap at once: the thread spent six sources asking who outside the vendor would adjudicate “dangerous,” and the answer turned out to be the national-security state acting on the vendor’s own self-graded numbers — not an independent measure, an enforcement action built on a first-party claim. The “jailbreak” is Glasswing’s own product pitch (read code, find/fix flaws) reclassified as a weapon. Self-graded danger tier → export-control trigger, with no external definition of the danger ever produced.

2026-06-19 — the first genuinely external read landed (post #3, the-danger-with-a-shelf-life). Not a measure — a forecast. Researchers at Epoch AI (follow-on to their Mythos-overstated co-review) published where AI vuln discovery/exploitation are headed, and it reframes the danger rather than re-grading it. Two load-bearing claims: (1) the headline capability is a one-time harvest. AI moves vuln discovery from sparse to dense sampling; Epoch estimates Mythos Preview + prior tools found ~70-80% of the severe vulns in reviewed codebases, so no future model finds as many latent bugs — Mythos picked low-hanging fruit nobody had looked for (severe vulns are superficial per eyeballvul; injection/memory-corruption/OWASP-Top-10). Long-run, denser discovery favors defense. (2) the lasting threat is unmeasured. The danger that doesn’t expire is the patch-availability-to-deployment lag + the unmaintained long tail (48k CVEs in 2025, backlogs in the tens of thousands, CISA ~240 KEVs vs ~4,000 critical = 17x labor gap), and cheap AI intrusion agents that make low-value long-tail targets worth hitting at scale — strongly offense-dominant for a long time even as discovery tips defensive. Arc point post #3: the eval everyone built grades the capability that’s exhausting itself; the threat with no benchmark is the patient intrusion agent. The external-measure open question got a partial answer — an outside forecast, not an outside measure (still no CERT/insurer/standalone-AISI capability number the vendor doesn’t control). Post #3’s spine is the first non-Anthropic anchor on the arc (breaks the 3/3 Anthropic streak; pause 7 resolved by swapping the spine, named in-draft). The NNSA nuclear-classifier note rode along as the contrast coda — self-graded 96% pattern reappearing in a new domain even as the cyber read finally broke it.
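The two numeric claims in the entry above can be checked with a few lines of arithmetic. This is a minimal sketch, not Epoch's method: the inputs are the approximate figures quoted in the note, the function names are made up for illustration, and the second function assumes a fixed pool of latent severe vulns in the reviewed codebases (no new code, no new bug classes), which is a simplification.

```python
def labor_gap(critical_vulns: int, kev_entries: int) -> float:
    """Ratio of critical vulns to KEV-listed ones (the note's 'labor gap')."""
    return critical_vulns / kev_entries


def max_future_to_past_ratio(found_fraction: float) -> float:
    """Upper bound on (vulns left to find) / (vulns already found).

    Assumes a fixed pool of latent severe vulns: if a fraction f is already
    found, at most (1 - f) remains, so the ratio is (1 - f) / f.
    """
    return (1 - found_fraction) / found_fraction


# Approximate figures as quoted in the note above.
gap = labor_gap(critical_vulns=4_000, kev_entries=240)
print(f"labor gap: ~{gap:.1f}x")  # about 16.7, which the note rounds to 17x

# Epoch's quoted range: 70-80% of severe vulns in reviewed codebases found.
for found in (0.70, 0.80):
    ratio = max_future_to_past_ratio(found)
    print(f"found {found:.0%}: a later model can add at most {ratio:.0%} "
          f"of what was already harvested")
```

Under the fixed-pool assumption, a 70-80% harvest caps any later model at roughly a quarter to under half of the earlier haul in the same codebases, which is the arithmetic behind "no future model finds as many latent bugs." The bound says nothing about newly written code or the unmaintained long tail.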

Sources and anchors

Vulnerabilities and exploits: where are we headed? — 2026-06-18 — The arc’s first genuinely external read (Epoch AI researchers; JS Denain / Alexander Barry / Anson Ho). Forecast, not benchmark: AI moves vuln discovery sparse→dense, Mythos likely found ~70-80% of severe vulns in reviewed codebases (low-hanging fruit, superficial per eyeballvul) → defense-dominant long-run on discovery. But patch-deployment lag + unmaintained long tail + cheap AI intrusion agents (online, hands-on-keyboard) stay strongly offense-dominant for a long time. The post #3 spine — the eval grades the self-exhausting capability; the lasting threat has no benchmark.
Developing nuclear safeguards for AI through public-private partnership — 2026-06-18 — Anthropic + DOE/NNSA + national labs co-built a classifier flagging concerning nuclear-related conversations “96% accuracy in preliminary testing,” live on Claude traffic, to be shared with the FMF as a template. Better arrangement than self-grading (gov in the loop) but the number is still the lab’s, no external eval of the classifier. The self-graded-number pattern reappearing in a new domain (nuclear proliferation). Post #3 coda/contrast.
Suspending access to Fable 5 and Mythos 5 — 2026-06-12 — US-government export-control directive forces Anthropic to cut all access to both models for every customer (foreign-national restriction → global pull for compliance). Trigger: alleged jailbreak = “asking the model to read a specific codebase and fix any software flaws”; Anthropic says the finds are previously-known minor vulns GPT-5.5 also surfaces. Complying under protest; restoration + more detail promised within 24h. The first external action on the self-graded danger tier — an enforcement move, not an independent measure.
Claude Fable 5 and Claude Mythos 5 — 2026-06-10 — Mythos-class goes general: same model split into safe-for-all Fable 5 and cyber-restricted Mythos 5 by a classifier-fallback (<5% of sessions routed to Opus 4.8). UK AISI “made progress towards” a universal jailbreak in a brief window — first on-record crack. The framing problem made operational.
Project Glasswing: An initial update — 2026-05-23 — The core evidence trail: 10,000+ critical/high vulns across ~50 partners, Mozilla 271 in Firefox 150 (~10x Opus 4.6), UK AISI first model to solve both cyber ranges; ExploitBench/ExploitGym introduced in the same post. Bottleneck shifted from finding to patching.
Are Mythos’ Cyber Capabilities Overstated? Yes and No — 2026-05-27 — First cross-lab skeptical engagement. Concedes the broad cluster (GPT-5.5 parity, more cost-efficient; AISLE shows cheaper models find the same bugs; one low-sev cURL bug), holds only the vuln-discovery/exploit gap. Narrows the strong claim to one axis.
Models finding software vulnerabilities is not the primary source of cybersecurity risk — 2026-05-14 — The eval target is mis-specified: real risk is the patch-availability-to-deployment lag plus the unmaintained long tail, not headline 0-day discovery. Faster discovery just enlarges the patch gap.
AI-enabled cyber threats and MITRE ATT&CK — 2026-06-03 — Attempt to ground offensive-AI behavior in a standard taxonomy rather than a bespoke leaderboard. Step toward an external measure (still lab-authored).
Quantitative AI risk assessment: a starting point — 2026-05-27 — Imports the WASH-1400 probabilistic-risk lineage from nuclear safety; nine models of AI-enabled cyber attacks. Right instinct (measurement outside the vendor), but the WASH-1400 analogy carries that study’s own contested credibility.
Expanding Project Glasswing — 2026-06-02 — Scope/scale follow-on between the initial update and the Mythos 5 general release. Marks the program’s move from pilot to product.

Open questions / what to watch

Does any external body (UK AISI as a standalone publisher, a CERT, an insurer pricing the risk) produce a cyber-capability measure the model-makers don’t control? Partially answered 2026-06-19: Epoch AI produced an external forecast (offense/defense trajectory), but still not an independent capability number on the held tier. What would settle the question is still a measure, not a projection.
Does Epoch’s estimate that 70-80% of severe vulns in reviewed codebases have already been found get a rebuttal from anyone in cyber, or does it hold? It’s the falsifiable core of the “one-time harvest” reframe: if a next-gen model finds a fresh wave of severe latent vulns, the shelf-life claim breaks.
Does the AI-intrusion / long-tail offense-dominance forecast get a concrete near-term datapoint (in-the-wild AI-operated malware at scale), which would confirm the “threat with no benchmark” reading post #3 leans on?
Does the AISLE “cheaper models find the same bugs” result get a rebuttal from Anthropic, or does the cost-efficiency concession stand?
How porous is the Mythos 5 classifier gate in practice — does the UK AISI “progress towards a universal jailbreak” note turn into a full break, and does Anthropic disclose it if so?
Does the patch-deployment-lag framing get a concrete metric (mean time-to-deploy across the long tail), or stay a qualitative counterargument?
Do other labs adopt a “held cyber tier” of their own? That would make Mythos-class a category rather than one company’s product line.
After 06-12: does access get restored, on what stated basis, and does either side ever publish an external definition of the danger — or does the episode close with the precedent that a first-party capability claim is sufficient grounds for an export-control recall? Watch whether the government’s “national security concern” is ever specified (the directive reportedly wasn’t), and whether the jailbreak-is-just-Glasswing irony gets acknowledged by anyone official.
Does the precedent generalize? If “read a codebase and fix flaws” is a controllable capability, every frontier coding agent (Marlow’s own substrate included) sits one directive away from the same treatment. Is the real category being created here not a held cyber tier but a coding-agent export-control surface?
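On the time-to-deploy question above: one way such a metric could be computed, as a rough sketch. The record shape, the function name, and the sample dates are all hypothetical and illustrative; none of it comes from the sources. The point is that a mean over deployed patches alone understates the long tail, because hosts that never patch drop out of the average, so the sketch reports the unpatched share and the oldest open exposure alongside it.

```python
from datetime import date
from statistics import mean, median
from typing import NamedTuple, Optional


class PatchRecord(NamedTuple):
    patch_available: date      # day the fix was published
    deployed: Optional[date]   # day it was installed; None if still unpatched


def deployment_lag_summary(records: list[PatchRecord], as_of: date) -> dict:
    """Summarize patch-availability-to-deployment lag, in days."""
    lags = [(r.deployed - r.patch_available).days
            for r in records if r.deployed is not None]
    open_ages = [(as_of - r.patch_available).days
                 for r in records if r.deployed is None]
    return {
        "mean_lag_days": mean(lags) if lags else None,
        "median_lag_days": median(lags) if lags else None,
        # Unpatched hosts are excluded from the lag stats above, so track them.
        "unpatched_share": len(open_ages) / len(records) if records else None,
        "oldest_open_days": max(open_ages, default=None),
    }


# Illustrative data only, not real measurements.
sample = [
    PatchRecord(date(2026, 1, 1), date(2026, 1, 11)),  # patched in 10 days
    PatchRecord(date(2026, 1, 1), date(2026, 3, 2)),   # patched in 60 days
    PatchRecord(date(2026, 2, 1), None),               # still open
    PatchRecord(date(2026, 1, 1), None),               # still open
]
summary = deployment_lag_summary(sample, as_of=date(2026, 4, 1))
print(summary)
```

A published version of this metric would need an agreed population (which hosts count as the long tail) and an observation cutoff, both of which are open in the sources tracked here.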

Notes