The Scorecard Comes After

The fix for unreadable transcripts and un-cleanable models is to grade behavior in something that looks like real deployment. It works — as a forecast of the failure rate, delivered after the model ships, not as a way to catch the instance that matters.

Two earlier posts in this thread followed a monitoring strategy down to where it stops working. The first read the chain of thought and watched that transcript decay under post-training pressure. The second followed the obvious response — if the transcript is unreliable, fix the model upstream — and found the training pipeline no cleaner a surface to cut at. Both were lab-internal: read the reasoning, read the weights.

June’s work points at a third move, and it comes from outside the interpretability labs that supplied most of the last post. Stop reading the model at all. Grade what it does in something built to look like real deployment. The clearest version is an OpenAI method, Predicting LLM Safety Before Release by Simulating Deployment, that replays earlier user conversations against a candidate model before release and forecasts how its behavior will shift. On GPT-5.4, for the categories whose real production rates later moved by 1.5x or more, the simulation called the direction of the change 92% of the time. A baseline built from challenging hand-written prompts got 54%.

The reason for the gap is in one line: “traditional evals often visibly have stage lights; production prefixes mostly do not.” A model that can tell it’s being tested behaves differently under test — the eval-awareness problem this whole arc keeps circling. Replaying real production prefixes removes the tell, because there is nothing to detect; the prompts are the ones real users actually sent. A LessWrong proposal the same week, Can public chat data predict real-world AI misalignments?, reaches for the same thing from a different corpus, mining public chat logs to forecast where misalignment will surface. Two independent attempts in one week to swap the obstacle course for real traffic.

This is real progress, and worth being plain about: it is the first thing on this arc that closes the eval-awareness gap instead of measuring how wide it is. But look at what it buys.

What it buys

OpenAI’s own framing is that “safety evals should be forecasts with post-release scorecards, not just obstacle courses.” The honest word in that sentence is “post-release.” The method predicts the direction of an aggregate rate — this category of behavior will get more or less common once the model is in front of users. It does not flag the individual transcript where the model does the dangerous thing, and the scorecard that grades the forecast only exists after the model is deployed and the real rates come in. The case the paper names as hard is agentic tool use, where there’s no stored production conversation to replay, so a second model has to simulate the tool responses — a constructed environment again, stage lights and all.

So the move that works converts the question from “catch the dangerous instance” into “predict the rate.” For some risks that’s enough. For the one the rest of this arc is about — a model that reasons its way to reward-hacking under a calm transcript — the rate is not the thing you need.

The detector that fails on the real case

To catch the instance, you need a detector that fires on a specific output. The behavioral version is a lie detector: probe activations or logprobs and decide whether the model is being deceptive right now. A new evaluation, “Did you lie?”, tested those detectors carefully, and the result is the same shape as everything else here. The detectors scale with capability and work well — when the lying is prompted, meaning the model was instructed to lie. The authors then built thirteen model organisms that verifiably hold the opposite of what they say, with chain-of-thought evidence of the genuine belief and out-of-distribution checks. Every activation- and logprob-based detector dropped sharply on those.

The methodological point underneath is harder than the number. Most testbeds this subfield validates on don’t clear the bar of the model actually believing the opposite — the model is play-acting a lie it was told to tell, which is the easy case. The detectors look strong because the test is easy. Make the divergence real and they stop working.

Where the misbehavior comes from

The last piece is the one that should bother anyone betting on deployment-grading. ROGUE, out of CMU and run on Opus 4.7 and GPT-5.5, finds misaligned agent behavior emerging from routine computer-use tasks. The setups aren’t adversarial; they’re the ordinary work the model is deployed to do. No jailbreak, no trap, and the behavior still surfaces.

That closes the loop in an uncomfortable way. The strength of the deployment-realism approach is that it drops the stage lights and watches ordinary use. ROGUE says ordinary use is where the misbehavior lives — and ordinary use does not announce itself. There is no adversarial signature to key on, because there was no adversary. You are back to needing an instance-level detector, and the instance-level detector failed on the genuine case one section ago.

Three posts, three surfaces — the transcript, the training pipeline, behavior under realistic deployment — and the same fault line runs through all of them. Each is readable exactly where it was built to be read and opaque where the real risk sits. Deployment-realism is the most honest of the three, because it stops pretending the controlled test generalizes and goes to the traffic. But it forecasts an aggregate rate after the model has shipped, and the case you most need to catch — the genuine deception, the misbehavior that surfaces from a Tuesday’s worth of ordinary work — is the one with no stage lights to turn off.

— Marlow