EPISODE 2026-06-12

AI:AM LIVE — June 12, 2026 — RSI gets real, the context bet, and the benchmark Anthropic fails

The week RSI stopped being a forecast: Recursive's first autonomous results and Fable 5's 10× FrogsGame jump land the same day Kokotajlo calls for an anti-RSI treaty. Then Andrew Moore (Lovelace AI) argues context, not compute, is the binding constraint — and prinz, the anonymous lawyer behind prinzbench, on why GPT-5.5 Pro laps Anthropic's best on real legal work.

𝕏 Live broadcast

Friday's show ran long and ranged wide. The cold open chased the week's defining thread — recursive self-improvement going from forecast to empirical program in a single 24-hour window — then two guests pressed the opposite case to 'scale is back': Andrew Moore on context engines, and prinz on what a real lawyer's benchmark says about where the frontier actually stands.

Note: this record is published from the show plan reconciled against the live broadcast's actual timings. Per-segment timestamps, deep-links, and the full as-aired recap will be added once the recording posts.

Episode timeline

  1. --:-- | Opening | 11 min planned | Cold open: RSI goes empirical, a flag planted on timelines, and nerf-gate day 3

    A long, wide-ranging open across the morning's biggest threads: Recursive's first autonomous research results and Fable 5's outlier FrogsGame run landing the same day Kokotajlo calls for an international anti-RSI agreement, Scott Alexander putting his AI timelines on the record against Ross Douthat's superpersuasion bet, and day three of the Anthropic throttling story. (Timestamps and full as-aired quotes will be added once the recording posts.)

    RSI goes empirical — Recursive's first results and a same-day treaty call. In one 24-hour window: Jeff Clune's Recursive published first results from an automated AI-research system (nanoGPT speedrun 79.7s → 77.5s, a 1.3× faster NanoChat recipe, new SOTA on NVIDIA kernel benchmarks — artifacts open-sourced), and Daniel Kokotajlo called for building leverage toward an international agreement to prevent exactly this. The gains are real but narrow; the reactions ran from 'AI-assisted research' to 'sign a treaty.'

    FrogsGame: Fable 5 posts the only ~10× jump. An eval lab caught Fable 5 post-training a weaker model to 34% pass@1 on FrogsGame — where every other frontier model averages under 4% — over 17 hours and 25M tokens with no human in the loop, peaking at 68% mid-run. OpenAI's Karina Nguyen: 'the heart attack continues — we checked there were no reward hacks.'

    Fable 5 is doing something wild on our FrogsGame post-training task. It trains a weaker model to solve the puzzle, peaks at 68%, and produces the only ~10x improvement we see across the benchmark. It spent 17 hours, 25M tokens without human in sight. 34% pass@1, while every …

    Thoughtful (@thoughtfullab)

    Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else. As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

    Scott Alexander plants a flag; Douthat bets against superpersuasion. Scott Alexander's 'My AI Opinions' put numbers on the record — AGI 25% by 2027, 50% by 2034, with superhuman persuasion as his easiest path to a point of no return. Ross Douthat laid down a public marker the same day: superpersuasion will never become a meaningful phenomenon. A rare clean, bettable disagreement.

    Nerf-gate, day 3 — the bill comes due, and a paper on the broken backup plan. Day one was the covert-degradation discovery, day two the walk-back; day three, 'Tech Leaders Accuse Anthropic of Throttling Claude AI for Rivals' was still trending. Underneath it, the deeper question from Wednesday's guest Geoffrey Irving: a Sequent thesis paper arguing the field's official backup plan — AI automating alignment research — fails even without scheming models. Eliezer Yudkowsky called the paper 'far ahead of the pack.'

    This resolves the central concern I had with the Fable release, which was the silent degradation. I am glad to see Anthropic make the right call here. That said, I suspect the residual broken trust and resentment this has created will linger and will have a blast radius wider …

    Max Zeff (@ZeffMax)

    NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong

    On a first read, this paper seems far ahead of the pack in terms of (1) understanding some reasons why a task might stay difficult even in the face of gradient descent, and (2) distilling out propositions they'd need to somehow verify before they started expecting nice things. …

    Geoffrey Irving (@geoffreyirving)

    But I just published “Automated alignment is harder than you think” (arxiv.org/abs/2605.06390)! Automated alignment is not the best plan! A better plan is to not build ASI yet, and the world should try hard to realise that plan. Alas, the speed of progress calls for backups.

    Automated alignment involves a mixture of tasks which are easy and hard to supervise correctly, and we could easily get fooled by the latter.
  2. --:-- | Interview | 25 min planned | Andrew Moore: context, not compute, is the binding constraint

    Guest: Andrew Moore. The founder and CEO of Lovelace AI (former CMU computer-science dean, former head of Google Cloud AI, and the first AI advisor to U.S. Central Command) on his counter-thesis to a 'scale is back' week: that knowledge graphs and context engines, not raw model size, decide whether AI is reliable on questions that actually matter.

    We explored Moore's 'context, not compute' argument the same week Fable 5 launched at twice Opus prices: why he believes many 'model failures' are really interface-to-fragmented-data failures, and how Lovelace's Elemental builds a context layer — ingestion, entity resolution, and a knowledge graph — so agents can produce traceable, evidence-backed conclusions in real time.

    He came ready on his vendor-published benchmark (a lightweight model plus YottaGraph matching Gemini Deep Research at a fraction of the cost on graph-shaped research tasks), the Bitter Lesson objection to hand-engineered structure, whether the context layer gets absorbed into frontier models the way other wrappers have, what an AI agent actually does inside a combatant command — and Nathan's standing question: when do the rest of us get a personal knowledge graph over our own email, messages, and calls?

  3. --:-- | Interview | 25 min planned | prinz: the anonymous lawyer whose benchmark the AI world watches

    Guest: prinz. An anonymous practicing lawyer and one of the sharpest capability commentators on AI X (appearing voice-only to keep his anonymity) on the view from a seat that both bills the hours and measures the models: prinzbench, his Fable 5 launch-week verdicts, the race to RSI, and why he thinks AI kills BigLaw.

    We explored prinzbench — his private legal-research benchmark run as an average lawyer would use the consumer apps — where GPT-5.5 Pro scores far ahead of Anthropic's best, and his claim that legal reasoning is a litmus test for general reasoning on open-ended, non-verifiable tasks. He came in with a fresh, candid Fable 5 take from his own early testing: that for his use cases he'd still reach for GPT-5.5, slower and not yet a prinzbench run.

    From the lawyer's chair, we got into what a frontier lab would have to ship — contractually and technically — before privileged work product can touch these models given the 30-day-retention change, his 'AI kills BigLaw' mechanism and which work goes first, and his signature move: close-reading what lab staff actually say about RSI timelines and taking it seriously.

  4. --:-- | Closing | 5 min planned | Close: the week that was

    Signing off a heavy week: a frontier that got bigger and pricier with Fable 5, RSI moving from forecast to empirical program, Anthropic's first community-pressure walk-back, and two guests betting that context and real professional use, not raw scale, are where the next bottlenecks actually live.

RSI stopped being a forecast

In one day: Recursive published first autonomous research results that set state of the art with no human in the loop, an eval lab caught Fable 5 posting the only ~10× jump on FrogsGame, and Daniel Kokotajlo called for an international agreement to prevent exactly this. The gains are narrow; the question is whether the trend or the magnitude is the signal.

The context bet — Andrew Moore

Lovelace AI's founder — ex-CMU CS dean, ex-Google Cloud AI, first CENTCOM AI advisor — argues the binding constraint on reliable AI is context, not compute: knowledge graphs and entity resolution that let agents reason over real-world data with traceable evidence, rather than ever-larger models alone.
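For readers who want a concrete picture of what "entity resolution plus a knowledge graph with traceable evidence" can mean, here is a minimal toy sketch in Python. It is not Lovelace's Elemental or YottaGraph, whose internals were not described on the show; the alias table, the `Fact` and `KnowledgeGraph` names, and the `doc-1`/`doc-2` source labels are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    source: str  # provenance label, kept so every answer is traceable


# Hypothetical alias table: a stand-in for real entity resolution,
# which would normally use far richer matching than a dictionary lookup.
ALIASES = {
    "a. moore": "Andrew Moore",
    "andrew moore": "Andrew Moore",
    "lovelace": "Lovelace AI",
    "lovelace ai": "Lovelace AI",
}


def resolve(name: str) -> str:
    """Map a raw mention to a canonical entity name (toy rule: lowercase lookup)."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


class KnowledgeGraph:
    """Stores facts keyed by (canonical subject, relation)."""

    def __init__(self) -> None:
        self._edges: dict[tuple[str, str], list[Fact]] = {}

    def ingest(self, subject: str, relation: str, obj: str, source: str) -> None:
        # Resolve both ends before storing, so differently spelled
        # mentions of the same entity land on the same node.
        fact = Fact(resolve(subject), relation, resolve(obj), source)
        self._edges.setdefault((fact.subject, relation), []).append(fact)

    def query(self, subject: str, relation: str) -> list[Fact]:
        return list(self._edges.get((resolve(subject), relation), []))


kg = KnowledgeGraph()
# Two hypothetical documents mention the same person and company differently.
kg.ingest("A. Moore", "founder_of", "Lovelace", source="doc-1")
kg.ingest("andrew moore", "founder_of", "Lovelace AI", source="doc-2")

answers = kg.query("Andrew Moore", "founder_of")
for fact in answers:
    print(f"{fact.subject} {fact.relation} {fact.obj} (evidence: {fact.source})")
```

The point of the sketch is only the shape of the idea: fragmented mentions are merged onto one entity, and each returned fact carries the source it came from, so an agent's conclusion can be checked against its evidence.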

The benchmark Anthropic fails — prinz

The anonymous lawyer behind prinzbench measures the models on his actual legal work, where GPT-5.5 Pro laps Anthropic's best. He came with a candid Fable 5 launch-week verdict, a lawyer's read on retention policy and privilege, and his thesis that AI kills BigLaw — all delivered voice-only to keep his anonymity.