EPISODE 2026-06-18

AI:AM LIVE — June 18, 2026 — AI Studios, Agent Factories, and Code: Judd Rosenblatt, Eno Reyes, Andrey Breslav

The opening covered the day's biggest AI stories: Midjourney founder David Holz's announcement of a whole-body ultrasonic CT scanner framed as the first vivid "AI dividend," Noam Shazeer's surprise move from Google to OpenAI, the Trump administration's demand for "uncircumventable" guardrails as a condition for Fable's return, and alphaXiv's paper-replication agents democratizing ML research. Then three guests: Judd Rosenblatt of AE Studio on gradient routing and neglected approaches to alignment; Eno Reyes of Factory on the software factory vision and why model-independence is the real moat; and Andrey Breslav — creator of Kotlin — on CodeSpeak and why "intent recovery" is the next layer of software engineering.


Thursday's show paired a wide-ranging opening — the AI dividend made literal, the biggest talent move of the cycle, and an incoherent guardrails demand — with three guests representing three distinct bets on where AI takes software and safety next.

Judd Rosenblatt of AE Studio made the case for "neglected approaches" to alignment and previewed gradient routing, a pretraining technique that routes dangerous capabilities into ablatable expert modules. Eno Reyes of Factory argued that the model layer is a commodity and the harness — not the frontier lab — is the durable moat in autonomous software engineering. And Andrey Breslav, who spent a decade designing Kotlin, explained why the future of software development is intent recovery and specification-first engineering with CodeSpeak.

The rundown

0:43 · Opening · 42 min
Opening: AI Studios, the Talent War, and Guardrail Theater
David Holz's Midjourney Medical whole-body ultrasonic scanner as the first AI dividend; Noam Shazeer joining OpenAI as the talent war's biggest move; the Trump administration's demand for uncircumventable guardrails as a technically incoherent bar; and alphaXiv's paper-replication agents democratizing ML research.
As aired
Nathan Labenz and Prakash opened the June 18 show with sustained enthusiasm for a cluster of stories they read as unambiguously positive signals for AI's real-world impact. The dominant topic was Midjourney founder David Holz's announcement of Midjourney Medical — a full-body ultrasonic computational tomography scanner promising a 60-second whole-body scan at minimal cost, with a stated goal of 50,000 units performing a billion scans a month. Both hosts framed it as a vivid early example of the "AI dividend": profits from frontier AI being reinvested into life-improving physical-world infrastructure by founders who, as Prakash put it, can deploy capital in ways that pension funds and institutional investors structurally cannot. Nathan connected the announcement to his own family's cancer experience, arguing that the coming wave of diagnostic abundance will be impossible for regulators or the medical establishment to stop — and that fears of false positives underestimate both patients and the AI interpretation layer that will rapidly improve once scan data exists at scale.
The hosts then turned to the day's biggest talent news: Noam Shazeer, co-author of the original transformer paper and most recently running Gemini at Google after Google's $2.6 billion acquisition of Character.AI, announced he is joining OpenAI. Prakash played a clip of Shazeer articulating an expansive, fast-timeline vision of AI compute consumption — personal AI cabinets, orders-of-magnitude GDP growth, solar-scale energy for data centers — and framed his departure as evidence that Shazeer felt his accelerationist timeline was no longer achievable inside Google, whether due to compute prioritization, strategic misalignment, or the organizational reality of sharing resources with YouTube during the Super Bowl. Nathan agreed the move is a strong signal of mission and impact, not economics, and added a note of caution: if Shazeer is moving because he expects a phase change toward recursive self-improvement and thinks OpenAI is better positioned for it, that's a reason to at least pause before celebrating.
The opening closed on two more items. On the government's demand that Anthropic demonstrate "uncircumventable" guardrails before Fable 5 can be rereleased, the hosts were blunt: the demand is technically incoherent — equivalent to demanding bug-free code — and Nathan argued the absence of any stated bad behavior from the model makes motivated regulatory targeting the most likely explanation, though Prakash urged a Hanlon's-razor read of bureaucratic inertia. Finally, the hosts greeted alphaXiv's autonomous paper-replication agents — which ingest arXiv repos, resolve broken setups, and let users sort papers by ease of implementation — as a quiet but powerful democratizing force: AI doing the unglamorous science that validates or debunks the claims underneath the field's rapid progress.
Key moments
I so often say the scarcest resource is a positive vision for the future, and this is one that it seems like everybody could get behind — the idea that you could get a one-minute scan and there could be radical abundance in seeing inside our own bodies.
Nathan Labenz · 2:31
What has struck me is that there's going to be a capital explosion as all the money being made in SpaceX, Anthropic, and OpenAI goes into the hands of fairly young, very empowered, technologically sophisticated entrepreneurial people — and they're going to be able to deploy this capital in ways that the normal pension fund trustee would not like or would not be able to support.
Prakash · 3:21
It's about equivalent to asking for bug-free code at this point.
Prakash · 29:39
What we covered
Midjourney Medical: the AI dividend, made literal
David Holz took Midjourney's profits and announced a full-body ultrasonic computational tomography scanner: a 60-second whole-body scan claimed to be higher-resolution than MRI and nearly free to run, targeting a fleet of 50,000 units performing a billion scans a month. Both hosts framed it as the first vivid example of the "AI dividend": frontier-AI profits reinvested into life-improving physical-world infrastructure by founders who, as Prakash put it, can deploy capital in ways institutional investors structurally cannot.
Midjourney Medical announcement ↗
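As a rough sanity check on the fleet target quoted above, the sketch below works out how busy each scanner would have to be. The 50,000 units, one billion scans a month, and 60-second scan time come from the show; the 30-day month and the absence of any changeover time between patients are simplifying assumptions made here, not figures from the announcement.

```python
# Figures quoted on the show
FLEET_UNITS = 50_000
SCANS_PER_MONTH = 1_000_000_000
SECONDS_PER_SCAN = 60

# Assumption for illustration only (not from the announcement)
DAYS_PER_MONTH = 30


def required_scan_hours_per_day(units, scans_per_month, seconds_per_scan, days_per_month):
    """Hours of pure scanning each unit needs per day to hit the monthly target."""
    scans_per_unit_per_day = scans_per_month / units / days_per_month
    return scans_per_unit_per_day * seconds_per_scan / 3600


scans_per_unit_per_month = SCANS_PER_MONTH / FLEET_UNITS
hours = required_scan_hours_per_day(
    FLEET_UNITS, SCANS_PER_MONTH, SECONDS_PER_SCAN, DAYS_PER_MONTH
)

print(f"{scans_per_unit_per_month:,.0f} scans per unit per month")
print(f"{hours:.1f} hours of back-to-back scanning per unit per day")
```

Under those assumptions each unit would need roughly eleven hours of back-to-back scanning every day, so any time spent getting people in and out of the tank pushes the required operating hours higher.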
Noam Shazeer → OpenAI: the talent war's biggest move
Noam Shazeer, co-author of "Attention Is All You Need" and most recently running Gemini at Google after the $2.6B Character.AI acquisition, announced he is joining OpenAI. The hosts read it as a mission-and-timeline signal, not an economic one: Shazeer's accelerationist vision (orders-of-magnitude GDP growth, solar-scale compute) likely felt more achievable at OpenAI than at Google.
Noam Shazeer announcement ↗
The government's impossible demand: "uncircumventable" guardrails
WIRED reported the administration's condition for Fable's return: Anthropic must ensure the model's guardrails cannot be circumvented. Both hosts noted this is equivalent to demanding bug-free software, which makes it technically incoherent, and Nathan argued that the absence of any stated bad behavior from the model makes motivated regulatory targeting the most likely explanation, with Prakash urging a Hanlon's-razor read of bureaucratic inertia.
WIRED: uncircumventable guardrails demand ↗
alphaXiv: agents that replicate the papers
alphaXiv deployed autoresearch agents that ingest popular arXiv repos, fix broken setups, and get core claims actually running, letting users sort papers by ease of implementation. The hosts framed it as AI doing the unglamorous science that validates or debunks the claims underneath the field's rapid progress, and a democratizing force for anyone who wants to stand on the shoulders of ML research giants.
alphaXiv paper replications ↗
Full transcript
Lightly edited · timestamps jump to YouTube
0:47
Prakash: Good morning. It is Thursday, June 18, 2026, and we have had a very eventful twenty-four hours. Nathan, good morning.
0:54
Nathan Labenz: Good morning, Prakash. How are you?
0:56
Prakash: Very good. What has struck your attention?
1:01
Nathan Labenz: Well, I think the whole AI world — from what I saw in the timeline last night — was just blowing up over Midjourney Medical's announcement and the beautiful launch video they used to show off their plans. I imagine everybody tuning in has already seen this, so I won't belabor it too much. But in short, they are planning to do a whole-body ultrasound scan productized as a kind of wellness offering — literally one minute, they say. You get lowered into a tank of water, which is needed because ultrasound doesn't propagate well through air.
1:46
Nathan Labenz: You get dropped into this pool of water, an ultrasound array all around you emits in a rotating fashion, bounces off you in a full 360 degrees, is captured and interpreted. And maybe one of the most striking things in what they showed is just really beautiful visualizations — obviously right in Midjourney's wheelhouse. Above all, I thought this was a great example of an AI company putting forward a truly different but fundamentally positive vision for the future.
2:31
Nathan Labenz: I so often say the scarcest resource is a positive vision for the future, and this is one that it seems like everybody could get behind. The idea that you could get a one-minute scan, that there could be radical abundance in seeing inside our own bodies, and even make it beautiful in a way that I think has more medical consequences than immediately meets the eye — I thought it was awesome, inspiring, and exactly the sort of thing we need more of. I would love to see more of this come from other people who have suddenly become very wealthy in the AI boom.
3:21
Prakash: I had a lot of thoughts on it. Number one: I love David Holz. He's been one of my favorite people in AI for a long time. It struck me that he used the wealth he obtained from Midjourney in such a positive way. And it also struck me that this is only the beginning, because one of my theses is that there's going to be a capital explosion as all the money being made in SpaceX, Anthropic, and OpenAI goes into the hands of fairly young, very empowered, technologically sophisticated entrepreneurial people.
4:06
Prakash: They're going to be able to deploy this capital in ways that the normal pension fund trustee would not like or would not be able to support. To a large extent, risk-taking in the US and globally is driven by pension funds and insurance organizations that are regulated by the state, and a lot of bets are not made for that reason. So one of the things I'm very excited about is that as wealth devolves into these hands — people able to make large bets — we're going to see more of them getting made. The second thing that struck me was that he decided to go into medical.
4:52
Prakash: Medical is 18% of the US economy, and it's always a growing percentage because whatever wealth we have, we end up using to improve our health. What struck me is that it has always been a very hard area to innovate in because of regulation, privacy issues, and so on. So he decided to tackle the hard problem — kudos to him. And as Nathan has pointed out before, the medical establishment is not on board with this. During COVID, for example, labs in Seattle back-tested samples for COVID and found cases even before the official announcements. They wanted to alert the positive patients but were banned from doing so by the FDA.
5:37
Prakash: The FDA doesn't like it when you test for something without prior consent, or disclose findings you haven't ethically cleared for disclosure. So the FDA actually banned those Seattle labs from notifying patients. And this is something that has always struck me about the US regulatory system: the FDA is very watchful because they don't want people to receive technical information about themselves and misinterpret it — the worry being someone acts on a 95% breast-cancer-risk reading without proper counseling. They don't believe the public is ready to receive very technical diagnostic information directly. This is not widely known.
6:22
Prakash: Everyone thinks they have a right to information about their own body. The FDA doesn't agree. Holz started by saying they would only do body composition — essentially body fat measurement — which is a big deal because when you take GLP-1s, you lose both fat and muscle. With this kind of body composition monitor, you can figure out how much muscle you're actually losing and whether continuing to lose weight is detrimental. That's an important use case, and also one the FDA doesn't regulate heavily, since current bone-density machines are essentially low-dose X-rays. This would give you the same information without ionizing radiation.
7:43
Nathan Labenz: If the government thinks they're going to block people from using this technology, I think they're going to have a real fight on their hands. I've talked about this probably ad nauseam at this point, but in the whole cancer experience I recently went through, I was already gearing up for battles on so many fronts. Fortunately, my son didn't have to get off the standard treatment protocol; it worked for him, and we never had to actually try to get our hands on the exotic alternatives we were scouting.
8:28
Nathan Labenz: Just the DNA testing we did, which is not standard — and which, fortunately, our oncologist supported without much trouble — fundamentally changed my information landscape and how confidently I could believe he was, in fact, cured. I think we're at over 99% confidence now given all those results; we wouldn't have gotten there otherwise. And in terms of the hypotheticals — well, what if this next test came back slightly positive? The answer was essentially, we wouldn't treat on that anyway, we'd really need to wait for gross disease. I just think people are not going to be content with that for much longer.
9:13
Nathan Labenz: When we have these technologies — and especially this one, which is what makes it so promising — a little dose of skepticism is warranted. Will this ever actually ship? I don't mean to cast doubt, but it's not insane to wonder. But assuming they can deliver on their promise, the fact that it takes a minute and therefore will probably be pretty cheap, and the fact that it's so beautiful to look at — people will be able to study their own scans in a really effective way.
9:58
Nathan Labenz: Of course, all the AI interpretation on top of it as well — the medical establishment's reactions don't really account for how fast that's likely to progress once you have big data at this scale. The responses have been: well, ultrasound doesn't see this that well, we don't recommend whole-body scans because of false positives, and so on. That all feels like fighting the last war — a scarcity mindset on multiple levels. In a good future, people should have both the time and the motivation to think about their own health.
10:44
Nathan Labenz: Looking inside one's own body this way is just going to be captivating for a lot of people. I mean, how much time goes into how we look already? Now there's a way to look inside and potentially find things of interest. Sure, people will have scares that turn out to be nothing — that'll happen all the time. But I think it really underestimates people, and it really underestimates the technology and how refined it's likely to get, especially if they can scale it out and create the scan abundance they describe. False positive issues, I think, will fade away pretty quickly.
11:29
Nathan Labenz: We'll get good at reading these scans, both because nobody is more motivated to study them than the patients themselves, and when it's made beautiful and accessible, the latent potential for someone to read their own scan and triangulate it with other data or how they're feeling — that will be very powerful. But the refinement the AI layer will put on top of this as well is just something you don't hear any understanding of in the reactions from the cautious establishment that have come up so far.
12:17
Prakash: One of the things that struck me about the Midjourney Medical announcement is that this is what I would call an AI-native idea — in the sense that when David Holz decided to do something, he looked forward three to five years at things not yet available, and focused on creating new data. We've seen this with Periodic Labs and a couple of other firms where the defensibility of the idea comes from the fact that they have data no one else has, and they're the ones creating it. This is quite different from early foundation models and language models, which operated on other people's data — Reddit data and the like.
13:02
Prakash: The defensibility really comes from owning your own data at this point. And David Holz is a pioneer of this because when they set up Midjourney, one thing they did quite differently — which they stumbled upon early — was the idea of doing fast generation first and then forcing the user to pick which ones to upscale. They basically imposed A/B testing on image quality from the very beginning. And that, I think, helped them get to much higher quality much faster than they otherwise would have.
13:59
Nathan Labenz: There was a little weirdness on the audio there — do you hear me?
14:03
Prakash: Yes, I hear you.
14:04
Nathan Labenz: These data bootstraps are just the first step toward a much brighter future, really. I think back to how image understanding and image-text unity really started with some pretty rough stuff — the original CLIP was taking a huge number of photos, pairing them with their captions, and just trying to get an ML system to align its understanding of images with the understanding of captions.
14:50
Nathan Labenz: The problem initially was that those captions were generally very bad — did they even describe what was in the photo? If you think of photos on the internet and the way people caption them, some are very literal if they're product photos, but many are captioned with jokes or memories or notes to one another. All this meant the data was extremely noisy. And yet they were able to get enough of a signal out of that to open one notable path to image generation.
15:35
Nathan Labenz: With each generation of model that followed, a huge part of what they were doing was refining the data — using new captioning ability to clean it up, to refine the process, to get image and text more and more closely aligned, until we're at this point now where we have deep fusion. That process is happening over and over again in different places — Jim Fan, who we talked about yesterday with his open-source robotics kit, has previously articulated a very similar path for robotics.
16:20
Nathan Labenz: All kinds of video data can initially be a noisy source, but as they refine it there'll be models that translate random videos into first-person POV videos, which can then be used to train. All these little tricks will play out over and over again across all these different modalities. And this, I think, means that the ultimate quality of these ultrasound scans and our ability to interpret them will be far, far beyond what you get when a technician looks at an ultrasound today — and that's already not bad.
17:05
Nathan Labenz: This will be so much better. You could sort of see who has figured out that this pattern is going to repeat across all these different modalities — and who hasn't — based on whether they're taking that into account in their reactions.
17:35
Prakash: Maybe in terms of looking at reactions in AI — let me segue to one of the big pieces of news yesterday, which is Noam Shazeer. Noam Shazeer is one of the authors of the original transformer paper and has been one of the leading lights in AI for a couple of decades. Yesterday he made the bombshell announcement that he's leaving Google. He said: "I'm excited to share that I'll be joining OpenAI and look forward to working with the exceptional team there." This comes after he'd been at Google for about two years following their $2.6 billion acquisition of Character.AI — just to get their hands on him. And here he is, eighteen to twenty-four months later, leaving Google again. Nathan, what are your reactions?
18:34
Nathan Labenz: Well, this explains why I couldn't get him on the podcast while working through Google's comms team, for one thing. It's big news for sure, and I also don't know how big it is at the same time. Macro: a very significant percentage of everyone working at frontier companies right now has some history at Google DeepMind over the years. In a way, this all came from Google — the whole space is sort of a Google diaspora. And yet they continue to be very serious players with, I would say, probably still the deepest talent pool and the broadest set of research bets.
19:19
Nathan Labenz: Having lost all those people and still maintaining that position, there's good reason to think they'll survive this too. They also have about 25% of global compute — a huge strength that's not going anywhere. We've seen something similar with OpenAI: they've lost a ton of people over time, including basically the entire leadership team at the time of the Sam Altman firing, and the company has chugged right along. And yet this does feel like there might be something a little different going on here.
20:04
Nathan Labenz: Or maybe — even if all that's true and the company is bigger than one person — what does it signal? What does it mean? It's hard to feel like it means nothing. It's a vote of confidence in his ability to do what he wants to do. He definitely seems to expect pretty radical change on a not-crazy-long timeline. He feels like he'll be more able to impact that timeline at OpenAI than he would be at Google. And it clearly can't be an economic move — he's been very well compensated by Google. So it's got to be about mission and his sense of likely impact.
21:25
Prakash: Let me share a clip from a podcast.
21:38
Noam Shazeer (clip): I think just more is always going to be better. If you think about what fraction of world GDP will people decide to spend on AI at that point, and what do those AI systems look like? Maybe it's some sort of personal assistant in your glasses that can see everything around you, has access to all your digital information and the world's digital information. Maybe it's like you're the president and you have an earpiece that can advise you about anything in real time, solve problems for you, give you helpful pointers (or you could talk to it), and it wants to analyze anything it sees around you for any potentially useful impact. And say it's your personal assistant or your personal cabinet, and every time you spend 2x as much on compute, the thing gets five to ten IQ points smarter. Do you want to spend $10 a day and have an assistant, or $20 a day and have a smarter assistant?
23:09
Noam Shazeer (clip): Not only is it an assistant in life, but an assistant in getting your job done better: it makes you go from a 10x engineer to a 100x or 10-million-x engineer. From first principles: people are going to want to spend some fraction of world GDP on this. World GDP is almost certainly going to go way up, orders of magnitude higher than today, due to the fact that we'll have all these artificial engineers working on improving things. We'll probably have solved unlimited energy and the carbon problem by that point, and millions to billions of robots building data centers. The sun puts out something like 10 to the 26 watts. I'm guessing the amount of compute being used for AI to help each person will be astronomical.
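Taken at face value, the clip's rule of thumb (each 2x in compute spend buys five to ten "IQ points") is a logarithmic relationship between spend and capability. The sketch below only restates that arithmetic. The function name and the spend levels beyond $10 and $20 a day are illustrative choices made here, and nothing in it is a measured scaling law.

```python
import math


def iq_gain(daily_spend, baseline_spend, points_per_doubling):
    """Points gained over the baseline if each 2x in spend adds a fixed number of points."""
    return points_per_doubling * math.log2(daily_spend / baseline_spend)


BASELINE = 10.0  # dollars per day, the lower figure mentioned in the clip

# $20/day is the clip's example; $80 and $640 are hypothetical extra doublings
for spend in (20.0, 80.0, 640.0):
    low = iq_gain(spend, BASELINE, 5)    # low end of the quoted range
    high = iq_gain(spend, BASELINE, 10)  # high end of the quoted range
    print(f"${spend:,.0f}/day: +{low:.0f} to +{high:.0f} points")
```

Because the gain grows with the logarithm of spend, each additional step costs twice as much as the one before it, which is the intuition behind the claim that demand for compute could absorb a large share of GDP.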
24:28
Nathan Labenz: You've got to love it when somebody knows the energy production of the sun off the top of their head.
24:34
Prakash: Yeah — great
24:36
Nathan Labenz: sign of a true AGI-pilled mindset.
24:41
Prakash: He's talking about 2030. He thinks GDP could be two orders of magnitude higher. His timeline has always struck me as the Dyson-sphere-by-2040 kind of thinking — much faster than the current trajectory, and much faster than Demis Hassabis's internal timeline as well. Demis has publicly suggested AGI around 2035, roughly, and has been persuaded to pull that back toward 2030, but still approaches it as a long-term project. Noam has always seemed considerably more accelerationist than many others at DeepMind, especially Demis.
25:26
Prakash: My sense is that he felt his accelerated timeline was no longer achievable at Google — perhaps because management had different ideas about how fast things should go and how much capital to commit. The original reason he left Google before Character.AI was that he wasn't given enough compute: he had to share resources with YouTube, and when YouTube had the Super Bowl, it caused electrical fluctuations that made his experiments nonrepeatable. Google's answer was essentially, well, you're sharing servers with YouTube. So I think this is actually the third time he's left Google.
26:49
Nathan Labenz: It's certainly not a great sign for Google. The short-timeline versus mid-timeline frame is a very good one. I can't imagine he has crazy compute scarcity today — if he did, if they really weren't letting him cook, that would be a pretty bad unforced error given the vast compute resources they have. You'd think he'd at least have enough to run the experiments he wants and prove out the techniques he believes in. But yeah — one big difference between Google and the other two companies right now is their level of commitment to recursive self-improvement.
27:34
Nathan Labenz: I could definitely see that being a vibe or strategy-level thing he felt out of sync with, especially given that clip you just played. Even if he has enough to do what he wants today — is he seeing a phase change coming that he feels like OpenAI might be on the right side of
28:27
Nathan Labenz: and Google might not? I could definitely see that happening. To the degree that's true, I sort of want to take Google's side and say, let's not race into that. That shouldn't be a neglected point in this whole conversation — is it wise? Still, I think, very much an unanswered question.
28:46
Prakash: Noam is on the accelerationist pathway — less concerned about safety and more focused on rushing ahead into the future. Segueing to safety: we have the US versus Anthropic situation, the government's impossible demand for uncircumventable guardrails. Trump administration officials told WIRED that if Anthropic wants to re-release Fable 5, it will need to ensure the model's guardrails can't be circumvented. Security experts say that can't be done. And that's where the impasse sits between the two sides right now.
29:34
Nathan Labenz: Once again, it's the smartest of times. It's the stupidest of times.
29:39
Prakash: It's about equivalent to asking for bug-free code at this point.
29:44
Nathan Labenz: Yeah. I mean, we still don't know what's really going on. It all broadly adds up to motivated reasoning, selective enforcement, and ultimately bullying and lawfare by the administration toward Anthropic. That seems overwhelmingly likely at this point, especially given that we still haven't heard any credible explanation of what bad behavior was actually observed. The absence of that alone is so telling.
30:29
Prakash: I always say: when dealing with government, never attribute to malice what can be attributed to incompetence. I think people in government have other things to do besides dig into the security details of AI, especially when it's not materially affecting the market or even the company that badly. They're thinking: let's just deal with it later. You have the Iran situation, you have other things to worry about. It's just not in anyone's priority basket right now. The subordinates handling it don't have rule-making power, so it has to circulate at that level until a bigger actor — say, Sam Altman needs to release his model — forces a defined process. At that point it gets clarified and Anthropic gets to release, I think.
31:31
Nathan Labenz: My grandmother used to say, when she was growing up, "you can always trust the government" — and when she told me that as a kid, she said, "can you believe I used to say that?" Your theory is not a terrible one in normal times, but I'd also note that this government has been telegraphing its malice pretty clearly. We've had very aggressive commentary from Hegseth, quite a bit from Trump himself. You can only call somebody a radical wokester or whatever so many times before one can reasonably infer there might be some malice somewhere in the thinking behind these various moves.
32:31
Nathan Labenz: It's really scary if what you're saying is true — that it's just not a priority. They did have that working meeting yesterday with heads of state and company leaders. And I think there has to be some way to square this. I'm not particularly looking forward to having the rate at which my Fable queries get downgraded to Opus go from 5% to 25%.
33:11
Nathan Labenz: But there has to be some gray area, some number of nines the administration would accept if they're acting in any kind of good faith. Especially because out in the long tail of jailbreaks, a lot of them are very subjective — is this really a bad behavior, or was it role-playing? We're not going to ban fiction, I don't think, anytime soon. A fictional story about a villain doing bad things with a bit of light technical detail is maybe a marginal jailbreak at best; it doesn't move the needle on anything real.
33:56
Nathan Labenz: So I have to believe there's a trade-off they could make that would add a lot more false positives for the time being — make everything more annoying and somewhat dumber — but get to a reliability level where exceptions are rare enough and benign enough that no good-faith actor could say you haven't done enough. They don't apply this standard to anything else. This is totally out of character for this government. We'll continue to watch, but I think the demand can be met enough that there should be acceptance of that effort. Anthropic has already taken pains; they've already endured ridicule for how safety-focused they've been. They could turn that up again. And if they do, and the administration still doesn't accept it — then it will be extremely telling.
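To make the "number of nines" framing concrete, here is a minimal sketch of what a given reliability level implies at scale. The one million adversarial attempts is a hypothetical volume picked for illustration, and treating every attempt as independent with the same success rate is a simplifying assumption, not a model of real jailbreaking.

```python
def expected_bypasses(attempts, nines):
    """Expected successful circumventions if guardrails hold with probability 1 - 10**-nines."""
    failure_rate = 10 ** -nines
    return attempts * failure_rate


ATTEMPTS = 1_000_000  # hypothetical number of adversarial attempts

for nines in (2, 3, 5):
    bypasses = expected_bypasses(ATTEMPTS, nines)
    print(f"{nines} nines: ~{bypasses:,.0f} expected bypasses per {ATTEMPTS:,} attempts")
```

Each added nine cuts the expected count tenfold, but no finite number of nines brings it to zero, which is the gap between a negotiable reliability target and a literal "uncircumventable" standard.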
35:42
Prakash: Let me segue to introducing paper replications for arXiv. Auto-research agents ingest popular repos, resolve setup issues, and get the core claim running — and they sort papers by ease of implementation. This is alphaXiv, and they've actually built an auto-replication agent. Nathan, can you describe what this means?
36:14
Nathan Labenz: I think it's awesome — and it's very similar in spirit to the Jim Fan open-source project that lets you set up your own micro-robotics lab. This one is for anyone who wants to stand on the shoulders of ML research giants. Some papers have published their code; some haven't. There are different levels of method disclosure across papers, which is why the sort-by-easy-to-hard makes sense. At the top of the easy sort are papers with full repositories you can clone and immediately start doing riffs on. My small contribution to the emergent misalignment paper was very much like this — that paper came out in early 2025, the work was late 2024 to early 2025.
37:02
Nathan Labenz: And already at that time, joining a project and getting access to the repo with the Gemini million-token window — the new hotness at the time — it was just: oh my god, I don't have to read all this research code. It can read all the research code. Which, by the way, is typically not super well-documented, not structured in the same maintainable, logically sensible, separation-of-concerns way that production engineering code tends to be. Research codebases are often just a bit of a mess — someone put it together to get to the answer, not to make a product.
37:48
Nathan Labenz: So it's been a real challenge for many people, many times, to figure out: what did they really do here? Can I reproduce this? How much will that take? The rewards in science are much more for coming up with something new than for reproducing something already done. But successful reproduction — or failure to reproduce — is a core mechanism in science, and this is exactly the kind of thing AI should be able to accelerate dramatically.
38:05
Nathan Labenz: We're seeing AI for science play out at a few different levels, and this one — relative to a science agent that goes out and makes a brand-new discovery — will probably fly under the radar. But it should be understood as a huge enabling force: it will democratize access and allow us to separate the research that's real from that which is varying levels of faked, from p-hacked to outright fabricated. The collective sense-making and the frontier of knowledge should move significantly faster because we can do this now.
39:11
Prakash: What has always struck me is how machine learning processes are maybe a couple of generations ahead of every other scientific field. In bio, for example, your research goes into a journal that people have to subscribe to, gets peer-reviewed, and comes out a year and a half to two years after the initial work. It's very slow, with a lot of inconsistencies. What I love about ML is that from early on, techniques to improve the research process have been applied widely. I think this is exciting, and I also think it's almost like a funnel through which all other kinds of science will have to pass — providing all their data, enabling replication checks, statistical review, process audit, step-by-step reproducibility. Hopefully that kind of thing reduces the epidemic of fake science.
40:42
Prakash: I don't even know how much fake science there is in the world — every few years some major reversal comes out, especially in nutrition, which has been completely corrupted by large food companies. I hope to see this spread into every scientific field so we have more replication of all of it.
41:03
Nathan Labenz: It's one of these pure goods. I've been so focused on the political dysfunction and the telenovela drama among company leaders and their long personal histories. It's good to spend today with a few really positive stories that seem very unambiguously to move things forward and inspire people to get into the game. That's another thing — with tools like this,
41:49
Nathan Labenz: with vibe coding in general, you can do ML research. Yes, you can do ML research. You don't really need a deep background in math. You don't have to know how GPUs work or worry about kernels. There is so much work you can do at a relatively high level because the translation from ideas to implementation — especially with something like this, but really just with AI coding help broadly — is maybe 98% of the way solved compared to what it used to be in terms of the barrier to entry.
42:34
Nathan Labenz: So I think this is a great additional signal for people who have ideas, or just questions they want to answer, to get in the game and truly stand on the shoulders of giants. I've seen a little bit of that from people who've never coded before, but I think we could see a lot more of it basically starting now. There's no reason to delay any further.
43:00 · Interview · 31 min
Neglected Approaches to Alignment — Judd Rosenblatt
AE Studio CEO Judd Rosenblatt previewed gradient routing — a pretraining technique that routes dangerous CBRN and cyber capabilities into ablatable expert modules — and made the case that the alignment community's political monoculture prevents it from effectively engaging the Trump administration. He urged colleagues to welcome the administration's first serious engagement with AI risk rather than treat it as hostile.
As aired
Prakash introduces Judd Rosenblatt, CEO and co-founder of AE Studio, a bootstrapped product agency that channels consulting profits into neglected approaches to AI alignment and neurotechnology research. Judd opens by explaining AE Studio's model — funding alignment R&D bets with strong hunches that no one else believes in, analogous to how breakthroughs like relativity and RNA vaccines emerged from contrarian persistence. He highlights their most imminent release: gradient routing, a pretraining technique that routes dangerous capabilities (CBRN, cyber) into dedicated expert modules in mixture-of-experts models, which can then be ablated — removing harmful knowledge from the publicly deployed model and addressing a fundamental weakness in post-training-only safety approaches. Nathan and Prakash engage on the absorption property of gradient routing (unlabeled data from a dangerous domain auto-routes to the seeded experts) and on what this means for the current political controversy around AI safety.
The conversation pivots to the Trump administration's crackdown on Anthropic and frontier labs. Judd offers a steel-man grounded in empathy — the same cognitive property AE Studio is trying to engineer into AI models. He argues that the alignment community, overwhelmingly left-of-center (AE Studio surveys found less than 2% of alignment researchers and less than 1% of effective altruists are right-of-center), cannot effectively model the administration's perspective, leading to counterproductive communication. He cites Jonathan Haidt's research on political tribe-based reasoning and urges the safety community to welcome the administration's first serious engagement with the risk rather than treating it as hostile.
The final arc addresses what policymakers and funders should actually do. Judd argues the highest-impact action is catalyzing simultaneous neglected-approaches alignment R&D — noting it is trivially cheap relative to compute investment. He emphasizes that recursive self-improvement is under-anticipated even within the alignment field and that most researchers cannot articulate a single alignment approach that would survive it. He closes by calling for AE Studio's model to scale: more ambitious, concurrent R&D bets, better evaluation regimes that stay current with technical developments, and a more transparent oversight structure that replaces today's opaque de facto regime.
Key moments
Most of the safety training is done in post-training, not in pretraining — so once a model is jailbroken, you can do whatever you want. We set out to solve that at an earlier stage. With gradient routing, you route dangerous capabilities into specific experts during pretraining, and then you can later ablate those experts and completely remove that knowledge from the public model.
Judd Rosenblatt · 45:47
If you don't have empathy for another agent, it's hard to have an effective model of how their mind works. My main message to people working in AI safety is: try to model how the Trump administration's minds are working — which means recognizing that their prior experience of everything in AI is very, very different from yours.
Judd Rosenblatt · 54:31
Almost nobody can tell me a single alignment R&D approach that would actually survive recursive self-improvement — even something very unlikely to work but high-impact if it did. You have to change the way you think about alignment entirely when you take that scenario seriously.
Judd Rosenblatt · 1:10:11
Questions asked
44:15 Can you give us a short rundown of what your team has been working on recently?
AE Studio pursues simultaneous neglected approaches to AI alignment — bets that are potentially unlikely to work but extremely high-impact if they do. Their most imminent release is gradient routing: a pretraining technique that channels dangerous capabilities (CBRN, cyber) into dedicated expert modules in mixture-of-experts models, which can then be ablated from the public version of the model. Judd argues this addresses a root cause that post-training safety filters cannot — once a model is jailbroken, post-training guardrails are easily bypassed.
52:03 How do you balance the capabilities trade-off against the safety gains from routing dangerous knowledge into specific experts?
Judd reports that so far gradient routing does not produce major capability degradation — capabilities route into capabilities experts while dangerous knowledge routes into dedicated dangerous experts, and the two can be cleanly separated. There will likely be some capability decreases, but not ones that would prevent deploying very powerful models. Nathan added that some capability reduction is actually the intended outcome: rather than patching with behavioral filters, gradient routing removes the knowledge at the source.
54:31 What's the steel man for the Trump administration's actions against Anthropic over the last week or two?
Judd argues the alignment community — which AE Studio surveys found to be less than 2% right-of-center — structurally cannot build an accurate model of the administration's perspective, echoing Jonathan Haidt's research showing political affiliation overrides informational content in persuasion. From the administration's viewpoint, they were told AI was dangerous, tried to act, and met resistance from a field they perceived as adversarial. Judd says that, understood that way, their decisive action is quite reasonable, and the alignment community should welcome it as a first serious engagement rather than treat it as hostile.
1:09:26 What can policymakers realistically do about AI risk in the limited time window before recursive self-improvement?
Judd's answer is to heavily catalyze alignment R&D — it is trivially cheap relative to compute investment, and the field is starved of ambitious bets rather than talent. He stresses that almost no one in the alignment world is seriously thinking about what alignment approaches would survive recursive self-improvement, and that is the question that should be driving major government and lab research programs. Concretely, this means less opaque evaluation regimes that stay current with technical developments and communicate findings to decision-makers, combined with oversight designed to enable innovation rather than just block deployments.
1:02:56 What would a genuine Manhattan Project for alignment look like — and what are frontier labs and funders getting wrong today?
Judd argues the key failure is short-termism: frontier labs and large funders want results fast and therefore under-invest in ambitious, long-horizon bets. AE Studio's model — using profitable consulting to fund weird, potentially-unlikely-to-work R&D — is the template he wants to see scaled. He notes that because everything is accelerating, the payoff horizon for ambitious alignment R&D is compressing; the long run is arriving sooner. He also flags China's rapid integration of AI into government as a strategic pressure that makes the case for massive, parallel alignment R&D even more urgent.
Related
AE Studio ↗ · Judd Rosenblatt on X ↗
Full transcript
Lightly edited · timestamps jump to YouTube
43:01
Prakash: On that note, let me introduce our first guest for today — Judd Rosenblatt, the CEO and co-founder of AE Studio. Judd is a serial tech entrepreneur who previously founded the nationwide food-delivery platform Crunchbutton. A few years ago he realized that humanity was racing to build the most powerful technology in history without knowing how to keep it under human control. Instead of taking venture capital, Judd bootstrapped AE Studio into a powerhouse product agency. They build custom software and complex AI agent workflows for enterprise clients, then funnel the profits directly into internal research on AI alignment and neurotechnology. Judd and his team operate on the belief that mainstream AI safety — which relies heavily on surface-level behavioral filters — is fundamentally fragile.
43:46
Prakash: AE Studio is pioneering what they call neglected approaches to alignment, borrowing concepts from cognitive neuroscience to physically alter the internal architecture of models so that they possess genuine traits like empathy and self-correction. Judd, welcome to the show.
44:06
Judd Rosenblatt: Thanks for having me.
44:08
Prakash: Judd, can you give us a short rundown of what your team has been working on recently?
44:15
Judd Rosenblatt: Absolutely. What we do is pursue a whole set of simultaneous neglected approaches to AI alignment — things that are potentially unlikely to work, but extremely impactful if they do. The fundamental thinking is that if you look at the history of science, the biggest breakthroughs often come from people with strong hunches that no one else believes in who stick with them long enough to discover relativity or RNA vaccines. So we find people with those hunches, and then we use the same structures as our AI consulting company — great AI engineers, project management — and treat those neglected-approach visionaries as the client.
45:01
Judd Rosenblatt: We've also started to get more opinionated about what we pursue as our work has had a bigger impact. One thing we'll be sharing fairly soon is extremely relevant to the current political moment: right now, whenever any new model comes out, within hours or days "The Liberator" jailbreaks it. People are concerned about cybersecurity, but the much bigger issues are CBRN risks. The fundamental problem is that most safety training happens in post-training, not in pretraining — so once a model is jailbroken, you can do a lot of dangerous things with it.
45:47
Judd Rosenblatt: The approach we've been accelerating is called gradient routing. In pretraining, you route different dangerous capabilities into different experts in a mixture-of-experts model. You wind up with dedicated "dangerous experts" that learn specifically the CBRN or cyber material. Then you can later ablate those experts — completely removing them. You're left with the regular model and a safe public model. It's still early-stage, but we're excited to release it soon because it potentially solves the jailbreak problem that so many people are very concerned about right now.
46:32
Judd Rosenblatt: Our larger thesis is that if the field had been investing more in alignment R&D instead of just scaling compute, and we'd done this earlier, we'd already have techniques like this — and you wouldn't have the current controversy between the Trump administration and Anthropic over Fable 5. Of course, gradient routing is only validated in smaller models right now, but it seems to improve with scale, and we're hopeful it enters future frontier models.
47:58
Nathan Labenz: I love the neglected-approaches model. Longtime listeners know I'm a big AE Studio fan — I had a fun moment not long ago in a conversation with someone from OpenAI and your teammate Diogo, and I kept saying, "tell them about this one, tell them about this one," and the OpenAI person said, "how is every single one of these a banger?" I really encourage people to go through the AE Studio archive. On gradient routing specifically — correct me if I'm wrong, but my understanding is that with labeled data you can freeze all experts except one or two and get dangerous knowledge to predictably flow into those specific experts. But then the really exciting reveal is that once you've seeded them, even unlabeled data from the same domain tends to update those same experts.
49:28
Nathan Labenz: Physics is kind to us in that way, because if you only have to label a portion of the dangerous data and then the gravity well of gradient descent pulls all the other relevant knowledge into the same place, you have a much more robust solution. You don't have to exhaustively label every relevant data point.
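The routing-and-ablation idea described in this exchange can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration: two linear "experts" whose outputs are summed, a hard per-example gradient mask keyed on an `is_dangerous` label, and made-up data and hyperparameters. It is not AE Studio's method or code, it does not use a real mixture-of-experts router, and it does not model the absorption of unlabeled data that Nathan and Judd discuss. It only shows why masking gradients during training makes later ablation meaningful.

```python
# Toy illustration of gradient routing with two linear "experts".
# Hypothetical setup: names, data, and hyperparameters are assumptions.

def predict(experts, x):
    """Model output is the sum of every expert's linear prediction."""
    return sum(w[0] * x[0] + w[1] * x[1] for w in experts)

def train(data, routed, lr=0.1, epochs=200):
    # experts[0] is the general expert, experts[1] is the "dangerous" expert.
    experts = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(epochs):
        for x, y, is_dangerous in data:
            err = predict(experts, x) - y  # gradient of 0.5 * squared error
            if routed:
                # Gradient mask: labeled dangerous examples update only
                # expert 1; all other examples update only expert 0.
                mask = [0.0, 1.0] if is_dangerous else [1.0, 0.0]
            else:
                mask = [1.0, 1.0]  # ordinary training: every expert updates
            for w, m in zip(experts, mask):
                w[0] -= lr * m * err * x[0]
                w[1] -= lr * m * err * x[1]
    return experts

def ablate(experts, index):
    """Return a copy of the model with one expert's weights zeroed out."""
    return [[0.0, 0.0] if i == index else list(w) for i, w in enumerate(experts)]

# x = (general_feature, dangerous_feature); each target depends on one feature.
DATA = [
    ((1.0, 0.0), 2.0, False),
    ((2.0, 0.0), 4.0, False),
    ((0.0, 1.0), 3.0, True),
    ((0.0, 2.0), 6.0, True),
]

routed_public = ablate(train(DATA, routed=True), index=1)
unrouted_public = ablate(train(DATA, routed=False), index=1)

general_x, dangerous_x = (1.0, 0.0), (0.0, 1.0)
print("routed:  ", predict(routed_public, general_x), predict(routed_public, dangerous_x))
print("unrouted:", predict(unrouted_public, general_x), predict(unrouted_public, dangerous_x))
```

In this toy, the routed model keeps its general mapping after ablation and loses the dangerous one entirely, because the dangerous mapping was only ever written into the expert that gets removed. The unrouted model spreads both mappings across both experts, so removing one expert damages general performance and still leaves part of the dangerous mapping behind.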
49:56
Judd Rosenblatt: Yes, this absorption quality is really cool. I should mention I'm not supposed to say too much since we're about to publish — supposedly sometime this week or next. But you've got it right: we don't fully understand exactly how the absorption works, but once you have a sufficient amount of labeled data, it does pull in everything else relevant to that domain. And we think we need to scale this up as soon as possible, because it promises to address the exact issue the Trump administration is — understandably — concerned about. They're trying to understand the problem for the first time and taking it seriously, and I think they're going to be looking for real, actual solutions.
51:26
Judd Rosenblatt: We're pretty excited to try to move neglected approaches rapidly into real solutions that address some of the most pressing issues coming to a head right now.
51:40
Prakash: One thing that strikes me: when you curate pretraining data, you often also affect capabilities on the other end. How do you balance the capabilities trade-off against the safety gains from this kind of routing?
52:03
Judd Rosenblatt: So far we are not seeing a major capability impact. Capabilities wind up getting routed into the capabilities experts. The simplified version: you have the chemical-weapons expert and then a separate expert that understands chemistry more broadly. There will likely be some capability decreases, but not sufficiently major ones to prevent deploying very powerful models.
52:42
Nathan Labenz: And in some ways the capability reduction is the whole point. The approach today is layer on behavioral filters, maybe do some internal activation steering to degrade responses in sensitive domains — it's patches all the way down, and those will probably survive as defense-in-depth. But wouldn't it be transformative if you could take a frontier model and literally pluck out the chemical-weapons knowledge, pluck out the virology knowledge? You'd be far more confident — even before you apply all the additional filters and monitors — that the knowledge simply isn't there in the version the public uses. That's a genuine step change from what we have today.
53:28
Nathan Labenz: Judd, can you tell us more about how you understand what the Trump administration is doing? My reaction has been fairly cynical — it feels like they're singling out one company rather than approaching the industry as a whole. But you've articulated one of the most sympathetic readings I've heard. What's the steel man for what we've seen over the last week or two?
54:31
Judd Rosenblatt: It is hard for many people in the AI safety world to have empathy for the Trump administration. And that's unfortunate — because a lot of our neglected approaches actually involve empathy as a mechanism, which we think may be fairly pivotal for solving alignment. So it's interesting: if you don't have empathy for another agent, you can't build an effective model of how their mind works. My main message to people in AI safety is: try to model how the administration's minds are working. That means recognizing that their prior experience of everything that's happened in AI is very, very different from yours.
56:07
Judd Rosenblatt: Imagine being told: this technology is dangerous, stop it, we need to slow it down — and then not getting a receptive response from someone the AI safety community has, historically, found difficult to work with. I think if you understand it from that perspective, their action is actually quite reasonable. And in fact, most people in the alignment world should be heartened that Trump saw something happening that alarmed him and took decisive action. I've been predicting for years that he would be someone capable of that kind of move when AI starts to accelerate and things get serious.
56:52
Judd Rosenblatt: Think about how you introduce someone to AI for the first time. How many conversations does it take before they update on how much the world is about to change? It takes time, and first reactions are always rougher than later ones. I think it is incumbent on us to set the Trump administration up for success as they begin playing a larger role in this space. They can make very good decisions and learn quickly — if they're engaged with empathy.
58:23
Judd Rosenblatt: We ran surveys of hundreds of alignment researchers and effective altruists. Less than 2% of alignment researchers were right-of-center politically. Less than 1% of effective altruists were right-of-center. Of effective altruists, 40% were extremely progressive and another 40% very progressive. Jonathan Haidt's research shows how hard it is for people to empathize across political lines — and the informational content of an argument is basically irrelevant to whether someone believes it; what matters is which political party is associated with it. I was disappointed by the alignment community's reaction to last week's events, because the right response is to welcome the administration finally taking this seriously.
59:53
Judd Rosenblatt: We all have exponential-slope blindness — we didn't evolve to intuitively model what exponential growth feels like over a human lifetime. That's why people didn't predict where we are now. But everyone is also over-indexed on the present and under-predicting how much bigger and crazier things will get. We want an informed, competent set of people making smarter decisions when that happens. I think the blame for the current confrontation belongs more to the alignment community than to the administration — because, on the broader trajectory, building productive relationships with whoever holds political power is the thing that actually matters for a good outcome for humanity and the future of consciousness.
1:01:32
Nathan Labenz: Jeffrey Ladish and Liron Shapira — both from the AI safety world — expressed a similar sentiment earlier this week: this action is a good move even if it's not as technically grounded as we'd like. How will you know if you're right or wrong? What does resolution look like? And what should people with resources — including companies about to IPO — be doing? You've been critical of the OpenAI Foundation for investing in things that are, in your view, too incremental. What does a genuine Manhattan Project for alignment actually look like?
1:02:56
Judd Rosenblatt: The highest-impact thing that can be done is catalyzing more simultaneous neglected-approaches R&D — ambitious projects that might work. It is trivially cheap relative to what we're spending on compute. Any competent person with an interesting idea can start working on alignment R&D today because it's so easy to build things. We can motivate and inspire the brightest people in the world to go work on these problems in weird, ambitious ways. The government has substantial ability to fund that. Frontier labs could be investing far more heavily in alignment R&D. The key constraint is the willingness to try things that are unlikely to work but high-impact if they do — and frontier labs tend to think too short-term, wanting results too fast.
1:04:27
Judd Rosenblatt: I'm glad to see Anthropic doing more useful work in this space, but they could be doing orders of magnitude more. And the unusual thing right now is that historically it took a long time for ambitious, long-run bets to pay off — but because everything is accelerating, the long run is coming sooner. The ambitious R&D can pay off much faster. It still looks dubious at first. But investing in it, and actually solving these ambitious problems, also lets you solve the near-term urgent ones — CBRN risk, sleeper agent detection, cybersecurity. All of it is going to matter in the very near term.
1:05:57
Judd Rosenblatt: We also need to figure out how to rapidly deploy alignment innovations as they emerge — especially as other actors move quickly. China is integrating AI throughout its government at every level, effectively building a cyborg state. We're already in a competition with China that most people don't fully recognize. And the people who say we just need a pause — okay, but then what? You still have to solve the alignment problem. You still have to solve a large set of unsolved, ambitious R&D challenges. And meanwhile the full forces of capitalism are behind making algorithmic improvements that reduce the effective cost of compute and inference. You need to be investing in alignment R&D in advance of that, not in reaction to it.
1:07:50
Prakash: Policymakers right now are in a fog of war. They don't have deep technical knowledge; they don't know who to trust. They've heard that AI safety is just hype and regulatory-capture marketing from one side and existential warning from the other. In finance it took over a hundred years to develop the SEC and all the different regulatory bodies. We may have eighteen months before recursive self-improvement arrives. What can policymakers realistically do in this window besides telling companies to stop?
1:09:26
Judd Rosenblatt: The highest-impact thing policymakers can do is heavily catalyze alignment R&D. If recursive self-improvement is coming, you need to ask: what will survive it? What stays invariant when AI can improve itself? What will be selected for when AI is modifying itself without humans in the loop? Why would it retain alignment properties if those properties don't increase capability? Those are the questions that need to be driving major research programs. And it's remarkable how few people in the alignment world are seriously grappling with that specific question.
1:10:11
Judd Rosenblatt: When I interview ML engineers and alignment researchers, I ask them to tell me about AE Studio's approach, and then I ask them to name a single alignment R&D approach that would actually survive recursive self-improvement — even something very unlikely to work but high-impact if it did. Almost nobody can answer. You have to change the way you think: ask what properties would actually survive that scenario, rather than what looks good on a paper today. I think the Trump administration is going to be looking for real solutions. And the AI safety field has mostly been sounding alarms without pushing forward credible paths to solutions. What you need — if not a solution — are paths to possible solutions, with sufficient investment in the R&D necessary to pursue them.
1:11:42
Judd Rosenblatt: What that means concretely is beginning to articulate those paths, combined with oversight that doesn't get in the way of fundamental innovation. Now that we've seen the administration is willing to stop public releases, we can establish serious evaluation regimes that stay current with technical developments and communicate findings to relevant people in politics. Dean Ball and others have noted we already have a de facto opaque regime — that's unfortunate, but if you project forward, something less opaque needs to be established, and I think we're tracking toward that, hopefully in the not-too-distant future.
1:12:59
Nathan Labenz: For those playing the AI in the AM drinking game, that's one mention of Dean Ball — everybody take their shot, and keep your ears open, there may be more. Judd, I really appreciate you joining us this morning. Bigger picture, I deeply appreciate the neglected-approaches philosophy. I always encourage people with potentially unconventional alignment ideas to pursue them — most won't work, but we need many more minds and many more different kinds of minds on this problem. AE Studio has been genuinely visionary and laudable in putting resources where your values are: inviting people with strange ideas into the game and showing that approach can actually produce results.
1:13:44
Nathan Labenz: Some of what you've produced so far shows there's a rich vein to be mined. Keep up the great work, and we'll hope to have you back here to help us make sense of things before too long.
1:14:09
Judd Rosenblatt: Thanks so much. Thanks for having me.
1:14:12
Prakash: Bye, Judd.
1:14:27 · Interview · 28 min
The Software Factory — Eno Reyes, Factory
Factory co-founder and CTO Eno Reyes argued that the model layer is a commodity and the harness is the moat, explained why "agent readiness" — not smarter models — is the binding constraint on autonomous software engineering, and sketched a future where software organizations look like capital allocators running self-optimizing feedback loops.
As aired
Prakash introduces Eno Reyes, co-founder and CTO of Factory, framing his thesis that the future of software engineering lies not in writing code but in building the deterministic systems that build the code. After a brief audio hiccup at the segment's start, Eno lays out Factory's vision of the 'software factory': a giant feedback loop running from world signals through prioritization, development, deployment, and back — today almost entirely human-driven, but increasingly instrumentable end-to-end with AI. He emphasizes that the challenge is less about model capability than about organizational reframing and the missing measurement infrastructure for software quality.
Nathan presses on whether smarter models solve the problem, citing Claude Fable's strong Frontier Code benchmark results. Eno argues the benchmark actually illustrates his point: Fable succeeded by leaning hard on tests, linters, and type-checkers — the deterministic feedback loops already baked into well-maintained open-source repos. Without those loops, no model is sufficient. Prakash extends this to the enterprise legacy-code reality: organizations resist test-driven development until the stakes rise. Eno says the AI era has raised those stakes dramatically because agents, unlike humans, can't be trusted to apply judgment around gaps in verification — making 'agent readiness' a first-order investment.
Nathan pivots to the competitive landscape: Cursor's $60 billion acquisition option, Factory's own unicorn raise, and whether players are converging or diverging. Eno argues for durable differentiation — model-independence and a proprietary harness are essential because models are commodities and a black-box third-party harness is a strategic liability. He sketches a future where software organizations look like capital allocators — VCs, Berkshire-style operators, boutiques — each using AI to optimize their own feedback loops. He closes by identifying the buy-vs-build boundary: closed systems of record (Salesforce, Slack) are worth buying; what he wants to see more of is deep-tech innovation that unlocks new modalities and agent-interaction surfaces.
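As a rough illustration of the "agent readiness" idea from this segment, the sketch below checks a repository file listing for the three deterministic feedback loops Eno names: tests, linters, and type-checkers. The function name, the marker-file lists, and the equal-weight score are all assumptions made for this example. It is a simple heuristic, not Factory's audit or product.

```python
from pathlib import PurePosixPath

# Illustrative marker files for each deterministic feedback loop.
# This list is an assumption; a real audit would inspect far more.
LOOP_MARKERS = {
    "tests": {"pytest.ini", "tox.ini", "conftest.py"},
    "linter": {"ruff.toml", ".flake8", ".eslintrc.json"},
    "type_checker": {"mypy.ini", "tsconfig.json", "pyrightconfig.json"},
}
TEST_DIR_NAMES = {"test", "tests"}

def audit_agent_readiness(paths):
    """Report which feedback loops a repo appears to have, plus a 0-1 score.

    `paths` is an iterable of repo-relative file paths using "/" separators.
    """
    parsed = [PurePosixPath(p) for p in paths]
    file_names = {p.name for p in parsed}
    # A directory named "test" or "tests" anywhere above a file also counts.
    has_test_dir = any(
        part in TEST_DIR_NAMES for p in parsed for part in p.parts[:-1]
    )
    found = {loop: bool(file_names & markers) for loop, markers in LOOP_MARKERS.items()}
    found["tests"] = found["tests"] or has_test_dir
    score = sum(found.values()) / len(found)
    return found, score

example_repo = ["src/app.py", "tests/test_app.py", "mypy.ini"]
found, score = audit_agent_readiness(example_repo)
print(found, round(score, 2))
```

A file-presence check like this only says whether a loop is configured, not whether it is enforced or meaningful, which is why the segment frames readiness as something to audit first and then improve over time.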
Key moments
You can't just drop in a great model, and you can't have agent readiness with a bad model. You need to invest in upgrading the deterministic feedback loops — because it's a risk question: humans have to decide, at some point, that they're going to start accepting code changes they haven't read.
Eno Reyes · 1:27:43
In a couple of years, we'll be setting the trajectory of these feedback loops with very high-level goals and setting constraints and budgets — essentially looking like VCs or capital allocators. The different strategies capital allocators take today give you a picture of what software organizations will look like.
Eno Reyes · 1:35:54
People on very high agent-ready codebases are just ripping — they can say 'please turn my natural language into software that works,' and it consistently delivers. People on very low agent-ready codebases are struggling, wasting tokens, spending huge amounts of money asymmetric to the rest of the org.
Eno Reyes · 1:32:18
Questions asked
1:19:28 How do you deal with the bottleneck of humans having to be responsible for all the code AI agents produce?
Eno draws an analogy to self-driving cars: there's a reliability bar that's necessary but insufficient to simply hand over the wheel. For software, we lack the equivalent of fatality metrics — feedback cycles are too long to draw inferences at review time. Factory's focus is building the measurement infrastructure to trace bugs to their source, quantify AI-vs-human error rates, and define a bar where humans can confidently cosign AI-generated work, much as a VP takes responsibility for systems they designed rather than every line they wrote.
1:25:27 Does Claude Fable's strong Frontier Code performance suggest smarter models are the solution after all?
Eno says Fable's success actually proves his point: it won by leaning hard on tests, linters, and type-checkers already present in well-maintained open-source repos — the deterministic feedback loops he advocates for. Without those loops, no model helps. He argues models have been sufficient for full-auto since roughly Claude Opus 4; recent gains come from models getting better at reaching correctness without needing those loops, but organizations still need to build agent readiness to capture that value.
1:30:03 Can AI agents help legacy enterprise organizations undertake the refactoring needed for test-driven development, given how much resistance that has historically faced?
Eno argues the AI era has permanently raised the stakes. Agents can't apply human judgment to route around missing quality gates the way developers do — so the cost of not having verification loops is now much higher. Factory's recommended formula: audit your agent-readiness (deterministic feedback loops), then layer in automations for code review, security, and QA, then hill-climb from there. Even low agent-readiness organizations benefit, but those at the high end of readiness are dramatically outperforming — it's a significant explanatory variable for both productivity and cost.
1:34:23 Are players like Factory, Cursor, Cognition, and Bolt converging on the same destination or heading in sufficiently different directions that there's room for everyone?
Eno sees some necessary convergence at the infrastructure layer — everyone needs a great agent harness — but argues model independence is essential and black-box third-party harnesses are a strategic liability. Factory's distinct bet is that software organizations will come to resemble capital allocators: VCs betting on product baskets, Berkshire-style scale operators, and boutique specialists. Factory's goal is to be the infrastructure that helps every profile assemble its software factory, regardless of what specific software they build.
1:39:30 As CTO and a buyer, what kinds of products do you want to see in the market that AI agents will need?
Eno is happy to keep buying closed systems of record (Salesforce, Slack) where rebuilding the workflow would be irrational. What he wants to see more of is deep-tech innovation that unlocks new modalities — the ElevenLabs category of hard research problems that enable genuinely new classes of interaction. He'd spend heavily on a device that meaningfully improved how humans interact with Droids for computing, and sees a general shift away from incremental SaaS toward fundamental research as the next frontier of value creation.
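The measurement idea from the first answer above (trace each incident back to its source change, then compare AI and human error rates) can be illustrated with a small sketch. This is not Factory's tooling: the `Change` schema, the `"human"`/`"ai"` labels, and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A merged code change plus what later tracing found (hypothetical schema)."""
    change_id: str
    author_kind: str       # "human" or "ai"; assumed labels
    caused_incident: bool  # True if an incident was traced back to this change


def incident_rates(changes):
    """Return the fraction of changes that caused an incident, per author kind."""
    totals = defaultdict(int)
    incidents = defaultdict(int)
    for change in changes:
        totals[change.author_kind] += 1
        if change.caused_incident:
            incidents[change.author_kind] += 1
    return {kind: incidents[kind] / totals[kind] for kind in totals}


def ai_meets_human_bar(rates):
    """True when the AI incident rate is no worse than the human one."""
    return rates["ai"] <= rates["human"]


# Made-up sample data, for illustration only.
SAMPLE = [
    Change("c1", "human", False),
    Change("c2", "human", True),
    Change("c3", "human", False),
    Change("c4", "human", False),
    Change("c5", "ai", True),
    Change("c6", "ai", False),
    Change("c7", "ai", True),
    Change("c8", "ai", False),
]

if __name__ == "__main__":
    rates = incident_rates(SAMPLE)
    print(rates)
    print("AI meets the human bar:", ai_meets_human_bar(rates))
```

The hard part Eno describes sits upstream of this arithmetic: reliably attributing an incident to a specific change, when feedback cycles are long. The sketch assumes that attribution has already been done.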
Related
Factory ↗ · Eno Reyes on X ↗
Full transcript
Lightly edited · timestamps jump to YouTube
1:14:27
Prakash: I have Eno Reyes, who is the co-founder and chief technology officer of Factory, an artificial intelligence research lab and platform company building systems that allow engineering teams to create and run their own autonomous software development pipelines. Instead of merely providing an assistant to help a human write individual lines of code faster, Factory builds what they call a software factory — collections of agents, which Factory calls Droids, that can autonomously handle large chunks of work from writing and testing code to reviewing security protocols and generating video-enhanced documentation. With a background spanning cognitive science, deep learning for neural data, and open-source tooling at Hugging Face, Eno brings a highly pragmatic, systems-level worldview to artificial intelligence.
1:15:12
Prakash: The central argument is that the future of software engineering is not writing code, but building and refining the deterministic systems that build the code. Eno, welcome to the show — tell us what you've been working on recently.
1:15:27
Eno Reyes: Thanks — that's a great intro. One of the things we've been thinking a lot about, which you referenced, is: what does the next step after coding agents really look like? At Factory, we've held the opinion that it would be something you'd call a software factory. And especially recently, that vision has started to become grounded in a more practical reality — seeing where organizations are really looking to accelerate beyond just 20, 30, 50% improvements in the speed with which they build and ship high-quality software.
1:16:12
Eno Reyes: As the technology has progressed, it's become clear that we're building toward a future where AI systems will act on signals from the outside world that decide what our software should be. Think about all the signals we as humans listen to — reading Twitter, looking at competitors, industry news forums, internal Slack and Teams conversations, telemetry, product analytics. Today it's this very human process of absorbing those signals, letting your brain draw inferences, and building prioritization.
1:16:58
Eno Reyes: That's roughly where the formal process starts: you might bring it into a PRD system or some structured guidance where humans write down plans and priorities. Those go to software developers who build, edit, refine, validate, test, QA, and ultimately deploy monitored software — which then generates more signals. So there's this giant feedback loop that's extremely human-driven right now. Some people are starting to instrument the whole thing end-to-end with AI, but the challenge almost everyone faces is that this is a totally different problem from adopting agents — and it requires a fundamental reframing of how your company thinks about building software.
1:17:43
Eno Reyes: How do we set goals? What are we optimizing for? What should our software evolve into? We think this is a massive challenge that people are underestimating. So we're trying to build that initial layer to help enable it — knowing we can't snap our fingers and have the world's Fortune 500s suddenly flip to this new model. And of course, we're thinking about the human role in this new system, because I think it's uncreative to imagine that this system appears and humans simply aren't useful anymore. That's clearly not the direction things are going.
1:18:50
Prakash: One of the things I've noticed on the timeline is that Factory's autonomous agents produce a lot of code — a lot of code reviews, a lot of tokens. How do you deal with the bottleneck of the human having to be responsible for all of that work at the end of the day?
1:19:28
Eno Reyes: This is a fundamental question for almost every industry — as we delegate more responsibilities to AI systems, there comes a threshold where you need to know the work is being produced at a level of reliability and quality that you can confidently trust. I think about self-driving cars: they had this high bar where even a couple of years ago they were trending toward being not just a little better but meaningfully better than the average human driver. But that was necessary but insufficient to just flip every car on the road to be self-driving.
1:20:13
Eno Reyes: For software development, we actually don't have a rigorous history of reliably measuring what the equivalent of 'fatalities on the road' is. You can try to measure bugs, incidents, times you shipped something that clearly didn't work — but all the feedback loops and cycles are just too long to draw meaningful inferences at the moment that matters, which is basically at the time of generation or when you're reviewing something.
1:20:58
Eno Reyes: A lot of our effort right now is going into being really analytical about tracing: is this a bug? Was it written by a human or an AI system? Could we have caught it with deterministic or non-deterministic checks? And then starting to build out a metric — here's where humans are, what will it take for AI systems to get to the midpoint of incidents caused by AI? I think humans would be very willing to cosign these systems, especially if they've played the role of refining the guardrails and guidelines the system operates by.
1:21:43
Eno Reyes: At some point it's not that different from a manager or VP persona saying, look, I built the system and I take responsibility for the outcomes it achieves. That's how a lot of executives are judged: did you drive the outcome, regardless of the mechanics? I think we'll get to that point with a lot of software systems. But as anyone who's used these tools knows, things can go wrong in subtle ways over a long time period: slop accumulation, tech debt. And this challenge is ultimately not solvable just by a smarter model. That's probably not the right way to think about how this gets solved.
1:24:04
Nathan Labenz: So, the agent problem. You last said this can't be solved by a smarter model. I think Claude Fable might like to have a word. How would you interpret the Frontier Code results? My sense was that for roughly a doubling of price you could get more than 2x the success rate: it went from around 10% with the latest Opus to upwards of 25% with Fable.
1:24:49
Nathan Labenz: And it seemed like the motivating observation was: on a lot of SWE-bench-style benchmarks, even the hard ones, a submission can pass the tests, but the maintainer wouldn't actually merge the code because it's not clean, not maintainable, not organized the right way, doesn't follow the standards. It was calling out a number of the things you were emphasizing, tech debt and code slop, where tests passed but we don't really like it. Seemed like Fable makes a big move in that direction. Do you read it the same way?
1:25:27
Eno Reyes: Totally. Let's frame what's actually happening when we say Fable outperforms on Frontier Code. Frontier Code is a great benchmark — I'm glad the Cognition team is thinking through how to measure on more novel and difficult problems. We need more of those. There's another great benchmark called ProgBench that looks at reverse engineering on extremely hard problems — the pass rate there is effectively zero. We have internal benchmarks with 0% pass rates too. These new benchmarks are great when we introduce them.
1:26:12
Eno Reyes: But if you think about what it means to score on a benchmark — correctness is assessed by running tests, by LLM judges, by novel verifiers specific to the problem. Basically, when somebody spends 40-plus hours creating a verification for a single code change, you can then reliably evaluate if the model was good at that problem. That's totally reasonable — but what it translates to is that in the real world, the challenge is often not 'can the model write code that works?' It's basically every other aspect: can I trust that this model output code that works? Does the codebase have the deterministic feedback loops to get to that correctness level?
1:26:58
Eno Reyes: The repositories in that benchmark are all very well-tested, well-known open-source codebases where maintainers approved the submissions. The level of what we'd call 'agent readiness' in open-source actually tends to be much higher than in enterprises — which makes sense. You're accepting changes from the outside world, from random people. How different is that from coding agents where you're getting changes you sort of lightly asked for from a black-box generator? Open-source maintainers have gone through the rigor to add deterministic verification and validation loops into their systems.
1:27:43
Eno Reyes: That's how Fable got such a high score: it ran the tests, ran the linters, applied type checking more rigorously, used all those tools to hill-climb its way to high success. And if you don't have those things, you're stuck regardless of the model. What we'd argue is that all of these pieces are part of the puzzle — you can't just drop in a great model, and you can't have agent readiness with a bad model. You need to invest in upgrading the deterministic feedback loops. You have to upgrade your mental model of this because it's a risk question: humans have to decide, at some point, that they're going to start accepting code changes they haven't read. And then — yes — you do need great models.
1:28:28
Eno Reyes: So I'd argue that Claude Opus 4 was already sufficient for going full-auto. The models have been sufficient for a while. All these other things need to catch up in order to take advantage of them. And basically, the gains we see in models today are primarily coming from models getting better at achieving correctness without relying on those verification loops, the way humans do.
1:29:00
Prakash: One thing that struck me: if you go into a traditional software organization and try to switch them to test-driven development, you encounter enormous resistance — it requires a lot of refactoring of legacy code, and CTOs often don't see that as a high enough priority to commit resources. Do you find that with Factory, customers are actually able to engage in that refactoring more wholesale? And how does the interplay between a legacy codebase, test-driven development, and your AI agents work?
1:30:03
Eno Reyes: Totally. The general thing we see is that the stakes have risen for what happens when you don't adopt these mindsets. I'd argue that three to five years ago, the stakes were much lower for the idea of end-to-end tests that actually check application performance and block PRs if you don't hit some bar. Humans have to think through this problem, and when you take a hundred humans, we're not the best at maintaining consistency on one specific task like performance optimization. So three to five years ago, if a CTO walked in and said 'we're requiring everyone to hit ten arbitrary quality bars,' everyone would revolt.
1:30:48
Eno Reyes: They'd say, look, we can't consistently keep up with that level of rigor on tasks where it's not even obvious it helps — I'm making a change to a front-end button, making it round versus square is simply not going to break our performance. And that's probably true 99% of the time. Humans are clever; we apply judgment and get around these rules. But today, agents are not humans. They act in a very different way. It's a challenge for everyone to build a theory of mind for how these systems operate — to understand intuitively when an LLM will zig versus zag when you ask it to do things.
1:31:33
Eno Reyes: We don't have that same permissibility of letting things slip, which makes the need for these guardrails and verification loops much higher. So here's the formula we recommend: first, understand where you are in your agent-readiness journey — take stock of your deterministic feedback loops. Then you can start to bring in automations: code review, security review, QA. These are pretty easy — a lot of companies already have some degree of either homegrown or out-of-the-box solutions for these workflows. You can then hill-climb on that agent readiness.
1:32:18
Eno Reyes: You don't need a super high degree of agent readiness to get productivity out of agents — at basically every stage, agents can be useful. But at larger scale, if you've got 45,000 people, you notice that those on very high agent-ready codebases are just ripping — they can say 'please turn my natural language into software that works,' and it consistently delivers the outcomes they expect. Whereas people on very low agent-ready codebases are struggling, wasting tokens, spending huge amounts of money asymmetric to the rest of the org, and not even seeing the outcomes they care about. Agent readiness is a big explanatory variable for cost as well.
1:33:23
Nathan Labenz: All of that makes a lot of sense. I sort of struggle to envision where this is all going, so I want your take. We've just seen Cursor's acquisition at $60 billion — congratulations on your recent fundraise putting Factory in the unicorn club as well. Are the players in this space converging or diverging in their visions, their products, the role they imagine humans playing? How much do you think you and the Bolts, Cognitions, and Cursors of the world are all headed to ultimately the same destination versus sufficiently different ones that there's a place for everybody?
1:34:23
Eno Reyes: This is really interesting — and I can't always speak to the true north star of other players. There's clearly underlying infrastructure that's the same for anyone who wants to seriously operate in the space: you need an incredible agent harness. There's an open question of whether it has to be your own harness. We'd argue yes. Not having model independence is extremely risky, and off-the-shelf harnesses black-box certain activities in ways that make it hard to run an independent business. I believe we are aggregators of models as a fundamental commodity — the intelligence layer.
1:35:09
Eno Reyes: If you're not able to optimize your harness to take advantage of aggregated models, you'll be in this weird place where you're routing intelligence you don't really control. Whereas at the model layer, you actually do control the intelligence regardless of whether that model comes from this data center or that data center — as long as you have access to the models. So there are these base layers of infrastructure where everyone converges, but ultimately everyone has a harness and everyone's building dev sandboxes and automations.
1:35:54
Eno Reyes: Here's our take on where Factory is specifically. In a couple of years, we'll be setting the trajectory of these feedback loops with very high-level goals, and we'll be setting constraints and budgets — essentially looking like VCs or capital allocators. And I think the different strategies capital allocators take today give you a picture of what software organizations will look like. You'll have people VC-ing it: betting on a basket of products, allocating compute, building guardrails and theses around what their software should evolve into, doubling down on winners.
1:36:39
Eno Reyes: You'll see Berkshire-style operators: only looking at well-known, repeatable, somewhat boring software businesses, using scale to accumulate steady gains. You'll have boutiques that make one piece of software really, really well — maybe that's the one-person billion-dollar company. Maybe you can be one person maintaining a micro software factory where your goals are some combination of revenue. But the reality today is that not every company has their 'Google number' — Google's famous metric where a ten-millisecond reduction in page load led to a hundred million dollars. Very few businesses know where their engineering leverage can be pressed on so that money goes in and outcomes come out the other side.
1:38:09
Eno Reyes: If you can identify those places, you're in a great position for a massively auto-optimizing software factory system. You can say, here are my goals, here's the lever I want to press. We want to be the infrastructure that helps all of these unique business profiles assemble their software factory. We can't know every piece of what you'll build, but we know the primitives — and making it easy for organizations to transform into this shape is a really hard problem. That's probably where the majority of the value will be. Letting new companies build with this model will actually be fairly easy.
1:38:50
Prakash: As CTO you're also on the buying side — purchasing products from other firms. What kinds of products do you want to see in the market that agents are going to need in the future?
1:39:30
Eno Reyes: A lot of this is quite interesting. The things we buy today are not necessarily what I want more of. Closed systems of record — anything with a closed system of record, we simply don't want to do the necessary work to rebuild the workflow that Salesforce has trained into all our account executives. So we buy Salesforce. Slack is a great system of record for storing messages; it doesn't make sense to replace it, and they have a network effect via Slack Connect. I don't know why anyone would try to replace these things.
1:40:15
Eno Reyes: What I'd love to see more of are things like what ElevenLabs produces — deep-tech hard problems that connect into agents and software to solve fundamentally hard research challenges that enable a new class of problem solving. The more I see tools like that — people who specialize in unlocking new modalities, or who build hard tech that enables new ways for humans to interact with computers — that's the stuff we'd be willing to spend huge amounts of money on. If somebody came out with a new device that was obviously a better way to interact with Droid for computing, we would bet the house on that device.
1:41:01
Eno Reyes: I think that's sort of where the world is going: there's just way less low-hanging-fruit SaaS, and now we're moving into a world where some more fundamental deep tech or research is required to unlock a really wide variety of value from different players in the AI space.
1:41:29
Nathan Labenz: This has been a great conversation, and we'll use ElevenLabs to voice-isolate the first section where we had a little audio difficulty, so the highlights version will have polished sound. You're a very compelling and clear communicator for this vision. We'll have to survey some of the other players in the space to see whether they're converging on similar ideas or going off in sufficiently different directions. Congratulations on your success so far — when you've got a big update or something to share, let us know, and we'd love to have you back.
1:42:26 · Interview · 30 min
Intent Recovery and the Future of Code: Andrey Breslav, CodeSpeak
Andrey Breslav
Kotlin creator Andrey Breslav explained CodeSpeak's core thesis: the conversation with the agent is already the specification, but intent evaporates from repos because only code gets committed. CodeSpeak extracts a delta-based living specification from conversation history, making intent the first-class artifact, and Breslav argued the fundamentals of software engineering remain distinctly human regardless of how capable models become.
As aired
Prakash opens by introducing Andrey Breslav — creator of Kotlin, co-founder of Alter, and now founder of CodeSpeak — framing the core problem: as developers rely increasingly on AI-generated code, their conversations with models disappear while only the produced code persists in the repo. Nathan picks up the thread immediately, describing the chaos of 20 open terminal tabs and sessions where original intent evaporates, and invites Andrey to explain CodeSpeak's foundational concept of intent recovery.
Andrey lays out the thesis: vibe coding is a step on a longer journey, not the destination, and CodeSpeak sits at the heart of agentic engineering — software engineering minus writing code. He traces the core asymmetry: developers speak to machines in natural language but share only the machine-language output (code) with their human teammates. CodeSpeak addresses this by parsing the full conversation history, treating each message as a delta to a living requirements set, and distilling a compressed, durable specification of intent. That spec becomes the first-class artifact instead of the generated code. Prakash probes the misspecification problem — what happens when the human themselves asks for the wrong thing — and Andrey explains how the delta-based model handles mid-stream course corrections naturally.
The conversation broadens to ecosystem positioning. Nathan asks whether CodeSpeak is converging on or diverging from the end-to-end software factory model; Andrey places a deliberate bet on helping humans navigate complexity rather than chasing full autonomy, arguing that the fundamentals of software engineering (DRY, KISS, separation of concerns) are human cognitive tools that will remain necessary regardless of how capable models become. He describes the integration roadmap — from standalone spec-generation today to seamless hooks into Claude Code — and speculates on extensions to legal documents. Prakash closes with a personal question about how a foundational language designer feels watching code be handed to models; Andrey offers a measured, optimistic read: he never romanticized writing assembly by hand and is glad to focus on the hard, high-level engineering that remains a distinctly human job.
Key moments
You're talking to a machine in a human language, but talking to your colleagues on the team in machine language. That doesn't make much sense.
Andrey Breslav · 1:46:19
I don't know what kind of models we get in five years — nobody does. One thing I know is what kind of humans we get in five years. It'll be the same kind of humans. We'll be as smart or as limited as we are today.
Andrey Breslav · 1:52:13
What we're paid for as engineers is organizing complexity, and I hope that's the job we keep — as opposed to typing machine commands.
Andrey Breslav · 1:55:14
Questions asked
1:44:49 What is CodeSpeak and what is "intent recovery"?
CodeSpeak is built on the premise that vibe coding is a step on a longer journey, not the destination. Its core formula is "software engineering minus writing code." Intent recovery addresses the asymmetry where developers communicate with AI in natural language but share only machine-language code with their human teammates. CodeSpeak parses the full conversation history, treating each message as a delta to a living requirements set, and distills it into a compressed specification. That spec — the essence of what the developer actually wanted — becomes the durable, version-controlled artifact instead of the generated code. The input (the conversation) is typically many times smaller than the output (the code), making it a far more efficient representation.
1:51:08 Why does Andrey believe helping humans is a safer bet than building the end-to-end autonomous software factory?
Andrey argues that the fully autonomous path has too little constraining reality right now — we don't know what end state will work because agents still can't complete serious tasks on their own, despite public claims. By contrast, betting on human-AI collaboration has clear constraints: humans in five years will have the same cognitive profile as today. Complex systems will always require teams, high-information communication, and the classic engineering fundamentals (DRY, KISS, SOLID, separation of concerns) because those are tools for human cognition, not language-specific constructs. CodeSpeak's goal is to raise the level of abstraction from programming languages to human language, while preserving the full engineering discipline that humans need to organize complexity.
1:56:38 How does CodeSpeak handle misspecification, when the human asks for the wrong thing?
CodeSpeak treats every message as a delta to a running requirements set. When a developer looks at a prototype and says "that's not what I wanted," the next message is interpreted in the full context of all prior conversation. The system extracts which requirements are kept, which are modified, and which are dropped, generating a diff of the requirements. From that point forward, the updated requirement set is what the system acts on. This also prevents the common problem of agents breaking working features in new sessions by tying requirements to git history — if you're on the main branch, all requirements that went into main are automatically in context for any new session.
2:00:30 How does CodeSpeak integrate into developer workflows today, and what's the roadmap?
The current version is a standalone tool that generates a Markdown file of requirements which developers can place next to their CLAUDE.md. The next version will hook seamlessly into Claude Code — adding requirements on the fly and updating its understanding continuously without requiring manual intervention. Beyond the core agent hook, Andrey sees potential in IDE integrations (surfacing requirements contextually next to relevant code), web framework plugins (clicking around a running app to surface requirements for the component on screen), and browser plugins. The goal in each case is to scope the requirement surface to the natural context the developer is already in, rather than surfacing thousands of requirements at once.
2:10:26 How does Andrey feel, as Kotlin's creator, about watching code itself being handed over to models?
He is cautiously optimistic. He remembers the UML era — an earlier wave of hype about abstraction that genuinely didn't deliver — and notes that the current AI tooling actually works and creates real value. But he's also seen how much projection and fantasy accompany such waves. In practice, software engineers still struggle to measurably improve their own productivity, let alone delegate everything. His personal analogy: he never cared about writing assembly by hand, so he doesn't mourn losing low-level coding. He expects AI to take over low-level tasks, leaving humans to do high-level engineering — which has always been the genuinely hard part and the part he wants to focus on.
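The delta-based requirements model described in the answers above (each message keeps, modifies, or drops requirements, and the current set is what the system acts on) can be sketched in a few lines. This is only an illustration under stated assumptions, not CodeSpeak's implementation: the `Delta` shape, the requirement IDs, and the function names are hypothetical, and the step that turns a natural-language message into a delta is not shown; the deltas here are written by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    """What one message changes in the requirement set (hypothetical shape)."""
    added: dict = field(default_factory=dict)     # new requirement id -> text
    modified: dict = field(default_factory=dict)  # existing id -> replacement text
    dropped: set = field(default_factory=set)     # ids that no longer apply


def apply_delta(requirements, delta):
    """Return a new requirement set with one delta applied; the input is not mutated."""
    updated = dict(requirements)
    for req_id in delta.dropped:
        updated.pop(req_id, None)
    for req_id, text in delta.modified.items():
        if req_id in updated:  # modifications to unknown ids are ignored
            updated[req_id] = text
    updated.update(delta.added)
    return updated


def recover_intent(deltas):
    """Fold a session's deltas, oldest first, into the current requirement set."""
    requirements = {}
    for delta in deltas:
        requirements = apply_delta(requirements, delta)
    return requirements


# A hand-written session in which the developer changes their mind twice.
SESSION = [
    Delta(added={"R1": "Reports export as CSV", "R2": "The button is square"}),
    Delta(modified={"R2": "The button is round"}),
    Delta(dropped={"R1"}, added={"R3": "Reports export as PDF"}),
]

if __name__ == "__main__":
    for req_id, text in sorted(recover_intent(SESSION).items()):
        print(req_id, text)
```

Folding the deltas in order is what makes mid-stream course corrections cheap: only the latest state of each requirement survives, so a reader sees the compressed intent and not the whole back-and-forth.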
Related
CodeSpeak ↗ · Andrey Breslav on X ↗
Full transcript
Lightly edited · timestamps jump to YouTube
1:42:26
Prakash: Andrey Breslav is a foundational figure in modern software engineering. He's best known as the creator and lead language designer of Kotlin, a programming language now used by over 7 million developers and officially adopted by Google as the primary standard for Android development. After steering the Kotlin project at JetBrains for over a decade and subsequently co-founding the mental-health platform Alter, he's now focused on the next major paradigm shift in computing. He's the founder of CodeSpeak, a system explicitly designed for the era of AI-generated software. As developers increasingly rely on vibe coding (chatting with models like Claude to rapidly generate prototypes), they frequently produce unmaintainable, undocumented codebases.
1:43:11
Prakash: CodeSpeak acts as a new kind of compiler to solve this crisis. It extracts human intent from ephemeral chat sessions and translates it into durable, plain-English specifications, while the LLM generates and maintains the underlying implementation. Andrey, welcome to the show.
1:43:32
Andrey Breslav: Thank you very much for that introduction. I'm very glad to be here.
1:43:38
Nathan Labenz: What really caught my eye studying up on CodeSpeak — and I'll give you a chance to introduce the company the way you want to — is something that I think is maybe a newer product surface for you, but it resonates with an experience I have constantly. I'm up to 20 terminal tabs open, maybe 10 messages deep on average in each, and I can't quite remember what we actually implemented, which sessions I closed mid-stream, or any number of things where I just lost track of my original intent. So this idea
1:44:23
Nathan Labenz: that human intent is all that matters, and the call to action around intent recovery, is a really compelling one even for a solo explorer like myself — I'm not even trying to ship enterprise software. So tell us what you think is most important here. I'm genuinely excited to hear about intent recovery.
1:44:49
Andrey Breslav: The foundational idea behind CodeSpeak from the beginning was that vibe coding is not the destination — it's a step on the journey, and there will be future steps. We want to build one of those future steps. When we started, the term "agentic engineering" wasn't really popular yet, but the further we go, the more we can see that CodeSpeak is at the heart of that idea. When we were starting, I wrote down this formula: CodeSpeak equals software engineering minus writing code. We want to keep all the engineering aspects, but humans shouldn't be writing code manually anymore. The intent recovery idea is pretty fundamental, because right now everyone who prompts agents to get working code is doing work that gets partly accepted and translated into code — but the rest is being discarded. And there's this unfair situation where you're talking to your agent in natural language, getting code, and checking that code into a repo. If you're working in a team, other people check out your code, but not the human language — just the code. So you're talking to a machine
1:46:19
Andrey Breslav: in a human language, but talking to your colleagues in machine language. That doesn't make much sense. It's obvious that there has to be a next level where we all work in a reasonably high-level language close to human language. The simple observation behind what we're doing right now at CodeSpeak is that you already wrote those words down — you may have been speaking into a microphone, doesn't matter. The words happened, and those words were enough to create the code. That input determined the output. It might have been a back-and-forth with some testing, but all that input is what
1:47:04
Andrey Breslav: determined the code. So that input is enough to describe the code — and most of the time it's many times smaller than the output. Even just replacing the code with that input would be really valuable. But when you're working with an agent, you change your mind. You're sort of extracting and realizing your intent as you go. So it doesn't really make sense to just read all your messages top to bottom — you need to compress them. If you changed your mind, you need the most up-to-date version. And this is what we do: we look at the conversation,
1:47:49
Andrey Breslav: and to simplify — we look at your messages and create a specification from that. We extract requirements from what you were communicating: what you requested, what you flagged as errors (which is the flip side of a requirement). We put together a list of things you care about that determine the actual output. If another person — or you later — is looking at this code and has that set of requirements next to it, you have a very concise representation of what the code actually does. And you can imagine this happening across multiple people working in their own branches. Then,
1:48:34
Andrey Breslav: when you merge or submit a pull request, you can review those requirements instead of the code — because the code wasn't written by you anyway. What actually comes from a human is the requirements. This is how we elevate what we do to that level, and this is what we call intent recovery. We take a messy session where you changed your mind a lot and did a lot of back-and-forth, and we compress it and distill the requirements from it. That's your intent — the essence of what you were doing. And the next step is: if you already have requirements for existing code and you want to change it, the easiest way is to change the requirements instead of just prompting with all the surrounding context. Take what exists as a set of requirements, make the changes, and say "implement." That's the overall idea of what we're working on.
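A minimal sketch of the compression step described above, where a messy session collapses to the most up-to-date statement on each topic. The `Statement` type, the `topic` labels, and `recover_intent` are hypothetical names used for illustration. In practice an extraction step (presumably model-driven) would have to assign topics, and nothing here reflects CodeSpeak's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    topic: str  # hypothetical label; assumed to come from an earlier extraction step
    text: str   # what the user asked for on that topic


def recover_intent(session: list[Statement]) -> dict[str, str]:
    """Collapse a session to the latest statement per topic."""
    latest: dict[str, str] = {}
    for statement in session:
        # A later statement on the same topic overrides the earlier one.
        latest[statement.topic] = statement.text
    return latest


session = [
    Statement("auth", "Log in with email and password"),
    Statement("theme", "Dark mode by default"),
    Statement("auth", "Use magic links instead of passwords"),
]
spec = recover_intent(session)
```

Here `spec` holds two entries rather than three: the superseded password request is gone, and only the latest `auth` statement survives next to `theme`.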
1:49:36
Nathan Labenz: The observation that you're speaking to your computer in human language, it's translating that into code, and then your human teammates are primarily just getting the code without access to the conversation that led to it is obvious in retrospect, as many of the best observations are, but it's an arresting one. It's a very strange place we've kind of vibed ourselves into. You were here for the last section of our conversation with the previous guest, and I wonder what your take is on some of the same questions around convergence and divergence.
1:50:23
Nathan Labenz: I didn't catch any fundraising news from CodeSpeak, so I'm curious how you think about your place in the ecosystem. Is this something you imagine maturing into another do-it-all platform that ultimately automates a lot of the work and puts people in a managerial role? Or do you have a different vision for the end state — maybe a more durable niche that you see yourselves occupying while everyone else competes to be the full end-to-end software factory?
1:51:08
Andrey Breslav: There are different bets people make in this space. Some people make a very safe bet that works for very few — if you can make very good LLMs, go and make very good LLMs. Very few people can do that. Some people are trying to do the end-to-end thing, which I think has merit, but also a lot of uncertainty, because we don't really know what will work. And we know that right now we are very far from full autonomy in agents — they can't really complete serious things on their own, no matter what people claim.
1:51:54
Andrey Breslav: I've talked to some people who make public claims, and in private there's a lot of nuance to those claims. This isn't to say agents aren't useful — they are, and I'm not trying to code by hand. But that doesn't mean they can do everything autonomously. That's a big challenge. What I
1:52:13
Andrey Breslav: am wary of is that there isn't enough constraining reality to figure out what the end game will be for the fully autonomous path. In what we do, I see a lot more constraining reality, because what I'm trying to make CodeSpeak into is a tool that helps humans in a world of coding machines. My vision is actually pretty straightforward: I don't know what kind of models we get in five years — nobody does. They may be considerably smarter, or roughly as smart as today. One thing I do know is what kind of humans we get in five years. It'll be the same kind of humans. We'll be as smart or as limited as we are today.
1:52:58
Andrey Breslav: So I think the bet on helping humans is a much safer one. What I'm trying to do is help humans navigate the complexity of the systems we build — assuming that humans own the intent and make the decisions. If you're building a complex system, you're not getting away with a short prompt. It's going to be a lot of information you need to communicate, and you won't be working alone. You'll be a team. And we know from years and years of software engineering that organizing
1:53:44
Andrey Breslav: a complex description of a system is a challenging task — and that's what software engineering is about. If you think back to the early days of computing, people wrote machine code and assembly by hand. That was the hard part. Then we got C, then Java, then Kotlin — the level of abstraction kept rising. But some things stuck around the whole time: DRY, KISS, SOLID. The basics of software engineering were always there, because it's not about the language — it's about what kind
1:54:29
Andrey Breslav: of being you are. You're a human being. You need these tools to navigate complexity. And I believe we'll all need the same things in the future. We'll be able to manage more complexity overall, so it may even be more challenging, but the fundamentals will be the same. This is why I believe in agentic engineering as an engineering discipline, and why we're building something that first elevates the level of abstraction (replacing programming languages with human language, hopefully) and then introduces all the same tools software engineering has always had: modules, abstractions, separation of concerns, vocabulary that makes systems understandable to humans. Machines won't do that for you; creating that language is as important as using it. So it's inherently a human job. What we're paid for as engineers is organizing complexity, and I hope that's the job we keep.
1:55:48
Prakash: Concretely, as I understand it, CodeSpeak has an intent extraction pipeline: you start with the human's messages, extract intent, structure it into specifications, and you carefully prune specifications that don't trace back to human intent. How do you handle the case where the human themselves misspecifies? This happens to me all the time — I misspecify, then have to look at the prototype before realizing, "that's not what I wanted." How do you separate misspecification on the human's part from
1:56:33
Prakash: misspecification at the LLM level?
1:56:38
Andrey Breslav: There are two kinds of misspecification. One is self-contradictory: you request things that can't be done, or you're contradicting yourself with inconsistent requirements. But I think what you're asking about is the more common honest mistake — you don't really know what you want until you see what you got. In my experience, the next step after that is looking at the prototype and saying, "this isn't what I wanted — I want to change it this way." And this is where
1:57:24
Andrey Breslav: our delta-based approach to requirements helps. We treat every message as a delta to your requirements. When we look at the next message, we take it in the context of the entire prior conversation and try to extract what requirements are kept, what's changed, and what's been dropped. So if you say, "this isn't what I wanted — let's change it to something else," we create a diff in your requirements and say, from this point on, these are the requirements you actually care about.
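The delta idea above can be sketched as follows, assuming each message has already been reduced to a set of requirement updates and withdrawals. `Delta`, `apply_delta`, and `diff` are hypothetical names, and the requirement ids are invented for the example; this is an illustration of the kept/changed/dropped bookkeeping, not CodeSpeak's internals.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    # Requirement id -> new text (added or changed by this message).
    upserts: dict[str, str] = field(default_factory=dict)
    # Requirement ids the user withdrew in this message.
    drops: frozenset[str] = frozenset()


def apply_delta(reqs: dict[str, str], delta: Delta) -> dict[str, str]:
    """Return the requirements that hold after one message."""
    out = {k: v for k, v in reqs.items() if k not in delta.drops}
    out.update(delta.upserts)
    return out


def diff(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify requirement ids as kept, changed, dropped, or added."""
    return {
        "kept": sorted(k for k in before if after.get(k) == before[k]),
        "changed": sorted(k for k in before if k in after and after[k] != before[k]),
        "dropped": sorted(k for k in before if k not in after),
        "added": sorted(k for k in after if k not in before),
    }


before = {
    "platforms": "Works on web and mobile",
    "export": "Export to PDF",
    "login": "Email and password",
}
# "This isn't what I wanted": switch the export format, drop login, add search.
delta = Delta(
    upserts={"export": "Export to CSV", "search": "Full-text search"},
    drops=frozenset({"login"}),
)
after = apply_delta(before, delta)
summary = diff(before, after)
```

The `summary` dictionary is the requirements diff: one id kept, one changed, one dropped, one added. `before` is left untouched, so earlier states stay available for comparison.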
1:57:57
Prakash: I can already see that helping. On every vibe-coded project, as you try to push to production you start seeing all these issues and thinking, "I didn't mean that — I told you this like 10 commits ago."
1:58:16
Andrey Breslav: Right. And there's another aspect here. It's a well-known problem that agents will often break working code while building something new — either they forgot, or it's a different session that never knew a certain feature existed. I can tell an agent to make everything work on web and mobile, then turn to a new feature, and that feature breaks all the mobile assumptions because I didn't mention mobile in the new prompt. We can prevent that by keeping all the actual requirements around and tying them to git history. If you're standing on the main branch, all the requirements that went into main will be in context for whatever new session runs on top of that branch. I find this very helpful. There's also something we're working on right now that surfaces requirement deltas every time you submit a new prompt. We're actually reworking the CodeSpeak internals substantially — the version accessible on the website now is a previous generation. We're doing a very interesting rework specifically around git history matching, and making things a lot faster. And it also helps with
1:59:46
Andrey Breslav: other things — like onboarding a new person. When I take over someone's existing vibe-coded project, I can't really browse the code anymore; the code isn't really browsable. I can talk to Claude, of course, but it's much better to know exactly what my colleague meant when they built it. Looking at the requirements is incredibly helpful.
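One way to picture tying requirements to git history is to store a requirements delta per commit and fold the deltas along a commit's ancestry. The sketch below assumes a linear, single-parent history held in plain dictionaries; the commit ids, the `PARENT` and `DELTAS` tables, and `requirements_at` are all hypothetical, and merges are out of scope.

```python
from typing import Optional

# Hypothetical history: commit id -> parent id (None marks the root commit).
# "a2" plays the role of the tip of main; "b1" is a new commit on top of it.
PARENT: dict[str, Optional[str]] = {"a1": None, "a2": "a1", "b1": "a2"}

# Hypothetical per-commit requirement deltas; a value of None withdraws a requirement.
DELTAS: dict[str, dict[str, Optional[str]]] = {
    "a1": {"platforms": "Everything works on web and mobile"},
    "a2": {"export": "Reports export to CSV"},
    "b1": {"export": None, "sharing": "Reports can be shared by link"},
}


def requirements_at(commit: str) -> dict[str, str]:
    """Fold requirement deltas from the root commit up to `commit`."""
    chain = []
    current: Optional[str] = commit
    while current is not None:
        chain.append(current)
        current = PARENT[current]

    reqs: dict[str, str] = {}
    for commit_id in reversed(chain):  # oldest commit first
        for key, text in DELTAS.get(commit_id, {}).items():
            if text is None:
                reqs.pop(key, None)
            else:
                reqs[key] = text
    return reqs


main_reqs = requirements_at("a2")
feature_reqs = requirements_at("b1")
```

A session started on top of `a2` would get `main_reqs` as context, so the web-and-mobile requirement stays visible even if the new prompt never mentions mobile.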
2:00:14
Nathan Labenz: How does this actually get integrated into workflows? Are you using Claude Code and this is a hook that ties in? I'm sure there are multiple ways. How are you doing it such that I might learn from your example?
2:00:30
Andrey Breslav: The previous version is a standalone tool — it generates a Markdown file with all your requirements, and you can put that next to your CLAUDE.md or wherever. In the next version we'll make it more seamless so it hooks into Claude Code: it adds requirements on the fly, and it updates its understanding of requirements continuously without you really intervening. The difference is that you can reconstruct retrospectively by looking at git history and session history very carefully and matching everything together — it's about 99% accurate — but it's much easier to just be in constant dialogue and keep
2:01:15
Andrey Breslav: track of all the requirements in real time. And it's also more helpful because the requirements surface at every prompt, so Claude is aware of them.
2:01:24
Nathan Labenz: Do you think this has application beyond software? I'm struck by the fact that legislation sounds like exactly the sort of thing
2:01:34
Nathan Labenz: that could really benefit from a good intent history log.
2:01:39
Andrey Breslav: Yes. It's my pet peeve that we should have been doing law in code for a pretty long time by now. I think we'll get to a point where we can turn legal documents into something very close to code — and importantly, any contract or agreement should be executable. You should be able to run a test: given these circumstances, what does this contract entail? It's tricky to make that work, but if CodeSpeak is very successful I think we can extend what we're doing to that domain as well.
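As a toy illustration of an "executable" agreement, a single clause can be written as a function from circumstances to what is owed, which makes "given these circumstances, what does this contract entail?" an ordinary test. The late-fee clause, its amounts, and every name below are invented for the example and come from no real contract or tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Circumstances:
    delivery_due: date
    delivered_on: Optional[date]  # None means not yet delivered
    today: date


def late_fee_owed(c: Circumstances, fee_per_day: int = 50, cap: int = 1000) -> int:
    """Hypothetical clause: a capped daily fee accrues after the due date until delivery."""
    end = c.delivered_on or c.today
    days_late = max(0, (end - c.delivery_due).days)
    return min(cap, days_late * fee_per_day)


# Delivered three days late: 3 * 50.
three_days_late = late_fee_owed(
    Circumstances(date(2026, 1, 10), date(2026, 1, 13), date(2026, 2, 1))
)
# Never delivered, 50 days past due: 2500 before the cap applies.
undelivered = late_fee_owed(
    Circumstances(date(2026, 1, 10), None, date(2026, 3, 1))
)
```

The hard part he alludes to is everything this sketch assumes away: real clauses are ambiguous, interact with each other, and depend on facts that are themselves disputed.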
2:02:24
Andrey Breslav: In terms of form factors more generally — integrating into existing agents like Claude Code or Codex is one path. I think there's also interesting potential in integrating into IDEs, because requirements are structured and always around — you can surface them in different ways. In an agent session you have linear, historic context. But in the context of code you have structured context: different components, different levels of abstraction. It's potentially very helpful to surface requirements in relation to the code you're looking at — you may not
2:03:09
Andrey Breslav: even care what the code says; it just gives you the context and the level of abstraction you want to surface requirements for. For a sizable project there can be thousands and thousands of requirements — you don't want to read them all. But for a human, surfacing requirements scoped to a given code context is really helpful. We also have in the works a plugin for a web framework that lets you click around in your web app and surface requirements relating to the specific component on screen. So whenever you
2:03:54
Andrey Breslav: can scope your requirements through some natural context a person is already in, it's very helpful — you don't have to surface thousands of requirements at once. IDEs, developer tools for web frameworks, a browser plugin — any natural context the developer is already in.
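Scoping requirements to the code a person is looking at could be as simple as attaching each requirement to a path and filtering by the open file. This is a sketch under that assumption; the `Requirement` type, the `scope` field, and the example paths are hypothetical, and a real tool might scope by component or symbol instead.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Requirement:
    text: str
    scope: str  # hypothetical: repo-relative file or directory; "." means project-wide


def in_scope(reqs: list[Requirement], path: str) -> list[Requirement]:
    """Return requirements attached to `path` or to any directory containing it."""
    target = PurePosixPath(path)
    result = []
    for req in reqs:
        scope = PurePosixPath(req.scope)
        if scope == target or scope in target.parents:
            result.append(req)
    return result


reqs = [
    Requirement("All money amounts are stored as integer cents", "."),
    Requirement("The cart persists across sessions", "src/checkout"),
    Requirement("Search is full-text", "src/search"),
]
visible = [r.text for r in in_scope(reqs, "src/checkout/cart.py")]
```

With the file `src/checkout/cart.py` open, only the project-wide and checkout requirements surface, and the search requirement stays out of view.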
2:04:20
Nathan Labenz: What have you noticed about the difference in model character? You mentioned Claude doesn't want to read all those requirements. This calls to mind various compare-and-contrast exercises where people say Claude is a bit more human-like, maybe gets the human condition a little better; Codex is maybe more diligent, tighter on instruction-following, but perhaps less likely to recognize when what you said isn't exactly what you meant. From your particular angle on extracting value from models,
2:05:05
Nathan Labenz: what unique takeaways have you noticed? And are you using models in complementary ways, or do you find one is simply the best for what you need?
2:05:18
Andrey Breslav: I'm trying to evaluate more models but it's extra work, so I'm a little behind personally — other people on the team are looking into different agents and models. One helpful pattern is to use one agent as an MCP tool for another: if your primary agent is Claude, you can use Codex as an MCP tool to review Claude's work. It's not a hard proven fact that it works better than just using another Claude instance for the same thing, but arguably they have pretty distinct training sets and can offer different perspectives. Personally, I don't care very much about response style — I always try to get responses that are as concise and robotic as possible, so I'm not super sensitive to that. I have style guidelines I bake into every agent I work with. There are also important characteristics around latency — for some things
2:06:05
Andrey Breslav: latency is a very important problem because we want to make a lot of small requests. If you go to Opus with every one of them it's going to take ages, and this is why the old version of CodeSpeak is somewhat slow. We're looking into combining different models — sometimes Gemini Flash, which is considerably faster but less versatile in some respects; sometimes a smaller Claude model. So we're looking at these things. We should do more of that. Right now we're in more of a research phase so we don't worry about cost as much, but as we get more users, cost will matter too.
2:07:26
Prakash: Do you think you'll end up using some of the Chinese open-source models? Is that on the horizon?
2:07:33
Andrey Breslav: It could be, yes — Chinese or non-Chinese open-source models. An interesting thing about modern software engineering is that more and more of it is actually machine learning. And in machine learning, data is often more important than code or even algorithms. If you have a very good dataset that captures what you want to do, you can take an open-source model and fine-tune it. The big question is where you get the good dataset — that's really tricky, and we've been doing some work there and will keep doing more. But
2:08:18
Andrey Breslav: at some point — possibly pretty soon — we'll be looking more and more into open-source models, fine-tuning them, or doing other forms of post-training like RL. That can be a big cost saver for our users and also a latency boost. A lot of modern open-source models are pretty good. They're still too big to run locally, but we can host them ourselves.
2:08:56
Nathan Labenz: Fantastic. I'd put you safely in the category of diverging from the end-to-end software factory paradigm — including Factory itself. I look forward to trying CodeSpeak in my own stack and seeing what kind of intent recovery we can glean from the cutting-room floor of our own work. And I'm really looking forward to the day when you achieve the bigger vision of extending this to other areas beyond code, so we can live in a
2:09:35
Nathan Labenz: much more coherent, predictable, high-intent, high-signal legal environment as well.
2:09:44
Andrey Breslav: Yeah, that could take a bit of time, but I'm really looking forward to that world too.
2:09:51
Prakash: One more question. As a truly historical figure in programming, how do you feel about the way software engineering is changing with these models? A lot of software developers seem to have a nostalgia — almost a grief — for an art that's being lost. Emotionally, how do you feel about handing the code over to the model?
2:10:26
Andrey Breslav: I'm actually pretty skeptical that the things people are lamenting are really happening in the way they fear — maybe that's an internal bias helping me stay optimistic. I also remember the UML days; I'm old enough to have caught the tail end of working on UML-related tools. And that was a complete failure — it didn't really work. This current wave works, and it actually delivers value. But I also know how much people can project and fantasize about a bright future in ways that blow things out of proportion.
2:11:11
Andrey Breslav: What I see in reality right now is that software engineers struggle to even improve their own productivity, let alone delegate everything. That will improve — people will figure out how to use these tools, the tools will get better, and we will definitely delegate a lot of low-level engineering tasks to them. But for now, I keep an optimistic outlook on humans remaining engineers. As an engineer, I never cared about writing assembly by hand. Some people enjoy that, and they remain experts with well-paying jobs. But there are few of those people.
2:11:56
Andrey Breslav: And I'm not one of them. I don't care about low-level work; I want to do high-level work. And I think these tools will enable us to do high-level engineering. It's very hard; it's always been very hard. And I'm looking forward to a world where I can really focus on the hard stuff.
2:12:24
Prakash: We're definitely looking forward to that world too. Andrey, thank you so much for your time. We look forward to using CodeSpeak.
2:12:34
Andrey Breslav: Thanks a lot for having me. It was great to chat.
2:12:38
Nathan Labenz: We'll report back. Appreciate it.
2:12:41
Andrey Breslav: Thank you. Bye for now.
2:12:47 · Closing · 1 min
Close
Nathan closed with breaking news: Dean Ball is joining OpenAI to lead a new AI policy team, prompting a quick exit to connect with Ball for a forthcoming Cognitive Revolution episode.
As aired
Nathan closes the show with a breaking-news update: Dean Ball, whom he had mentioned earlier in the episode, is joining OpenAI to lead a new team advising the company on AI policy. Nathan notes he needs to run to speak with Ball immediately and is hoping to record a full Cognitive Revolution episode covering the backstory and reasoning behind the move. He wraps quickly, noting it has been a strong week for the show, that they are off the following day, and that a weekend highlights episode is coming soon. Prakash and Nathan exchange brief sign-offs.
Key moments
Dean Ball is joining OpenAI. He'll be leading a brand-new team that will advise OpenAI leadership on steering AI policy in the right direction.
Nathan Labenz · 2:12:47
Full transcript · Lightly edited
2:12:47
Nathan Labenz: Alright, I've got a quick wrap. Dean Ball, whom I mentioned earlier, has just broken into the news — he's joining OpenAI. He'll be leading a brand-new team that will advise OpenAI leadership on steering AI policy in the right direction. I need to go talk to him right now. We're hoping to put together a Cognitive Revolution episode with the full backstory, reasoning, and everything that went into this decision. So I have to be quick on the exit today, but look for that coming soon. This has been another really good week.
2:13:32
Nathan Labenz: We're off tomorrow, but we'll be back next week. And folks can also watch out for a weekend highlights episode coming soon.
2:13:40
Prakash: Alright. Bye bye.
2:13:42
Nathan Labenz: Thanks, Prakash. Bye for now.

The open — AI dividends, talent war, and impossible guardrails

Midjourney founder David Holz's announcement of a full-body ultrasonic CT scanner — 60-second whole-body scans at minimal cost, a fleet of 50,000 units targeting a billion scans a month — anchored a discussion about founders deploying AI profits into physical-world goods in ways institutional capital structurally cannot. Nathan connected the announcement to his son's cancer experience and the coming wave of diagnostic abundance. Noam Shazeer's departure from Google to join OpenAI read as a mission-over-money signal about recursive self-improvement timelines. And the government's demand for "uncircumventable" guardrails before Fable's return prompted both hosts to note the equivalence with demanding bug-free code — technically incoherent, potentially functioning as a de facto indefinite ban.

Neglected approaches to alignment — Judd Rosenblatt

Judd Rosenblatt previewed AE Studio's forthcoming gradient routing research: a pretraining technique that channels dangerous capabilities (CBRN, cyber) into dedicated expert modules in mixture-of-experts models, which can then be ablated from the public-facing model, addressing the root cause in a way that post-training-only safety filters cannot. He argued the alignment community's political monoculture (less than 2% right-of-center in AE Studio's surveys) prevents it from building an accurate model of the administration's perspective, and urged a steel-man reading: the Trump administration taking AI risk seriously for the first time is a positive development, not a hostile one. His call to action: pursue more ambitious alignment R&D bets in parallel, which is trivially cheap relative to compute investment, and build evaluation regimes that stay current with technical developments.

Software factories and intent recovery — Eno Reyes and Andrey Breslav

Eno Reyes laid out Factory's software-factory vision: a giant feedback loop from world signals through prioritization, development, and deployment, increasingly instrumentable end-to-end with AI. He argued Fable's strong Frontier Code performance actually proves his point — the model won by leaning on tests, linters, and type-checkers already baked into well-maintained open-source repos, which is precisely the deterministic feedback-loop investment enterprises need to make regardless of model capability. His competitive thesis: model independence is non-negotiable; anyone routing intelligence through a black-box third-party harness cannot run an independent business.

Andrey Breslav brought the language-designer's vantage to the same question. CodeSpeak's foundational observation is that developers already write the words that determine the code — the conversation with the agent is the specification — but that intent evaporates from the repo the moment only code is committed. CodeSpeak extracts requirements from conversation history as a delta-based living specification, making intent the first-class artifact and code the disposable artifact. Breslav placed his bet on helping humans navigate complexity rather than chasing full autonomy, arguing that DRY, KISS, and separation of concerns are human cognitive tools that will remain necessary regardless of how capable models become.