AI in the AM is a live weekday morning show on AI. Day four's triple-header ran from the AI-run back-office, to AI mental-health support in some of the world's hardest places, to the benchmarks that test whether AI can actually do science — all through a long, on-air bout of live studio debugging.
EPISODE 2026-06-04
AI:AM LIVE — June 4, 2026
A live morning show with Hooman Radfar of Collective on the AI back-office for businesses-of-one, Taras Pohrebniak of Elomia Health on six years of AI mental-health support, and Peter Jansen of Ai2 on measuring whether AI can actually do science.
Episode timeline
Opening — news + discussionA DNA-synthesis screening letter, OpenAI Sites eating the app layer, NVIDIA's Nemotron 3 Ultra vs. the frontier, and a Broadcom earnings flub — through a long stretch of live audio debugging.
AI leaders call for mandatory DNA-synthesis screening. A coalition letter asks Congress to require screening of DNA-synthesis orders against known-pathogen databases (and, Nathan expects, frontier-model guesses for evasion attempts) — codifying what two executive orders already encouraged. Nathan backed it strongly ("a pandemic is really hard to put back in the bottle") and tied it to the coming "Mythos moment for biotech." Prakash, a self-described accelerationist, said reading the history of technologically sophisticated terrorism — the Aum Shinrikyo sarin attack, run like a private-equity roll-up of chemical plants — moved him toward supporting controls on bulk industrial capability. Both framed it as a high-value choke point for under-resourced lone actors.
I am honored to have signed on to this letter. This is an urgent priority for near-term action by Congress. Biotech is advancing rapidly on its own, and I—and many others—believe the “Mythos moment” in AI/bio is coming soon. It is time for action.
Sam Altman, Dario Amodei, Demis Hassabis and many others have signed a letter urging Congress to increase security on orders of synthetic nucleic acids - and the equipment needed to make them - as models continue to become increasingly bio-capable.
OpenAI Sites: Codex stops writing code and starts shipping products. OpenAI's Sites turns Codex from an agent that edits code into one that deploys hosted websites, dashboards, and apps from a prompt, with access control. Nathan read it as the familiar platform playbook (Facebook letting users do the R&D, then picking the low-hanging fruit that works) and reiterated he's not bullish on the app layer: "I have a real hard time seeing" sustainable margins for frontier-model wrappers when "I already get this for free in Codex."
Codex app update: Build and deploy websites with Sites What changed: • Sites is now available in preview in the Codex app. • Use the Sites plugin to create, save, deploy, and inspect websites, dashboards, internal tools, web apps, and games hosted by OpenAI. • ChatGPT Show more
NVIDIA ships Nemotron 3 Ultra — fully open, but still behind the frontier. A 550B-parameter MoE built for long-running agents, shipped fully open (weights, synthetic data, and post-training recipes). On the hosts' preferred yardstick — Artificial Analysis's GDPval-style index — Nathan put it around 1448 (roughly the 20th-best model) versus Opus 4.8 with max reasoning at ~1890, a ~450-point gap that translates to roughly a 93% win rate for Opus. His verdict on the "commoditize your complement" play: "I would bet on Anthropic to make a good chip before I would bet on NVIDIA to catch up with Anthropic." Prakash's counter: most everyday tasks don't need frontier intelligence, so a model at 1% of the capability and price can still win NVIDIA a broad customer base — useful, since ~60–70% of NVIDIA's revenue comes from ~5 customers.
Today we're shipping Nemotron 3 Ultra. A 550B MoE frontier-intelligence open model built for long-running agents. It delivers 5x faster inference and lowers the cost of complex agentic tasks by up to 30% versus other open frontier models.
A Broadcom earnings flub erases ~$150B in minutes. Prakash's market open: CEO Hock Tan misread Q2 2025 numbers at the top of the call, the algos reacted instantly, and the stock fell ~15% before he corrected to a record $26B — the damage done. A bank note then forecast Micron margins down 66%, knocking another ~$60B off. His point: these companies are now so large that every 5% move is the size of a new S&P 500 entrant.
Much of the opening doubled as a live demonstration of the show's premise: a long stretch of guest-audio debugging (the first guest could see but not hear the hosts), with Prakash fixing the vibe-coded studio in real time via Codex. The hosts leaned into it — Prakash described having Codex check the live technical state of the stream, "the kind of thing that would require a million-dollar studio," and later (in the third segment) recounted how Codex diagnosed the bug as needing an audio-channel resubscription, which a speaker toggle triggered as a refresh.
On the through-line of margins and concentration: Prakash argued the AGI firms are spreading in every direction — up into the app layer (Sites absorbing Lovable/Replit), down into data centers (per Nadella, they'll build their own clouds) and chips — "like a virus," because a trillion-dollar valuation forces you to either create new value or eat existing value. Nathan agreed the app layer looks structurally hard, while noting the labs' real target is the ~$20T US labor market via a drop-in knowledge worker. The deeper point both kept returning to — and would put to the guests — was the high value of "sufficing" open models versus frontier models for the vast majority of real-world tasks.
TranscriptAuto-transcript, lightly cleaned · timestamps jump to YouTube
16:06Welcome to Let me update the uh Welcome to June 4th and start the segment. So, uh good morning. It is uh Thursday, June 4th, uh 2026. It is the fourth day of our um marathon like daily live streaming. And um we have it's been an interesting uh overnight. So let me just let me just start off with the markets. Uh so overnight um the major market news is uh Broadcom uh had its earnings and uh our friend Hawkan who is the CEO made a made a little bit of a error and he started reading and he read the uh Q2 2025 uh results. So he was like, "We did a max we did a record revenue of $15 billion or something like that." And uh immediately the market tanked because everyone listening in on it is a is a bot. And the bots were like, "Oh my gosh, the revenue is not what we expected." Market tanked. Um uh Broadcom stock down 15%. And this is a you know trillion trillion dollar plus stock.
It's $150 billion of value just erased. And then he's like, you know, and literally it's like one one sentence and then he's like, "Oh, oops. You know, actually it's a Q2 2026 and we had a record, you know, revenue of $26 billion like and it was just too late. The the stock was down 15% uh immediately and um and then later on people tried to make a narrative to justify it saying like oh you know uh Google is diversifying away from you know Broadcom. they're they're going to do chips with MediaTek and uh you know Hawkton's pointing out that Hawkton also pointed out that look our margins are not going to stay this high because the the amount of spending per megawatt doesn't look like it's going to increase but the number of megawatts the number of gigawatts is going to expand dramatically. People are underestimating. So he's like look our market size is getting bigger but our margins are not necessarily getting going to get go go go higher and the margins are probably going to decline.
And so the market kind of justified, you know, the reaction uh with that later
18:37statement, but the damage had been done like just that kind of miscue uh uh you know, coding last year's numbers instead of this year's numbers uh drove the stock down. So uh I thought that was pretty interesting. Uh and our friend uh Jensen Hang uh at um Computex uh Nvidia announced a u 10 or 20% reduction in the memory that they use uh per uh per chip for the uh for the latest for the latest chips. So they they're actively trying to push down uh the amount of memory that they're using, which is kind of what I alluded to yesterday that you know people would start to design around uh around the memory. Um and so that those were the two uh interesting things that happened and as a result um you know immediately what the market does is often like it tries to find ways because traders in the market especially the big banks they want the volatility. They want people to like oh my gosh the stock is down I got to sell. Oh my gosh the stock is up I got to buy right and so as the moment the stock uh the moment there was some sign the stock was going down immediately some bank puts out a note saying that hey you know micron margins are going to decline. And it's going to be it's going to go it's going to go down 66%. So Micron stock is you know somewhere there and they're like it's going to go down 66% in the long in the long run. Uh so immediately Micron is down like 6% today another $60 billion loss. These these companies are so big right now that uh every 5% fluctuation is S&P 500 like new uh you know new company joining. Um and so that is the that is the state of the market uh overnight. Um and just to uh kick us off, I think I will I will uh share the um what we were uh thinking of doing today. This is the uh the DNA screening.
let's There we go. And let me let me pull up the DNA screening. And there we go. So, this was the um top AICO's uh call for law protecting against biological weapons. It sounds so crazy. Uh Nathan, uh I I'm not sure if you if you read this, but what did you have um that you thought about this one?
21:09Studio experience is lagging a little bit. So, just want to make sure I'm coming through » seems like not you. » Yeah, it is. Okay, cool. » Um, yeah. Well, this So, yeah, a couple threads I wanted to follow up there. Sorry, I kind of shorted out on you for a second, I think. » Y » um this call for law protecting against of DNA sequences. I think it's a very good idea and people go to public access and we either don't have a great way to control them or andor haven't hardened the world around them. So my understanding is that this has been done by a executive order in the Biden White House, reinforced with an executive order in the Trump White House, and now, you know, they're they're pushing to take it all the way to force of law. And the basic idea is just that there are still companies out there in the DNA synthesis business who don't bother to screen your order against a database of known pathogens before they go ahead and produce it for you. And I think it's it's ultimately probably going to be, you know, quite a bit more sophisticated than just uh against a known database. You know, it'll also be guesses probably made by Frontier models to make sure that even if you've intentionally tried to get around detection that they can, you know, hopefully detect as well as possible. But this seems like a very good idea. You know, if there's one thing I know for sure, it's a pandemic is really hard to put back in the bottle once it's out. And you know, things like anything that can self-propagate, you know, anything that uh finds a an uncolonized world just waiting for it to take over the world is something that we should definitely be very careful about.
So, I support this initiative and it's kind of striking honestly that it it needs to come to this. It's a little bit like, hey guys, you know, couldn't you have been shamed into doing this on your own? There were already multiple executive orders around it. I guess I don't know for sure how many companies are not doing it. You know, this this may be » codifying something that has become
23:39standard or pretty close to universally standard practice » in light of those executive orders and it's just a matter of shoring it up. That'd be an interesting question for somebody who's deeper in the weeds than I am around, » you know, just how many uh companies are still not doing it. But I think it's absolutely a a good idea and I think we can you know watch this space for all kinds of um stories you know the sort of mythos moment for [snorts] biotech is coming not too long behind the mythos moment for cyber security » indeed um the you know as you know I'm I'm I'm a little bit of an accelerationist uh believing that the future should come faster and one of the things that has changed my opinion over the years is I started reading about um you know how technologically sophisticated terrorism has has worked and to date actually there's only been one group that has been technologically sophisticated and that was the amirin group in Japan and they did the sarin gas attack in the uh Tokyo Tokyo sub uh you know subway killed about I think 47 Those guys were a billiond dollar enterprise. Um, and they ended up using, you know, reading through it, I was like, you know, this is not very dissimilar from what a private equity group would do because they ended up using a bunch of shell companies to acquire um, you know, chemical plants uh, in in Japan. Like so they were you know $und00 million chemical plants that could you know um produce um you know sarin gas uh you know precursors in bulk and they had the engineering talent too you know being being Japan they they had the engineering talent to do this and uh what struck me was that you know you you always think of the al Qaeda types who are you know in like a cave in Afghanistan an best that they could do is that they could come to the US and kind of learn how to fly a plane.
They didn't manufacture the plane, they didn't make it. They exploited uh basically a social loophole there uh in order to get access to these tools and then they um use those tools. Uh and the the the Japanese attack was very different because those guys actually, you know, bought a plant and like modified it. Uh bought the precursors and they did their
26:10actual manufacturing themselves. Um and it struck me that um that kind of attack would be much more easy to carry out if the um if the people carrying it out had were uplifted basically in a sense by by this kind of greater intelligence. So that has kind of changed my uh my viewpoint a little bit and but that strikes me that the answer to that is probably much greater kind of controls and surveillance. Uh which is I think one of these things is uh you know this law protecting against biological weapons is really about um identifying people or labs that could produce in bulk in mass. It's not about stopping the weekend chemist in the in the backyard, but it's about identifying and stopping, you know, people or entities from building up the technological capability and the sophistication and the industrial capacity to actually um carry out a mass attack. Um, and so that that that strikes me as as quite different from just saying like the weekend chemist cannot is not allowed to to do this, you know. So » yeah, I don't know how hard it would be do a DNA synthesis on your own in the garage, so to speak.
It does seem like this is a fairly important choke point. I mean there there's no doubt that if you are wellresourced, motivated, sophisticated, you know, have a coherent team that it would still be possible to do something like this. But it is hard to build those, you know, those organizations. We we do have a lot of surveillance. It is hard to build those organizations without somebody becoming wise to. It's hard to maintain the coherence of a a group that wants to do something like that without somebody deciding they think better of it and going and you know telling law enforcement or whatever. So the this does seem like a pretty important choke point where if you are a lone actor and you're not super well resourced at a pretty affordable price point you could send in a sequence and get a small amount of whatever sequence back and then do something with it. And if you didn't have that, I'm not sure how much harder it would be to to do it on your own, but I think it would be like quite a lot harder. I think synthesizing these DNA sequences is still not
28:41easy. It would take you, you know, a long time. You would probably have a lot lower fidelity if you could manage to do it at all in your garage. So, it does seem like, you know, nothing is foolproof. This of course wouldn't catch every possible scenario, but it does seem like one of those things where you got to get you got to believe you get at least an order of magnitude risk reduction along this specific dimension of somebody wants to cause a pandemic and they have a sequence. Can they actually get that to a live result? if you could actually close down their ability to use the companies that do this on a you know small order basis because most of these things are coming most of these orders are coming from legitimate academic labs right so they're not like buying high scale uh most of the time it seems like I think I think it's a pretty strong idea and I do think you can get a lot of value without having massive surveillance although you know we might end up there we might need to go there for any number of reasons as well. You know, order of magnitude risk reduction in u pandemic preparedness may not be enough where we're going. I was really struck in um the course of my son's cancer treatment. We bought a um EUV light. Uh there's a couple different frequencies that have been demonstrated to work pretty well for killing pathogens. And we took one that you can just shine into ambient space into the hospital room every time we went in there just to try to protect him while he was so vulnerable from whatever, you know, infections that it was flu season and and all that. And nobody at the hospital even recognized what it was or had ever really heard of it. Uh, and we got a couple interesting, you know, questions from people who, you know, were curious enough to ask and their responses were like, "Oh, it seems like that should be in every room." I was like, "Yeah, it really does. We could." And there are other versions that you would maybe put more into like the, you know, the uh forced air systems so that because those are are more harmful to skin, harmful to eyes, but in a confined space, they can still work really effectively.
It really is, I think, a tough commentary on human affairs that so little has been done on this front.
31:11» U, wastewater monitoring is another one in terms of surveillance that really doesn't » harm anyone. You know, it doesn't really restrict your freedom of speech at all for people to be » monitoring the wastewater in major cities. But my understanding is that isn't really happening nearly as much as it probably should be as well. So, » yeah, » I'm glad to see this. I hope this is what, you know, the first in um what'll probably be, you know, a 10 plank uh platform of defense and depth against pandemics.
I didn't enjoy the last one very much, and I I certainly don't think uh the next one will be all that much fun either. So, I I would like to what do they say? Um delay, detect, and defend as much as we possibly can. » Indeed. Speaking of delay, detect and defend. Um, so I'm going to I'm going to uh switch gears uh very quickly to the issue of uh sherlocking. Um so yesterday OpenI dropped uh opening eye sites. Uh you can build uh uh build share uh websites uh directly from um you know chat GBT. So you can basically uh once you build an idea inside chat GBT a revenue dashboard or something like that you can publish it as a mini app and uh you can even do authorization on it. So you can decide uh who gets to see it and um they can come and they can log in they can see it they can use the app and um yeah so this is obviously um something that people have been putting together. uh the few the companies that would be involved are uh lovable on the consumer side and replet on the uh enterprise side. Um so as you as you noted before um what do you feel you know the site what do you think of the site's idea first of all?
Well, it's they're really starting to run the same playbook that we've seen platforms run in the past from Microsoft to Facebook to you name it, right? It's kind of inevitable. I think there as they staff up there's like nobody really to blame and yet it is kind of a gross outcome I think in some ways. Uh but I experienced this with Facebook directly where you know they at some point they were pretty open about like yeah the reason we have this platform is that so all of you can go out and do R&D for us
33:42and take the social risk that we don't really want to take but then when we see things that work like we've got an absolute army of product managers here who are » to varying degrees resting investing or motivated to advance. And regardless, you know, either way, it kind of makes sense for them to pick the lowhanging fruit of I saw something out there in the world that looks like it's working. » Yeah. » So, I mean, that's a little bit unkind, I'd say, to OpenAI. This is like a pretty obvious thing to do. It's not like, you know, if Lovable had never existed, would they still do this? Like, yeah, probably at some point. Um, I wish I had a better answer for why I think they're not they being, you know, the top however many intertwin cap table intertwined AI mega corps aren't going to suck up all the value. Mhm.
» But I, you know, looking back so far on the conversations we've had in this week one of this marathon sprint, I don't think I've really heard much that has convinced me that margins are going to be sustainable. » Mhm. » For people that are building rappers, you know, I guess is still basically what we're talking about around frontier models. I have a real hard time seeing it. Doesn't mean they don't have a business at all, but it does mean it's going to be hard to sustain margins. The barriers to entry there are going to be really low.
» Um, it's going to be really tough to compete with I already get this for free in Codeex or CODMAX or whatever. » So, yeah, I think it's going to be tough. I I can't say I've I haven't been bullish on the app layer for a while and I can't say I'm any more bullish coming out of this week. Uh and these announcements only reinforce that point of view. I think » let me uh segue a little bit there and uh so Satcha um Satcha Nadella yesterday saying um Open AI is let's face it anthropic overtime or OpenAI over time will build their own clouds. It makes sense they would use I'm not saying that they won't use other cloud providers but they will they are going to build their own clouds. And so now you see from that um from the model layer you see going upwards into the app layer but also going downwards into the data center
36:14layer uh Microsoft is uh thinking that you know they are eventually going to end up competing directly with open eye and anthropic on the data center side as well. So uh in terms of this platform, one of the one of the things that you see here is that the platform is not only going upwards and absorbing margin from lovable but also going downwards and absorbing margin from AWS and uh Microsoft and and the rest which I thought which I thought is really a commentary. Um the fact of the matter is in order to get a return on investment on this trillion dollar valuation um you only have two pathways and one is create new value and the two is eat up existing value. So eat up existing value means you either have to go horizontal and like absorb your your competitors in the model layer which means fighting against open AI fighting against anthropic or you have to go vertical and you have to absorb people who are using and people who are providing you services and so I think this is this is where this this kind of mo the the AGI firms are basically like spreading outwards in every direction like vertically upwards vertically downwards horizontally you know against each other um it's just just like a a virus almost, right? It's just it's just spreading in every single, you know, direction. Um, and it's just a commentary that how much uh capital that they have taken in at what price means that, you know, the the urgency to uh grow revenue at all costs is definitely there. Um, and and and you can see that uh you know, on a day-to-day basis, per se.
Yeah, they're going to go down to chips as well, right? I mean, that's um kind of already announced and pretty obvious, but there's really no end to it. Sam Alman was just in Michigan, not too far from where I live, I think yesterday, at the groundbreaking of a gigawatt scale data center, which I don't think has been super wellreceived. I haven't been to, you know, I haven't been involved in the local politics, but there have been some kind of shady maneuvers there where like it's been very um the details of what they're building have been very hard for the public to get access to. A ton of stuff has been redacted. Um it seems like they've kind of gone through some unusual channels to get these things uh
38:44fasttracked, which I'm sympathetic to frustration or objection to that. um if only on the basis that like everybody should probably play by the same rules even if those rules should be changed. » Um but as it stands, you know, there I don't think they're uh super welcome, but yeah, they're going to be building out every layer of the stack. Yeah, » again, I think it just speaks to the the value of the intelligence layer just being so high and who has a better chance at eating the other ones uh opportunity, you know, is it you've got Nvidia yesterday putting out this new model » that let me get the actual number on it here.
Um, it's, you know, 550 billion parameters, whatever. The I'm I'm not quite sure at this point which uh benchmark score, if I had to pick one, would be my absolute go-to, but I think it would be GDP val AA from artificial analysis. I don't know if you have a » different take on this, but interestingly, their latest and greatest model from Nvidia comes in with a score of 1448. And that would put it somewhere in like the range of, you know, maybe the 20th best model out there. Maybe maybe a little higher than that. But for comparison, Opus 48 is at least with Max Reasoning is now 1890.
So, it's like 400 [snorts] and some points ahead, which translates to a pretty high preference rate. I think that's like a let me let me just Google it real quick. ELO score advantage 450 points win rate. Um, yeah. So you're expected to win 93% if you have a 450 point advantage. » Mhm. » So Nvidia is trying to, you know, commoditize their compliments, right, and make a model that they'll give away and then you don't have to pay the anthropic premium and then all the the value acrru to the trip. But » if you're only winning 7% of the time compared to the latest opus, » yeah, » you haven't really succeeded in
41:14commoditizing that compliment. And it does feel to me like I would bet on Anthropic to make a good chip before I would bet on Nvidia to catch up with Anthropic. At this point » I I I I think perhaps the catching up will not be that necessary because you know I end up so I had this experience where uh I was setting up the open claw and I started off with uh codeex as the underlying because opening allows you to use the codeex inside open claw and codex was so so but it was also like I I knew I was coming to the end of my you know opening I had given me a 10 times uh you know bonus for the month so I had like extra tokens and so I was like okay I need to switch this out let's see let's see what happens if I switch it out so I tried uh deepseeek and I asked claude and I asked Chad GBT which one I should use they they told me you know start off with Deepseek so I started off with Deepseek Flash uh V4 flash V4 flash was okay but a little bit so so but some some parts of it were already better than codecs which was interesting. Um, and then after a while I was like, okay, you know what? I I want to switch to something higher. So, I went for the Deepseek uh for uh Max um and a Deep Seek 4 V4 Pro Max. And that is genuinely uh good and is actually I think on some tasks uh better than better than Codeex. Um and especially you know the recommendation given by claude chatgbt was to use four v4 pro max for you know uh difficult coding tasks. I'm not using that for difficult coding tasks. I'm using it for very general kind of API MC MCP usage in order to like extract information and like put in information.
Uh so for codec codecs I'm using for uh complex u you know uh coding. So I think like if you if you start off with a model which people are using for complex coding and then you get a model uh an open source model which you know on that kind of index is maybe only 30 or 40% of that index fine but then if you take that open source model and then you apply it to like a very simple task like organizing your calendar you have this you know kind of overhang of capability on the intelligence side and I think more of the tasks will be in that you pretty simple to do over time I think
43:46especially if we if we really imagine this kind of intelligence explosion and you know in two years you know a a single model will be smarter than all of humanity you don't require that kind of model to for your day-to-day life right and so perhaps we have this kind of edge you know frontier where things are very well-developed and those are people who have frontier level tasks will will need them uh and and you can imagine paying you know whatever amount amount. If you're going to get, you know, a a a a model which is able to um deliver a Nobel Prize every month or whatever, that's that's amazing, right? But, uh day-to-day, you don't you don't need that. The average person doesn't need that. So, maybe that average person is okay with a you know, a model at 1% of the capability at 1% of the price. You know, that that that sounds like a good good kind of trade-off, right? So and I think that's what that's what Nvidia is doing. Nvidia is in ensuring that they have a baseline of uh customers because their their biggest problem is that they have um you know their customers are very consolidated and concentrated like 60 or 70% of their revenue is just like five customers right and the problem with that is that you can kind of see an avenue for collusion where the other companies find a way to force Nvidia prices down and to some extent the whole AI safety thing about not giving chips to China is really a price negotiation on Nvidia and TSMC. It's about you know you have limited capacity this capacity should be going to US customers and then the US customers have their own price points. the US customers are not do not have to compete with the you know Chinese money and I think that's that that's like an underlying subtext uh which you know people don't really want to talk about because it's a little bit collusive um and so the bigger companies to some extent have used this kind of chips span idea uh to constrain Nvidia's market uh and so that they don't have to compete with um other other people the Chinese for the chips so I see that too um and Nvidia is now trying to get out of that hole. And they they also announced uh Windows laptops with uh with with uh you know smaller chips that you could have uh AI AI laptops. They're trying to get out of this you know constrained you know 70% of their customer base is just like five customers and trying to get a wide the widest space of customers possible so that they're not like squeezed out in the future. So I I I
46:17find that I find that you know interesting move by Jensen that that he doesn't he's very well aware of that and he's trying not to um not to get price negotiated. this talk about just how valuable the » sufficing models are versus the frontier models is a great question that we should ask our first guest who I believe is here but I'm not sure if he's seeing or hearing us. Um but » I we got an email saying he's in the green room and » all right there he is. There's a notification in our little green room.
» as well, uh bring him up. I think you should be up there. Human, can you can you can you hear us? I'm not sure his camera is on. He's I can see him in the green room, but I can't hear him or let's see. Okay, let me let me let me check. Um, » he can't see us, but he can't hear us. » He can't hear us. Uh, » all right. Uh, So, he's gonna log out, come back. Hopefully, that'll work. If it doesn't, I'm interested in maybe you need to prompt Codeex to uh debug in the meantime, [clears throat] » but I'm interested in the tying this back to the initial comment about Broadcom margins » shrinking.
» Why would that Here we go. » No, no, » no, no. That's just you again. You from another angle. Um I got excited. Uh, » and [clears throat] » hear me. » Ah, there. » Oh, there we go. » I did the test. » Uh, there we go. Can you hear us? We can hear you now. » So odd. » He still can't hear us.
49:08Quick quick check. Let me do a quick sound check and Okay. Yep, it works. All right, so we have All right, let me let me let me kick him and Could you Could you uh could you try Okay. This is the AGI complete problem. Um, all right. Okay. Let's try. » So, are you prompting Codeex to debug? » Uh, no. Well, I I did the test, but uh let me let me also prompt Codeex to see uh if if it has uh any any ideas. Um this is surprisingly the first time we've had a guest issue, which is uh is unusual. You'd normally expect the guest issue to prop up uh to crop up first, but uh surprisingly uh we haven't had a guest issue so far.
And » so in the meantime, yeah, on this margins question, » yeah, » why would margins go down? It seems to me that as we look at chip prices per hour today, » Yeah. » the rental prices have gone up. » Yeah. And you know there was as always Rune had a great tweet on this yesterday basically saying that as we approach the singularity we may see retail shut down and just everything get absorbed by the sort of you know country of geniuses in the data center are pursuing the Nobel prizes what have you. Yeah, » certainly it seems like if inference is going to be priced based on the value that it can create, » why wouldn't Broadcom be able to capture that and preserve margins for you the foreseeable future at least?
51:39Um I think I think the the issue is that there's only so much margin that you can you can squeeze out because margin is a percentage number right you can't you can't you know you go from 50% to 60% to 70% to 80% to 90% to 99% right like there's only so much margin you can squeeze out um of the of of of the market itself before you create demand destruction right And you squeeze out margin by either increasing your prices or lowering your costs. You can't really squeeze out lowering your cost because everyone else is also uh capacity bound and um at some point they have to expand and they say well if you want me to expand you're going to have to pay in right you're going to have to pay in and fund my expansion.
So that limit like you can grow revenue but it's very hard to grow margin because margin implies that you're taking more of the same pie. So I think that's that's one of the things that I feel um you know all right let's let's get » Who am I on? » Can you guys hear me? » We can hear you. » I can hear you. » You hear me? I don't hear you. » Oh incredible. Terrible. Let me I'll put on my headphones [clears throat] but I don't know what's going on. We will try uh um you see actually now I don't hear you anymore.
Uh, we don't we don't hear you now. for uh live debugging I end up using a codeex on high speed and uh we will we'll try and see if it
54:10can debug live. Um yeah it's it's just a it's just a hard ask I think for uh margins uh margins are not that that easy to um increase. Okay. » Can you hear us? » No, I can't. I have no sound. I don't understand. It's so weird, dude. I just I've been doing [laughter] I don't understand. » Uh my speaker is working. So, » uh we can actually do um [ __ ] I can send you the link and you're going to be a little bit delayed, but you'll be able to hear us and with a little bit of echo. » I'm going to send you the link in the in the green room.
» So, this is the link in the green room. So, I think you can you can just um I've sent you the link in the green room. You can just kind of um » copy uh copy and go onto the link and you should be able to hear us. Um or I can actually also call you. Um let me get interesting sequence of events. Um we all Okay. Let's do microphone. » Can you hear me? No. » Yes, we can hear you. You're good. too. » Can you guys hear me still? » Yes, we can hear you fine. » You can. » Yeah, » you can hear me. Yes. » Mhm.
» So, just call, man. Put me on speaker. I can use my headphones. Listen. » Um, » you want me to call Perash? I can I can try this. Nathan, Nathan, uh, it's, uh, Codex suggests, uh, you and I, uh, toggle our mute, uh, on and off. » Sorry. Go adjust the » Okay. » to zooms. » Does that help? Do you hear me now? » Does that help? Do you hear us? » Sounds like a No. » No. No. No. » All right. I'm going to just call and put you on speaker » and we'll see how that works. I think I think you can call him and
56:40then you can mute your end and then he'll be able to hear us. » Yeah. » Um » All right, guys. » 4 1 2 4 9 8 9 1 5 0 is not available. [laughter] » May just please. We'll have we'll cut that we'll cut that piece out of the uh [laughter] we'll cut that we'll cut that out of the uh uh YouTube. So we'll edit it out. » That's really » luckily Luckily the editing is uh is uh can improve the show post 8915. » Okay. » Your call has to voicemail. Yeah, it it says just use the YouTube X output because it says that that is healthy. Um, Codeex is actually able to live check the check all of the inputs and outputs and the stream. So, it's pretty it's pretty amazing.
I, you know, having someone check the live state of a of a stream, like the technical live state, that would require like a I don't know, like a million-dollar studio. [laughter] You need you need like a dedicated tech person to to manage um you know, the whole the whole sequence. So, it's it's really kind of stunning, I think. All right, let's let's see if we can get him back. Um, all right. And let's » Okay, so no sound. Can you hear me? » We can hear you. » Yes. » Okay, you guys hear me now? You can call me. I set everything up if you want to try me.
» All right, let's do it. Here we go. » We're gonna we're gonna hack this [ __ ] » Yeah. Look. Look. Okay, here we go. We can we can edit it all. Nathan, that looks like Are you calling me? » Yep. Now, Pash, can you hear human? » Perfect. I can hear it » now. You got it. » Yeah. » Okay, » go ahead. Just start try. Let's just test this out and see if it works. » Yeah. » Can you hear? » I can I can hear you. I can hear him. I can » I don't hear Pash. What the hell? You
59:10don't you don't hear me? » I hear you now on I think hear you on my actual Okay, go. » Can you hear me now? » I'll tell you right. Yeah. Yeah. Yeah. I hear you on my headphones. I'm good. I'm good. I'm good to go. » Can you hear me? » Okay. Well, I don't hear precaution anymore that but I can just take it from here, I suppose, and we'll um » Yeah, we can we can we can do the raise hand and we » we're living sort of in the future. » We're going to be in the singularity soon. No, we're not. the future may be uh prone to surprising and uh hard to eliminate bugs. Um » Okay.
» Well, tell us about your company. You're building for soloreneurs and you guys are running back office. We were just talking before you joined about the NVIDIA model yesterday and how it compares on benchmarks like GDP vala from artificial analysis to the frontier models. and we're debating whether or not these open source models are starting to become competitive. Give us the 101 on the company and then I'm really interested in your take on whether your customers demand frontier models or if you're able to at this point um lower their costs and and provide still the quality of service that they need with some of these open models like the one Nvidia just put out yesterday.
» Great. Uh well, let me tell you what I do uh quickly. I'll keep it brief and we can jump into some of the uh more fun topics like that. So uh I'm the founder
Collective — the AI back-office for businesses-of-one
Hooman Radfar250 clients per bookkeeper and climbing, software-grade margins, and a sober economic case for why the frontier labs won't simply absorb the app layer.Hooman Radfar, co-founder and CEO of Collective, described an autonomous finance department for the ~30 million US solopreneurs — formation through tax filing — at accounting-firm outcomes, software convenience, and software prices (entry ~$200/month). His claim: Collective is the only AI-native accounting-automation platform at scale that has actually proven the thesis, with a grounded "our customer base is America" perspective. On open models, he was bullish that they're "getting materially better" (and praised the Chinese open models), arguing a token-cost "apocalypse" is coming inside enterprises that forces a switch — and that for his back-office work (expense categorization, reconciliation) the customer neither sees nor cares which model runs underneath.
The efficiency numbers were the headline: where a QuickBooks/Xero bookkeeper might support 30–40 solopreneur clients, Collective supports 250 and doubling, reaching software-grade gross margins he says no one else in the category has hit — partly because, in his view, "people are overusing inference" where deterministic models would do. He framed the segment as a prosumer market stuck between two bad choices (spending one in four hours on finance admin, or expensive firms), cited a customer who went from $200K to $700K ARR in six months using AI, and predicted real disruption for the ~50,000 small-practice accountants serving that population: "the person that breaks the four-minute mile — once it's been done, it's going to happen again and again."
His most substantive contribution was an economic — not technical — argument for why the frontier labs won't simply eat the app layer in the near term. Once Anthropic and OpenAI go public, two forces constrain them: getting their house in order into a profitable "cash machine," and, at a trillion dollars, needing to win only the very biggest markets (enterprise cloud, search, e-commerce) to move the cap meaningfully. Labor is the real $20T target, he agreed — but "labor drives an experience" that must be built, managed, secured, and provisioned, and trusted/regulated outcomes are where app-layer value persists: "is Anthropic going to want to be the one who signs a tax return?" His TurboTax analogy: it's cheap, yet many still pay for "someone's ass on the line." Where he would focus regulators isn't token pricing (the market sorts that out) but the arms race itself — the next model that's "a version too far."
TranscriptAuto-transcript, lightly cleaned · timestamps jump to YouTube
1:00:41and CEO of Collective. Uh Collective serves the 30 million uh solopreneurs uh in the US. It's the largest group of founders, largest group of workers. Um and so right now 80% of people need a finance department. And so uh in AI parliament uh we are trying to build the autonomous finance department. And so uh we are combining the power and outcomes of what you would expect from an accounting firm. Everything from formation uh to tax filing with the efficiency uh and convenience of software. You know single application, single conversation, all your workflows in one place. Uh and also the prices you would expect from software. So uh we've been doing this uh for a bit. I think we're the only AI native platform right now in the country who is at scale uh for accounting automation and actually proven this thesis. So it's it's pretty amazing. Uh we've we've really been working with AI tools since day zero. Um and have a pretty strong perspective on it. I would say it's a very applied perspective and uh a grounded perspective because our customer base is America. You know what I mean? Um so I'm in the valley right now uh in the Bay Area and I think all of us here think you know put aai on your business and uh you know really talk about on the website that's the best thing. I'm not sure it is. I'm not sure it is. uh for everyone else in the in the country. So, uh we can we have that conversation. But with regard to the the models conversation, um you know, I I've been talking to a lot of these different folks, whether it be, you know, the open AIs or the NVIDIA of the world. Uh I do think uh number one, the open source models are getting materially better. I think the open question for the community and particularly the US community is what do you Chinese models because they're really good. I don't know if you guys have tested those out.
They're really good. Um, and I think inevitably, uh, those open source models, uh, they won't be, you know, top tier for every use case. And a lot of it depends on, you know, where you're hosting it, the the optimization on like inference providers, like if you're going, for example, the, you know, who who are you hosting with, how are you tuning it, and everything else. But it it has to be good enough for most applications because the idea that I mean if you look at the token cost right now in particular for anthropic and open AI I think there's a great apocalypse that's about to occur where everyone is going to have to switch plans you know inside of enterprises uh because I mean it's just getting way too expensive and so it is working I think it has to work
1:03:12on our end we look at it differently because a lot of the AI that we develop we do have frontfacing AI and conversational experiences with an application. But a lot of the power for us is what you don't see. Whether it be categorization of your expenses, you know, as as a business owner to do your bookkeeping or reconciliation, you know, whether it's a thousand monkeys or a thousand robots, you don't care, right? Um we do um because we want it to be delivered well for you and we want to make sure we do it efficiently so we can stay in business. And so whether we use model A or B is irrelevant to our customers. I don't even think they know what model A or B is. In fact, I think the vast majority of America could care less.
» So, so the intro price point for collective is 200 bucks a month, right? And » correct » that um feels like it's, you know, obviously right in line with like a cloud max plan or a codeex uh pro plan how how many monkeys do you have in the in the loop? How has that um mix of delivery method changed and where does it where does it sit right now? Where do you think it's going to be in say you know another six months or a year? » I mean it's been fairly drastic candidly. I think we're our business inlection point which uh you not surprisingly is correlated with the inflection point that's happening in the marketplace. uh we're full stack AI accounting solution, right? And I think a lot of the full stack solutions and I don't want to speak for them because they don't operate, you know, Crosby for example or Corgi or I mean there's a lot of these solutions that their thesis is I'm going to architect some service from the ground up uh using AI at the core.
Now ultimately they're they're going to use some people in human loop. In the best case it's for evals and exceptions, right? But most likely they're doing some level of unit work. Now whether that unifor is reusable in training the system uh that's a different story and that's an architecture question. But um for us we we've just been doing it for long enough and we're far enough we're a pretty significant scale. Um the way I would answer is this like a standard bookkeeper which is one element of our service. uh if you were to go use QuickBooks or you know Zero for example you would probably able to support our client size our client type again soloparreneur handful of employees at best right uh maybe 30 40 clients you know per bookkeeper we literally can support 250 and it's doubling so we're getting quite good our efficiency is is
1:05:42pretty um outstanding and it's it's starting to get there where our margins are software margins and we are the first company to be able to do that everyone else their gross margin is pretty poor and in part that's because you know the way you implement and use tokens the way you're designed to use CPU versus GPU there's a lot of um implementation choices which are going to lead to people I think people are overusing inference in in a lot of use cases right we have deterministic models like why are you paying all this money like a thousand times more than compute uh just to do it so I think you have to be artful especially because the token prices have not fallen in the way that I think people had hoped So, um, one of the questions that I had for you is that » Gosh, we can't hear you at all. I don't know if you can hear us, but we can hear you, » I think.
» Yeah, maybe you can call in. We can go old school. You you can do a three. » All right. All right. You You guys go ahead. Let me let me let me uh fix that. » I can add » Yeah, this is this is fun, man. We're literally talking about AI and like all this crazy stuff and models and we cannot get um you know uh » troubleshooting at the same time. Yep. » Yeah. » How much we want to be with you? » All right. So, we're merged. So, » let's go. when you mentioned so first of all just to recap that that um kind of ratio you're at something like in the 5 to 10x range in terms of number of customers you can support per human » correct » um which is pretty notable and definitely suggests major market restructuring to come perhaps you also mentioned using canonical systems like QuickBooks Zero etc.
Obviously, those come with a cost as well. Do your customers have to have those or do you see yourself replacing those if not already at some point in the future? Like, are you going to vibe code your way to a QuickBooks replacement so that all they need is you truly and and don't even need these other things at all. » Look, I think today um and our customer segment is very unique but very large. It is a proumer segment. They're consumers. So, they're left between two horrible choices. Okay? They're spending one out of four hours on finance administration. And with AI rising, remember AI is benefiting them as well.
1:08:13We have a customer that went from 200,000 in AR to 700,000 in six months by using AI, right? And they think they can get to a million. I'm seeing more and more of this. So, there's this whole billion dollar business of one discussion that's been occurring. Um, and I'd like to point out for the record that we said it first, but regardless, um, those that that is it's almost irrelevant. What do you really soon you're going to start seeing 30 million $40 million business? Three. Now, what those folks don't want to hire finance department if you can what's your N plus one hire? You don't want to hire a controller. You don't want to hire an accountant, right? Um, and so they basically are going to use those systems of record. Now, our solution, the other solution, their alternative is that they can drive outcomes with accounting and tax firms, right? But those accounting tax firms don't require you to go buy QuickBooks and all that stuff, right? I mean, you you they you they do that.
That's that's obscured, obuscated from you. That's the same as us. From the user perspective, for all intents and purposes, we we serve as an accounting firm, right? They don't have to bring anything to us. We will handle it. Um there are things they have to do because they are the accountable party running the business, right? I cannot unfortunately you know go and uh you know administer like certain things on bank functions and whatnot but you know it's I'm a member by the way I use it. I'm still a venture partner expat and it is uh about a third of the cost that I was quoted from my account to use the same to drive the same outcomes with software driven solution. So to me it's an inevitability that the space is heading towards like a really really strong disruption. 50,000 small practice accountants right now that are serving 30 million people like and so in a couple years this is this is really hitting it because I'm I'm telling you today not in the future we've already we finished it we're done and you know it's a kind of like the person that breaks the four-minute mile I think once it's been done and the word gets out it's going to happen again and again and again.
» Yeah. How do you think about maintaining enterprise value at the app layer? This has been a big discussion for us and I know you're a a angel investor who's had some uh very notable deals including I believe OpenAI, Stripe, SpaceX, Uber. Um you're on both sides of this, right? As as you're building at the app layer, but also I'm sure you're getting pitched all the time. When people come to you, uh how often do you find yourself thinking, geez, this is cool, but like it could be a clawed skill. Um, and then how do you make sure you're not ultimately a
1:10:43clouded skill? » I think everything's about scale of time, right? And I think if you want to be strict academic about it, if we take a really long scale of time, it's difficult to discern when and how this this market is going to play. there is a possibility and I think people are turning it into like a high probability that a company who is the foundational model provider at super scale um where these things can through codegen will start generating you know agents and replace the workforce and so in theory if anthropic could generate all the agents that we can and then orchestrate those agents and the people in the trusted data could they do what we do there's like again an inevitability we could see there's four companies in the world that you can everyone talks about that happening Now, what scale of time could that happen? I personally am not willing to underwrite that's going to happen in within the next two years. Uh I can give you some reasons why. Um and so for me, what I think you have to do as a business owner is you have to operate of course if you're if you're in an obvious strike zone. Like for example, I personally would not want to underwrite into a company that is in the core strike zone of open AI or anthropic.
Like I'm seeing a lot of AI assistant type work that are consumerf facing and they're like we're going to be better at managing x y and z like your inbox tasks whatever I'm not suggesting those companies won't do well I'm not suggesting that they don't even have value but you're claude and you know openis chatpt most certainly have to develop their competency it' be the equivalent of saying when google is nent and rising in search you're going to go do a search engine or do something in search they have to win that and they have the capitalization to win They don't have to win MySpace.
They just don't have to. And in fact, what I'm going to argue to, and this is not a technical argument, I'm giving an economic argument. When these guys go public and their uh financials are are, you know, there for everyone to see, I think they're going to be two priorities simply, and you saw this with other waves. Priority one, and you saw with Uber, you saw it with like they have to get their house in order and make this an attractive, profitable company, right? Can they do it? Probably. Right?
Entropic looks like they're they're better, but it's it's non-trivial actually because the transformer models are not conducive to uh you know a trivial amount of capex. I mean you saw Google just did that crazy I mean obviously like a first it's like a do move and we're in unprecedented times. Okay. And I don't know if Anthropic has that power the public company to go say oh I'm going to go sell a little bit of stock and and do the same thing Google.
1:13:14We'll see. Um so when they do that they're going to have to do Google's a cash machine. Microsoft's a cash machine and ultimately investors are going to value that cash machine at some point. So you have to work on that priority one. That's going to take resources. Priority two is they're going to be at a trillion dollars. Okay? There aren't a lot of markets that are going to move you up. You have to be winning the biggest markets, whether it be enterprise cloud computing, right? where you see a lot of competition in Microsoft, Oracle, um you know, consumer search, e-commerce, like these markets are so big that you can rationalize building in them and if you win them, you can then move from a trillion to two and two and so forth like right how many markets are like that and so those two forces when they go public are going to constrain anthropic and open the same way they did the Google, same way everyone else because until they have self-generating code that also can build the product that also can make sure those market fit and like all these you know combinatorics everyone's safe so I think there's this like illusion I mean there's always something I when I started my first company why can't Microsoft do it then we got bigger why can't Google do it then Facebooking why can't Facebook do it and so at some point again in the long run sure like if we're going to this like you know period of singularity we can all sit there and argue that but then let's not do anything everyone just stop let's go home and like you know either buy a bunch of guns or celebrate because you the end is near.
» Permaculture gardens. I prefer permaculture gardens to guns personally. » You know what I mean, man? Like it's it's kind of [ __ ] just to be honest with you. Like the whole the whole argument is like then let's not go to work. And it's funny because the same venture capitalists are arguing this are backing up the truck into these new startups as fast as they can, you know, because they don't know. They don't know either. Nobody really knows what's going to happen. But I think as a builder, uh, you can there's certain you can you can chart the course. It used to be probably if you were really really strong you might be able to go out five years. I don't think that's that's the case anymore unfortunately. I think you have to be really really nimble and AI gives us that agility and you have to walk around but I think they'll be concerned by economics and they're they have to make that market cap go up and and driving down cost going into big markets like do you think I don't know Sam building a a I don't know was that thing they did on finance the the thing that they launched is that going to make the company get another half a trillion dollars of market cap going into that segment and he has to answer that question that's tough Google killed build a bunch of stuff to make. But then what are they in? They're in Whimo.
They're like, I'm going to take driving.
1:15:44You know what I mean? Like what else is going to make that market cap go up? There's only so many trillions of dollar markets. That's that's why I'm less concerned. It's not the tech. Now, there'll be a point when the tech gets so good, but it's just not actually that good. If you guys have tried claw for small business, it's like imagine giving your mom doss back in the day and being like, "Hey, there's these things. I have all this like stuff. You can upload skills and you can do all this." and you just give it to your mom and you tell them to go run their business that way.
That's literally what they're expecting right now. It's it's it's actually like laughable. » Maybe Pash uh we might need to do a little reset here before our next guest. Why don't you work on that? I'll ask maybe a couple more questions and then we'll uh break and try to get our tech house in order. » Right on. on I think one big reason I mean so the big market of course that like I think that the frontier companies would answer with is simply labor right I mean labor is a 20 trillion dollar market in the US or something like that and they maybe don't have to compete niche by niche if they can succeed in creating the drop-in knowledge worker that you can just hire by the minute rather than you know by the hour or by the you know the the year and meet your needs that way, right? And I think that's the play. That that seems to be why these companies are » I think that is absolutely the play. I agree with that. But let let me give you an example. Okay, we already have labor and it's at a unit cost. So what you're saying is they're going to give it to you at a better unit cost, » right?
» Like okay, say I go hire four IC's today. Like imagine I came to you. I'm going to pitch you and I'm like uh going to venture cops. I'm going to go start a full stack accounting firm that's going to handle four or five different functions and I have the labor by the way it's a tenth of the cost but they're still IC2s and IC3s. So how are their workflows going to work? How are the data going to move between them? Uh what how do we manage them? How are they going to communicate with clients? What's that experience going to look like? So labor is a unit that's important but ultimately labor drives an experience and that experience needs to be built, managed and it needs to be secure and provisioned. And so I think that, you know, inevitably that's going to be a large part in it. And and look, if they have um an agent that can work better on particular knowledge, that's great. I I can't remember who I was talking to about this, but um actually I think it was it was one of my ambassadors, Darien Sherzi at Great. We
1:18:14were talking about Salesforce, and he had a really funny, but I think good adage. The like Salesforce, their future is not CRM. Their future is Salesforce. They literally should provide sales agents. You know what I mean? like that's what they should be every company should be moving from the tools to enable labor market into the labor market. Um what but you still ultimately have different levels of abstraction of work right so I have IC's I have managers I have people who are up there like I need a CFO my CFO can't go hire a thousand people and manage them right and so there needs to be some orchestration some workflow some trust the way the day is going to flow between and so it's I think there's a lot of opportunity staying in the outcome space and in particular trust and regulate outcomes where like is anthropic going to want to be the one who signs a tax return and says this is good to go.
Probably not. They'll look at your tax return. They might give you a, hey, I did it for you and then they'll put at the bottom of the thing like this might be wrong. Good luck, right? And so you just have to decide if that cost is worth it. But my argument to you is Turboax is really cheap. It's really cheap. It's 99 bucks and not everyone in America uses Turboax. Why? Because they want someone's ass on the line. They want a trusted outcome where they can be like, "Hey, you know what? You looked at this. I can ask you questions. Let's try to do A, B, C, and D." And so again, we're talking about y economics and the price is changing. I think it'll have disruptive effect. Most certainly. Most certainly. I just think um people are willing to pay for certainty and underwrite um you know that particular in business when you're talking about $349 versus like okay, I can go on the cheap and use DOSs and maybe it'll work.
So there's a price to the risk that I think they're willing to describe. » Yeah. Okay. Last question and I apologize for a rough uh technical experience here. No, that's right. » You mentioned unit costs and I think one of the things that is making it most hard for app layer uh folks like yourself these days is the dramatic difference in per token cost that is offered to people directly through a cloud max versus via the API. Yeah, » I am still not quite sure I trust these numbers, but if my claw is to be believed, it says that I've spent upwards of $1,000 in the first four days of June API cost equivalent on my Cloud Max plan, which I'm getting for 200 bucks, and which refreshes every
1:20:44week, and I haven't even hit my limit. So it would suggest that I'm, you know, something like more than 20 times subsidized if I max out my plan, which of course not everybody does. So that's a little bit um makes this a little complicated. But one policy move that you could imagine folks making in a sort of antitrust sort of way would be like you must offer some sort of pricing parody to your first party customers, you know, versus your API customers. Do you think that is something that if the government wants to get serious about um you know protecting diversity and having a sort of you know rich ecology of uh innovation is that something that you think the government should start to think about?
» I think there are pro they probably won't because and I'll tell you my my my underlying drivers for this. So if you look at the the first order bit that they want to prior to consumers because consumers vote um I think businesses have a larger sale obviously because of the um incentives with packs and whatnot. So I could see like who are you protecting if you if you go over token? So you're just saying you're like hey I'm I'm I'm free riding the system in theory right Nathan you don't care like you're good like where it hurts is an enterprise and this is actually happening think this is going to be a massive problem that's about to occur is I I I jokingly say anthropic is like a drug dealer. Okay. So they're like, "Here, go use this thing. Get addicted to it." And then they give it to you for free. And then everyone gets addicted, overuse it, and then everyone hits a plan. And then it's an order of magnitude more cost. And this is going to happen to more and more organizations. Like I can tell you in our organization, it's a it's a great product, right? It's not that. But now I have to sit down and be like, who's going to get this because we're going to hit the limit and I'm not going to pay 10x the cost. And so now you have to rank order developers. But even with the developers, how you going to do that? So it plays into your opex. it plays into your R&D costs. I think there's a lot of poor engineering going on and people are overusing the AI tools again using inferences instead of compute. There's deterministic patterns that are much better that they should use and they should and that'll people will swing to that. There's also the fact that you can use you mentioned the start of the call these off-the-shelf open source kind of models. And so I think people are going to start getting they're going to have to engineer. You can't just go and you know like beat the [ __ ] out of these things and have venture substance.
You're going to say they actually should be an engineer and understand it. But I don't think the government is going to care that big businesses are going to get fleeced by anthropic. Uh I think
1:23:14they tend to care in the reverse where consumers get fleeced and right now consumers are getting a free ride at least today. What I do think is going to happen is these consumer plans are going to change. Uh and they're going to be potentially more expensive if they want to be profitable on the consumer side if that math works out. Where they do need to regulate and and I'll kind of end on this note is um I think and this is very difficult. Well, I mean we started the scale of time these models like you look at mythos and you look at some of these these these leaps that are occurring they they there needs to be some sort of way and regulation is let's leave the blank word but some sort of way that as a group and I call the group that people are technologists are looking at this and saying wait a minute hold on if we don't there's one there's one version of this model that is a version too far nobody knows what it is nobody knows when it's going to happen but there's this the next version becomes too far mythos was a first taste of like, wow, if I open that up, the level of attacks are going to go out of control. Maybe I have to think about it. And so that to me is where I would spend the time regulatory wise, not the cost and price of tokens. That'll the market will sus that out, right? Because, you know, anthropic messes up, open a will fix it.
You know what I mean? That that's a market force that'll fix that. But the race itself, the arms war that's happening, that is actually the result of capitalism. And that incentive to go to the next model being better, that is where you have to regulate because the current incent structure is not for them to build safe, you know, aligned models that are aligned with humanity. I mean, the one I think about all the time is Andrew. Like Jesus Christ, like I don't know if you guys watched Turner recently, like we clearly have just forgotten James Cameron's teachings. Uh, but I mean those guys are running as fast as they can to autonomous trails, it's a good company. I I don't actually have any political problem with it. But like it's scary like you know if you take humans out of the loop which they're debating they're not even it's you know on those these are decided to kill people like that's it's pretty strong and all these models are now accessing the internet and they're working together and so there's just like the combinatorics are going to be staggeringly bad for for an incident. So that's where I would spend the regulatory time is that next move and just making sure that there's people in line. Even I don't even think that they're going to do a good job regulating. We're just shitty at it as humans. Forget government. But just make a step where we take a breath when these things are occurring just to take a look to make sure we're not making the step too far.
» Yeah. Well, I definitely agree with that. The uh arms race is a huge problem and uh you can count me, it sounds like you as well, with the Pope in calling
1:25:44for the general disarming of AI. Um really appreciate your perspective. Again, sorry for the technical difficulties, but this has been a great conversation. The » I hope you guys figure it out. I hope you figure it out. You know, thank you. We will get there. » Uh the company is collective. It's the back office for the entrepreneur solopreneur business of one. Check it out. Thank you, man. » All right. Take care, guys. » Okay. Can you hear me now? » All right. I still can't hear you on the stream, Pash. So, I will maybe leave and return as well.
Never dull moment here on live TV. » And can you » I still don't hear you. Should I leave and come back? I'll leave and come back. Why not? I'll try it. » Give us Give us a second. Tas uh test speaker works. All right. and let Nathan come back. We are waiting and we are testing um step by step and we are also trying to uh do this live. Yes. Yes. Teras, rejoin. Rejoin as well. Nathan, can uh Yes, I can hear you. Nathan, can you hear? All right, let me also drop. » You hear me? » I can hear you. » Okay, I still don't hear you. Um, let me see. The test actually did not work for me, which was interesting. In the lobby, » ah, the test speaker didn't work.
» Um, now that I'm in I don't have I don't see a way to Oh, I can change my Let me just try going to Yeah, Changing speakers isn't making a difference. » It It's not making a difference. Uh » Oh, maybe it did. » Can you hear me now? » Yeah, now I can hear you on the Interesting. » Ah, okay. Okay. » So, it was like it it did a resubscription of the uh of the audio channel. So, I'm implementing like a like a fix now.
1:28:18So, this is the first time we've had a ho a guest issue. Actually, we we started off with a bunch of host issues, but the guests were uh flowing in without a problem. So, » all right. I have you now currently on speaker, » so there might be a little echo. I don't know if that's going to be a problem. No, no, you you can switch back now. Every every time you switch it resubscribes, so it's basically like a refresh of the audio channel. So, you should be able to switch back to your AirPods or should be fine.
» Gotcha. Okay. Gotcha. On the on the AirPods. So, that's good. » Yeah. » Um All right. Let's see if we can get Taurus. » Yeah. Let's see if uh Tus is back. Um I'm implementing implementing like a reconciling um you know fix anyway. So it should be should be up in like a few minutes. It is it it it's actually interesting to work with codecs on this because codecs can work like very fast very quickly but you know I I I am not like a very sophisticated coder uh and like you know like barely barely barely competent. Um and I think you know a more sophisticated coder would have more checks and balances.
and a more sophisticated uh I think coder would have um you know more ways to like kind of deal with it. Um, and the main the main thing that I found is that you have to give uh if you want if you want the code to work, you have to give it an endpoint that Codex is able to check on its own. If you're doing the checking, you almost always get to like some edge case that you know you didn't think about and uh when you enter production, you have this this issue. But if you have codecs like kind of define like all of the test cases and then you have codecs be able to actually like execute and one of the problems with audio is that audio is something that you know has to be tested you know independently on every device which is a which is a tough one. Uh I've had I've had Codex do things like um for example
1:30:50produce a user interface uh on clawed design and then have codecs kind of like uh you know design and get that exact same interface on video as in like every single frame of the video has that same interface uh without without actually like you know filming it at all. And what I ended up doing was I I asked Codex like, you know, how would you do this because I want you to generate like images basically, but the images have to match like an HTML page. Like how how are you going to do this? And Codex said, you know what? I'm going to create this thing where I measure the root mean square error of the product image from FFmpeg and I'm going to compare it to the HTML page that you provided me. And then I'm gonna, you know, I'm going to work on this until I can get exactly the the root mean square error below like 0.06. And I actually like went through a bunch of um, you know, alternating like little fixes here and there and it got it it got there. Uh I I was very impressed because I I had tried you know I had tried to do it and I I tried like tell it do this do that move the move the shading over here like you know move it back and um and it just hadn't hadn't worked. So just fascinating how it uh you know at the end of the day what you need to give it is a way to um verify on its own what it's doing.
» Make the problem verifiable. Yep. » Yeah. make the problem verifiable and the moment you make the problem verifiable um it is able to um get there. So, and and it just speaks to me that I might be and I think a lot of us who grew up coding might be stuck in the last paradigm because the last paradigm it was you had to take ownership of the problem. And taking ownership of the problem meant that you were the one who had to verify it. Like you were the one who had to kind of plan ahead of time and you were the one who had to verify it. you could let someone else do the coding but you know it was your responsibility at the end of the day and uh I think maybe the younger generation doesn't come with that baggage and they just are able to like relinquish ownership of these projects to to these things and just tell them what what they want as the end product and how they expected to
1:33:20test it and just relinquish ownership. I I I call it I call it declaring bankruptcy because you kind of you kind of like at some point you kind of give in to to the machine and you're like, you know what, this is complicated and it's just taking me too much effort to tell you exactly what I want. So, I'm going to relinquish ownership to you. I'm going to declare bankruptcy. I'm going to tell you like, yo, this is what I'm doing. This is what I want. This is what how I'm expecting to test. go for it and just let it go and and then and then you you get these results that are unexpected that the the machine can actually get there. Um and and that that's very striking to me because I think that's like a process that you go through when you use like self-driving on a Tesla when you use coding with codecs. Um I think in every in every vertical you end up going through this process of like at some point you declare bankruptcy and you hear about all these coders right now all the all the people who are coding right now are are saying that they they're just working so hard they just don't have enough time because they have like 25 agents running all the time and they're monitoring them and you know they're checking up on them. And I feel like maybe that's because we're all stuck in the old paradigm of this idea of like taking ownership. And so if you deploy 25 agents, you feel that you have to, you know, take ownership of them.
And instead what you need to do is you need to take a step back and kind of figure out a way to force these agents to like work with each other and verify things on their own so that you don't have to do this. And so you have to spend a little bit little bit of time like establishing this framework of work for them that they can self-verify. Um, and then then it works. So, um, » hello. » Hey, who are you? Can you tell me? » Yes. All right, we're going to do it uh on speakerphone. Um, » I think we can we can » tell me who you are. I know who you are, but tell me your name u so I make sure I'm going to say it right in the future.
» Yeah. So my name is Tas and I'm a founder of a company that is called
Elomia Health — six years of AI mental-health support
Taras Pohrebniak70% of tokens spent on safety, why hyper-realistic voice is a deliberate non-goal, and what front-line Ukraine and US prisons taught the team.Taras Pohrebniak, founder and CEO of Elomia Health, has spent seven years at the intersection of AI and mental health — positioning the product as "Codex or Claude Code, but built for mental health": not a regular chatbot but an agentic architecture where the agents power both the responses and the safety. The team has deployed in US correctional facilities, mental-health clinics and healthcare organizations, and on the front line in Ukraine — hospitals and rehabilitation centers. The striking technical claim: roughly 70% of tokens are spent on safety, via classifiers running in real time and in the background, plus tools and sub-agents that plan sessions and reflect on the prior week, building powerful memory and dramatically lowering the chance of missing a safety-critical signal.
On UX and latency: rather than make users wait 20–30 seconds for deep reasoning, the background "syncing" agents insert guidance for a fast-responding chatbot, so context accumulates as the conversation continues — an architecture Nathan likened to Thinking Machines' real-time voice approach (deep thought in the background, a faster model in front). Pohrebniak was emphatic that hyper-realistic, "fall in love with it like the movie Her" voice is a deliberate non-goal — "one of the most unsafe things you can do" — and that healthcare partners never ask for it. He supports hands-free use (e.g., while driving) but not dependence-inducing realism, betting society will culturally learn this the way it learned cocaine isn't a flu cure.
On regulation, he pushed back on "wild west": the January FDA and mental-health guidance is "pretty clear"; what looks gray is the FDA lacking resources to enforce it (contrasted with the UK, where he expects products like one competitor — removed from the UK market — could face similar US action). He expects regulators to look at how people actually use a product ("I stopped seeing my therapist and now I use [AI]") rather than its marketing label, and a friend is not a regulated medical device. On privacy, he was relatively unworried about government overreach since health data sits with HIPAA-bound providers, and predicted patients will ultimately push their own data into specialized healthcare models. He closed with deployment anecdotes: correctional medical teams crediting the AI with identifying people they didn't know were suicidal, and Ukraine veterans reshaping how the team thinks about safety — "if it works well there, of course you're confident when it comes to normal consumers."
TranscriptAuto-transcript, lightly cleaned · timestamps jump to YouTube
1:35:37Elomia Health. » Yeah. So fascinating. [laughter] » And we see you picture that too. Okay, » that's cool. » So » yeah, I got multiple audios now coming my way. This is going to be interesting. » Yeah, because I had to mute my mic. Has to be better now. » Okay. Um we'll carry on. So, your um company caught our attention because you're deploying AI for mental health, which is obviously a a hot topic in the first place. But not only are you deploying it uh you're deploying it in some of the highest difficulty situations I would say in the world today, including uh in Ukraine, your home country. If I understand correctly, I believe you you started this country, this company uh in Ukraine some years ago. And um you know I don't know too many people who I feel have a a harder mental mental health situation right now than guys on the front lines uh on either side there who are you know just having an incredibly difficult time even rotating out. So the um the burden that those poor guys are under is is ridiculous and anything that can be done to give them relief sounds great. And then the same thing also with some pilot programs that you're running in United States prisons where um I have to say I guess I might rather be in a US prison than on the front line in Ukraine, but uh it's debatable. So um again a very high difficulty environment. Tell us about how you're doing it and um what the findings are from these uh frontline experiments so to speak.
» Yeah. No, actually I I would personally be on the front line than in the US than in the US prison because if you're there like we have the right to defend our freedom and you know that's an honor. So yeah would not really compare it but yeah so like in terms of what we do we uh started in so seven years ago basically and we specialized at the intersection of AI and uh mental health and currently the way we position ourselves is um like like codex or or or like uh cloud code but if it was built for mental health so unlike a regular chat But it is much more powerful uh under the hood and it uses all these like agents and it really relies on the agentic architecture to power the safety and to power the responses and I think
1:38:07the the safety is an important one here and that's why we had a chance to work within those environments including in the US prisons including like the mental health clinics healthcare organizations and um on the front line in Ukraine yes in hospitals was there in rehabilitation centers there as well. » So tell me more about the architecture the safeguards. I mean I we know the basics in terms of like content filtering you know classifiers to to raise alarm bells when needed. Um, but I guess my I I haven't really used chatbots much for any sort of therapeutic [snorts] purpose, but my naive sense would be that, you know, they probably do a pretty good job doing cognitive behavioral therapy out of the box and that, you know, I think a lot of people are using them for that. where do you find they fall short and and tell me more about what you've built that isn't immediately visible to the user to improve on those weaknesses.
» Yeah. Yeah. Yeah. So yeah, the safety I would say that it is the most important part of what we do and um like obviously like a lot of companies would want to build something for their patients for their uh customers using APIs like OpenAI API or Antropic API but they simply cannot because they do not have an in-house expertise to steer those models towards working the way um they would want them to behave and then there's like this regulatory risk that uh you would have to address there and safety risk of course. Um I think an interesting kind of like fact about us is that we spend around 70% of tokens on safety which makes the system like very very expensive. Um but here might be on the technical level we have a bunch of classifiers that are constantly running on the background. Um so it's in real time but also on the background especially when you need to think longer about certain issues or about certain risks. Um and then we like why I say that it it is similar to cloud code because it it is able to use different tools and um for example if you if you want to plan to plan a session ahead or
1:40:38to um reflect on what happened in the past week when like I mean from the chatbot perspective or from the agent perspective what happened in the last seven days and the conversation with this user and how this should inform um our plans. and our actions going forward. So the agent is able to deploy tools or maybe sub aents that do stuff like that and then kind of feed back the information into the main context and then it continues. So as a result of that you have this like very powerful memory and a very powerful planning capabilities uh that are there and like as a side benefit it also like the chances that it will miss something that is important for the user safety they are dramatically lower than that than without this kind of systems.
How are people accessing this? And is the user experience kind of the intuitive » chatbot one that we're familiar with or due to I can imagine due to constraints on you know devices on the front line or in prisons you know there may be very different um system level requirements or oddities that you have to work around. Also, with all these things running in the background, if you want to be maximally safe, you have to increase latency quite a bit to let all those things run to completion before you, you know, can get to the the output that's going to face the user. Um, and that obviously also lends itself to somewhat of a different user experience.
So, um, what does the end user experience look like starting with the device and then, um, you know, how do all those safety layers impact it? Yeah, actually the the latency is a very interesting question because if you just use uh normal large large model like cloud opus maybe it will think for 20 seconds or 30 seconds or more if you would do stuff like we do and of course that would create a terrible user experience. So the way we build it on the technical side is that we have this syncing uh agents in the background but then they uh insert their I would say instructions or guidance for the chatbot. Um so the user talks to the chatbot and it responds normally but um as conversation continues it acquires more and more context. Does it make
1:43:09sense? » Yeah. Uh that reminds me of the » you can actually » Yeah, go ahead. » Yeah, I want to say that you can actually try it because uh we also have this um B2C app consumer app that we have and that anybody can download from from the app store and from Google Play and from our experience like we had this situations situation a few times by now where a person comes and they say hey I I was using your app for four weeks now and I've recommended it to all my friends. have recommended it to my wife and you know please uh let's build something together for for my clinic or something like that.
» So the architecture kind of reminds me a little bit of the thinking machines voice um real time models that they recently described. I guess they've like sort of soft launched it to some um some early partners. the the sort of deeper thinking being in the background and then a a kind of more superficial uh faster response model being, you know, the thing that you actually hear from. I suppose that so far today, correct me if I'm wrong, but I'm guessing so far your main modality has been text. Have you looked into I mean maybe you have even done some some voice. Do other voice models support the kind of interaction that you want to create? And and how excited are you for that? um more time aware thinking machines thing become more available.
» Yeah. So voice is really important for um for the consumer obviously and the main use case that we see is that you can talk to it while you are driving your car. » Huh. That's like the No, but it's not not really, you know, because you you would actually if you talk to people, what they would want to to to see is like a chatbot or like an agent or AI that sounds like uh AI from the movie Herror, right? That is like super realistic and and you can maybe fall fall in love with it. Now, that is not something that we want to happen here.
So yeah, we try to support that like you know hands-free capability so people can use it and um but you know not not more than that uh because we also try to um I think you know because if you think about safety that's maybe one of the most unsafe things that you can do if you deploy this very real hyper
1:45:40realistic voice capability that people will become depend dependent on and uh I And there's different kind of arguments on both sides of it. But if you talk to real users who use it like once or maybe two times, you will understand that u it's just you you should not build a hyper realistic AI. It's not not going to do any anything good. And I mean the movie her kind of shows it. » Yeah, that's interesting. Do you think that the market will stay there on its own? I mean, one thing I have kind of been a little surprised by so far is how little we've seen in terms of like highly polished products for kids. Um, you know, I've got a couple little AI math tutor apps and stuff, but relative to what seems like it's probably possible, the market hasn't gone as far as I would have expected with kind of AI embodied, you know, or stuffed animal AI friends for kids. But I still kind of think that's coming and I think it's going to be up to me as a parent to sort that out. for the most part. You're in a more regulated, protected, licensed kind of space, but still it's all kind of wild west right now. Um, what do you think the competitive environment is going to look like if somebody pops up and is willing to go where you're not willing to go with a hyperrealistic voice? Will they win in the market? And will, you know, do you think this is the sort of thing that ultimately the government needs to put rules around or maybe the market can sort it out? I I don't know. It seems like it's a I have a hard time imagining that we sort of stay in the current state where we refrain in a voluntary way from making things hyperrealistic because it just strikes me that that's what people will gravitate toward. But uh how do you see that playing out and and what if any intervention do you think would be wise?
» I don't know the future. So the but I can kind of speculate right. So the but the way the the market currently looks like is that you can win the consumer with something hyper realistic. You can win the consumer with a product that praises you that that tells you that you are always right. I know we already see
1:48:10that happening with tragic. Um the question is really like what did they win? Uh because the way I see they only win the a bunch of lawsuits against them, right? Uh I don't know maybe the when we talk to to healthcare organizations for example they never ask for something hyperrealistic or for something that may create even the slightest risk uh for the user. But in terms of um no talking about and thinking about the future in and if I were to speculate uh I would say that um yeah that that's a good question. Uh I would say that uh you know like with many things um um you know the I heard that they were using cocaine uh to treat the flu back in the day right but then it kind of learned that it's not really healthy and maybe that's not you know the best uh medication out there and I think that uh something similar to that may happen where we as a society develop this kind of knowledge uh that is embedded in our culture that that that is something that you should not do even if it feels really cool at first. Uh I mean I would bet on that. Um I I think like in terms of regulations, yeah, some people say that it's a wild west. I would say that it's a wild west only for those who did not really read uh these documents that reged because the way I see them they were pretty clear and especially the latest edition that we have from January both from MH and from FDA. I would say they're pretty clear. So I would not say wild west. At the same time the way I see it is that the FDA may lack resources to enforce the regulation at the moment.
Um, and that's essentially what the so-called gray area is, is the lack of funding maybe to enforce the regulation. Now, it it looks like that in the United States. It looks a little bit different in the United Kingdom where a major has resources and has funding to enforce the regulation. And, you know, that's actually what happens to hash, for example. And I think ash is a really great product. And I know that a lot of people love it. But um yeah, they you
1:50:41know um they they they were removed from the market in the United Kingdom and I would not be surprised if something like that happens in the United States as well. » Is there a clean line to draw? I mean the one challenge is going to be of course resources for enforcement although um we may be soon entering the AI powered panopticon. So there is certainly a chance that um everything will be enforced beyond our wildest imaginations. But then I also wonder about what is really the difference if there is there a sharp difference that we can use to distinguish a mental health app versus just sort of an AI friend and companionship app because it sort of strikes me that like it may come down to specific marketing claims. I mean, is that kind of where the the distinction is going to be drawn? And and isn't there always a way for an AI friend or companion app to be a little bit more the cocaine that um the consumer may demand and or at least crave » um without explicitly saying, you know, that we are a mental health uh you know, built purpose purpose-built product.
But still you could even imagine doing like if a friend uh applies some like cognitive behavioral therapy techniques in conversation with another friend that's like not regulated right and it's also I think it'd be very interesting to see where » a friend is not a device because they regulate medical devices and a friend is a human he's not a medical device. So that's kind of like » but I mean but if I'm positioning my AI app as a friend, do you see a clean line that a regulator can come in and say you have crossed the line from friend to something more than a friend in a way that enters our jurisdiction?
» Yeah. » What is that line? » Yeah. I mean the the way I see it and I do not have any insight but the the way I see it they um I think they would they would look at how people are actually using your product. So you can say that it's a friend or you can say that it's like I
1:53:12don't know math tutor or whatever. But if they see people talking about your product like it's a therapist or they saying things online like um I don't know like I stopped visiting my therapist and now I use JPT and it's much better that's what they will use as as as evidence to you know to kind of sue you but you know like what I said previously now the I'm not sure if there is anybody who knows for sure but Um, so the lack of funding is like an obvious reason, right, why why they would not want to enforce it at the moment. But another reason that I think is also true is that uh they are still waiting to see how it develops because I think that essentially the the last thing that they would want to do is to prevent really helpful and useful products from appearing. So maybe that that's that's part of reason as well. But the in terms of like the document documents and the regulations that were published, I think it's pretty clear that uh um I mean I think even Yeah, I think it's pretty clear where the line is.
» Interesting. Okay. Is there I guess one more question. Um I know we're running late and I apologize for our rough technical experience here. how does privacy play into this? Obviously in the US we have I would say kind of in some ways problematic medical privacy laws because I think certainly with AI to process all the information the public at large stands to benefit greatly I think from better information sharing more ability to mine information that's currently locked up behind HIPPA and whatever other you know kind of constraints. um what you're saying in terms of people, you know, just observing how people are using the product. Obviously, some of that, you know, will kind of make its way onto Reddit and whatnot, but the real where the rubber really hits the road is going to be show us your logs.
And open AI is like pushing for a sort of right to privacy in terms of your chats with AIS that would in their mind rise close to if not all the way to the level
1:55:43of the same privilege that you have with a a licensed attorney. Um that we may not get quite that far, but I think there are a lot of reasons to to value that, right? I do I don't want the government like, you know, panopticoning all of my AI chats. When I hear about, you know, things like what you're describing, I do think, yeah, well, some uh oversight, you know, definitely could be prudent. Do you have a a sense for how we should think about striking the right balance there between privacy and visibility for authorities to police this stuff?
Um to be honest I do not feel like there is any issues from the governmental side because the if you are talking about health data it is been holded byers right so now if they may not have an incentive to share this data with with the jpd um and they I I think it's more more of that because we also work with the healthcare providers and obviously when we work with them um we operate in a hypoaco compliant manner like we are HIPPA compliant um and uh yeah I mean I did not see any complications in that regards um so yeah I think and I think you know if my if you know if you would That's what I what's my take on this future? I think that uh few years will pass and it will all become open just because patients would want that to be open and they would want to export their data and upload it into JPT and I think that OpenAI does a really great job uh with u creating specialized models for healthcare.
Um so I think that that's where I kind of moving into um because you know that that's how the market works as well right if there is a company the medical provider that that says you know it's easy is it's really easy to export your data into jabpt if you are um you know if you if you are our patient of course uh if there's a need uh consumers and
1:58:13patients they do prefer to use that medical provider and that medical provider gets a competitive edge, the other companies would be forced to do the same. And I think it will just, you know, pay out. I guess maybe one more just to wrap us up. Um, are there any anecdotes from your deployments in Ukraine or in US prison populations that you think are particularly memorable or inspiring that you would leave people with? or failing that just anything else you would want to uh leave people with in terms of a positive vision of the future?
» Uh yeah, positive. Yeah. Uh yeah, I mean the many there are many stories like that. um like uh in the correctional environment, we heard numerous times from medical teams there that um our AI helped them identify people who they did not know were suicidal and because of that they were able to uh provide better care and to potentially maybe potentially save uh their lives. Um, in Ukraine it's also very interesting because we like before we started working in Ukraine, we thought of ourselves as like these big safety experts. And that's until you meet the first veteran and you know and you get exposed to what they face and how their lives look like. And I would say that our work with the veterans and with the soldiers, it um shaped a lot of what we do in safety because if you test that under those extreme conditions and it works well there, of course, you know, like of course you know very confident in your safety when it comes to normal consumers. Um and like I think that uh more and more people realize that um they do not have to be left alone and that they can use AI. Um and um that that's a good thing you know and even chart GPT you can you can you know there are still people on X who write about uh GPT for all and you know all the things around that but even GPT for all uh if you were talking to it it would say you
2:00:45know hey go and get professional help um it you know and it would help you it would maybe help you to uh be prepared for that and answer some questions. things about that we are more ready and especially and now now it's even better you know GPT4 is is now model but now I think it's executed much better than that » GPT40 uh rip I guess uh soon if not already um thank you Charles fascinating conversation and um » thank you for having me » thank you for serving some of the people who need it most in this world. Um, obviously the I hope the Ukraine war ends soon and that that uh, you know, extreme market circumstance dries up for you, but even if it does, the uh, » US prison population, I suspect, is going to be here and is going to be, you know, in bad need for the foreseeable future. So, keep up the good work and, uh, we'll be following Alia Health.
» Thank you. Appreciate it. » Great to meet you. Bye for now. » Bye.
Ai2 — can AI actually do science?
Peter JansenTheorizer's 3,000 machine-generated theories, the random-number-generator "discovery," and why the bottleneck is verification — not ideas.Peter Jansen, research scientist at the Allen Institute for AI (Ai2) and associate professor at the University of Arizona, traced AI-for-science from teaching models fourth-grade science (terrible pre-ChatGPT, then 80–90% almost overnight) through Herbert Simon's 1980s framing to today's reinvigoration. On the "Eureka moment" question, he gently pushed back on Nathan: the solved Erdős-style problems are mostly incremental, master's-student-level discoveries rather than paradigm shifts. His newest project, Theorizer, targets the higher-level activity — shovel hundreds of papers in and generate organizing theories: one run produced ~3,000 theories across ~100 topic queries by reading ~14,000 papers, each theory shipping with supporting evidence, proposed confirmatory experiments, and (his favorite) "high-entropy" edge-case experiments designed to break it. His paradigm-shift exemplar: plate tectonics unifying volcanism, earthquakes, and continental drift.
The most memorable cautionary tale came from Ai2's CodeScientist: given 50 research ideas, it claimed 19 discoveries; three colleagues judged ~70–80% of the papers at least incrementally novel and sound — but a line-by-line code review dropped that to ~30% real. One "paper" with hundreds of lines of novel neural-network code turned out to contain a comment, "insert rest of neural network code here," and a function that returned a random number — the entire paper was analyzing a random-number generator. He paired it with the benchmarks: the best models get ~80% on ScienceWorld (fourth-grade science) — so they fail to "boil water" 20% of the time — and are "really terrible" at DiscoveryWorld's master's/PhD-level tasks that human scientists mostly solve. He cited the Stanford (Si/Hashimoto) result that AI research ideas were judged more promising up front, yet the human ideas produced better results once actually run — and Feynman's line that science is "about not fooling yourself when you're the easiest one to fool."
Looking ahead, Jansen agreed Nathan's default vision — a reasoning model with deep multimodal (Gemini-Omni-style) integration running grounded thought experiments — is reasonable, but stressed the current intermediate: models use code as the bridge, emitting Python that represents a problem and calls domain-specific solvers (the AlphaFold lesson being that it folds proteins and only proteins). He was measured on recursive self-improvement, calling it partly a repackaging of decades-old ideas (Lisp self-modifying code, AutoML search), with largely incremental improvements — "I don't think it's going to be a huge game-changer, but I've been wrong about a lot of things before." Notably, mid-interview, Anthropic put out a statement that it was starting to see recursive self-improvement. On Ai2 itself, he downplayed the Nathan Lambert departure speculation: people move between tech companies, Ai2 remains big in language models, and a recent NSF investment funds language models for science.
In the debrief, Prakash offered a framing Nathan endorsed: AI inverts the scientific method — instead of hypothesize-then-test, you collect data first and let the machine discover the causality, which is why ML "shouldn't work, but does." Nathan took the bitter-lesson view that the flagrant failures are real but temporary (his radiologist-and-Ben-Todd point: the prediction "just hasn't come true yet"), and that the real near-term bottleneck in biology is data, not algorithms. Prakash drove that home with the economics: getting ~100 samples of a specific cancer can run ~$600K through Stanford Health (onboarding plus per-sample cost, HIPAA and chain-of-custody included), so a Google-trained researcher used to billions of data points hits a wall of "200 samples this year." His closing line: "the algorithms we're going to use to cure cancer already exist, but the data we need is not there yet."
TranscriptAuto-transcript, lightly cleaned · timestamps jump to YouTube
2:01:58All right. Hello, Peter. » Good morning. Thanks for having me. » Too easy. Sorry we're late. We're having some technical difficulties today. And um this show is in part a live demonstration of recursive self-improvement. So studio that we are in is vibecoded by Pash and he was debugging and fixing some of these uh connectivity issues live as we went and it seems like we have it took us still the third guest today uh but we've we've got it fixed. So you're Peter Jansen you're » joining us from Allen Institute. Um, you're also at the University of Arizona and you have been, if I kind of read the trajectory right, on a couple year journey from measuring the deficiencies in models ability to do science to now, and correct me if this is not the right characterization, but now it seems like you're kind of saying, "Yeah, they're kind of getting there, and it's time to actually start to take advantage." So, tell me the story of your uh last couple years, and then we'll really dive into the frontier. Yeah, that's certainly a way to to place it. So when I started in this AI for science space about uh 10 or 15 years ago, we were all trying to teach the best models we could had to do fourth grade science, right? These are really basic tasks. You know, if you put water on a stove, what's going to happen? And you know, every every kid knows what's going to happen. But the models back then were just absolutely terrible at these things, right? And then fast forward um to certainly the the sort of chat GBT revolution when that came out all of us who had been working on fourth grade 8th grade 12th grade science had the models sort of overnight be able to go from you know 30 40 50 60% to you know 80 90% very very quickly and so then we really started to wonder hey I wonder if we could start applying these things to to the sorts of problems that we care about every day right uh as as scientists um certainly you know today one of the tasks or backing up a little bit AI for science isn't a new field AI for science has been very seriously studied and since the 1980s um which is sort of when it was formulated by a fellow named Herbert Simon this very famous computer scientist and before the 1980s people used to think that scientific reasoning was this sort of very special kind of
2:04:28reasoning because it seemed to involve lots of creativity and lots of specialized domain knowledge. And then Herbert Simon came around uh and said, "Well, wait a second. Maybe we can use existing AI techniques like searching through large spaces or optimizing problems or whatnot in order to model this productively." And the 80s was a great time for AI for science. Um and then it kind of uh settled down a little bit. um even though it had a lot of big successes back then, there was equation discovery, so fitting data to equations.
There were um really famous results in trying to discover something called undiscovered public knowledge. So you can picture scientific papers as sort of these puzzle pieces, right? And what happens if something from paper A and paper B could fit together, but nobody's noticed it yet? And a classic example of this is that my wife suffers from migraines. And one of the classic treatments for this is to give uh the patient magnesium. And that was an AI for science discovery that we use today because they discovered that the pattern of you know metabolic deficiencies in uh both magnesium deficiency and migraine happen to be the same thing. Fast forward to today, um, we there's sort of this reinvigoration of AI for science people. Um, you know, there's no researcher I don't know who doesn't fire up chat GPT all the time, right? and you know say hey can you help me with you know finding papers or my research or sort of any other task or even just bouncing ideas off of and so I think there's certainly been this reinvigoration where people have said wait a second if there's this all you know very powerful tool can we use it to do the kinds of things we do every day as scientists » so where are we now you just put out this project called theorizer And you know, I've kind of had to I try to maintain what I call the AI scouting report, which is like, you know, usually a breathless one hour of of trying to tell people uh to the best of my ability, here's the current state of affairs in AI. And of course, that needs like regular updating. I used to say things like um no Eureka moments.
You know, as of GPT4 era, it was like AIS are closing in on professional performance on a lot of, you know, small and medium-sized tasks, but no Eureka
2:07:00moments. Well, that's certainly been blown away, right? We have um novel math problems getting solved, including increasingly important ones. And so we do have Eureka moments even from pretty vanilla LLM long rollout systems. Um there's a lot more obviously I think on the horizon in terms of integrating you know with wet labs and closing a loop that way and training on other modalities and integrating those. And I maybe I want to get your take on all of of that as we look ahead into the future. But as it stands right now it seems like we're like no Eureka or yes Eureka moments. you are getting eureka moments but maybe the next thing is like but no paradigm changes um which is you know that's a pretty high bar um » but I I think I get the sense that this is what this kind of theorizer project is is really probing the the difference between right like » solving an unsolved problem within the paradigm versus actually kind of shifting the paradigm where do you see the frontier of possibility today » yeah that's a great question so I'm gonna I'll push back a tiny bit on one point that you said first and I'm just realizing ing as I said that pushing back is something that the AI models all say these days. I'm really starting to sound like one, which is terrifying. Um, so that big question that you started off with, which is, have there been any Eureka moments right now? And certainly, you know, you open up your Twitter feed every morning like I do and you just see this giant sort of deluge of people saying that they've used AI for this and that and whatnot. My I'm not a mathematician or a number theorist. My best understanding is that most of the like the aerdos unsolved problems and whatnot that have been solved that not too many of them are sort of mindbendingly, you know, obvious paradigm shifts or or big eureka moments, but certainly they're making discoveries, right? In the past two years, we've seen them make what I would probably call a lot of incremental discoveries. Um certainly at AI2 we've had a bunch of sort of code generation systems where they come up with ideas, they run the code for them, we validate the code and yeah it seems like they've made some pretty incremental discoveries sort of at the level of like a master's student. Um but to your point about sort of paradigm shifts which is I think the really exciting thing right so you know when we wake up in the morning we're probably as AI scientists who do AI science we're probably not hoping that we're going to get you know the next incremental result you know wow this model is 0.1% better today than it was yesterday that's not really super
2:09:31exciting but we're hoping to really change how science is done or find new problems or new ways of thinking about problems and certainly if an AI I scientists can help with that. That would be fantastic. So there's this um AI scientist uh Arvin Nurion and he has this criticism that I really like about AI science right now um in his book uh which I think is AI snake oil and he says that if you have a lot of AI scientists or a lot of anybody sort of working on doing what we would call sort of experimental science which is I think what most of us would call sort of breadandbut everyday science. you get up in the morning, you come up with an idea, you run it in the lab, uh, and you write up that experiment.
He says that if you have a huge amount of experimental science, then it actually counterintuitively is slow for or it slows science because, you know, you wake up every day and you don't know what to attend to. There's a certainly I can't attend to everything that's going on even in my sub area. And so he says that this slows science because science isn't just about doing experiments. It's also about coming up with theories or making paradigm shifts. These sort of higher level scientific activities which were framed by um Thomas [ __ ] this famous philosopher of science uh back in the 70s. And so what theorizer gets at is it says well wait a second um could we take uh data could we take scientific papers by the hundreds and sort of shovel them into this theorization making system and say okay tell me human scientists what kinds of problems you're interested in and then I'm going to go off and read hundreds or thousands of papers and try and come up with theories that automatically organize all the experimental results that have been found in that area in a A that would be really hard or really effortful for humans to do. And so that's what theorizer has done. We ran this on I think a hundred different uh sort of we call them theory queries just make me theories about X that were this broad cross-section of AI topics that we automatically sampled from the literature. And it came up with 3,000 theories um on topics that were ranging from sort of everything you can imagine, intelligent tutoring systems to algorithms. um and by reading I think about 14,000 papers. And so that that seems like a big shift, right? That seems like if it can help us discover not just a new experiment to run, but
2:12:01better experiments, better questions to ask, better research questions to ask, better ways of framing things. Um it would be fantastic. A classic example I can give you of this is that um my classic theorization example and paradigm shift example is that if we had lived about a hundred years ago then um and we were geologists then there would be all these people studying volcanoes and all these people studying um earthquakes and maybe some people in geography would have noticed that hey if you sort of look at the continents on a world map they kind of fit together. Doesn't that look weird?
Um, and there were theories about all of these things sort of separately. And then someone eventually came along and said, "Well, wait a second. If you imagine that the Earth is built on these big sort of shells or plates called tectonic plates that kind of rub together every now and again, it actually explains volcanism and earthquakes and continental drift and a bunch of other things, but also allows you to ask a lot more productive research questions and it completely changed the field. So, wouldn't it be wonderful if AI could help us do that?
» Uh, yes, as long as we can keep the recursive self-improvement process on the rails as we go. Do you um have any highlights from the theorizer work that you would say were like particularly, you know, useful theories? And I guess how easy is it to evaluate theories? Because this seems like a key bottleneck, right? If the theories are are hard to evaluate, you have a a hard time getting that flywheel turning. Um, so how do you sort of structure them to be as easy to evaluate as possible? And and are there any that really, you know, rose above the rest and and stood out as meaningful contributions?
» Yeah, certainly. Um, so there's a bunch there. One of which is how on earth do you go about evaluating a theory? And then number two is have we found any sort of exciting discoveries yet? I'll I'll sort of frontload it with the last one, which is I'll give you two quick stories. Um, story number one is when we designed theorizer, I had these ideas about how I would apply it in the sub area that I focus on a lot in, which is sort of agents and virtual environments and bolting on memories to them and whatnot. And so when I was testing it, I would give it a lot of these sort of highlevel queries and it would come back with um theories that seemed very
2:14:33reasonable and are things that I myself had sort of hypothesized but I had never written down. Um things like if you're making an agent that's for a specialized task and you're bolting on a memory to it, which is a very popular idea these days, that you might want to match the kind of memory to the kind of task and that if you have task matched versus task mass mismatched memories, then you know you'd get different performance things of that nature. There's sort of lots of theories. Some of them aren't necessarily mindblowing um uh but they sort of help you formalize a discipline that until now has sort of been really um sort of haphazard isn't a really good word but experiment driven is maybe a better uh word to use there. Um, I can't share any super exciting discoveries yet, but I will say that we're been working with a bunch of partners um who uh in the medical domain um and they seem to be getting a lot of utility out of it. Uh so hopefully um hopefully we'll have some exciting stuff to share on there um in the near future. Back to how you evaluate theories. Well, one of the one of the funny parts about Arvin Nurion's AI snake oil criticism that you know if you have too much experimental science, it makes it hard to see what's going on and sort of filter through it to do the higher level scientific activities like theory building. It turns out if you have a machine that you can just press go and in a day you can get 3,000 theories by reading the entire literature of of an entire subfield.
Right? It turns out now you almost have that same problem, right? you have so many theories and it's effortful to go through theories and look through them. Um, you sort of have to look at them and they they're designed to be as easy as possible to digest. They they give you the theory or the law that they've sort of proposed. They give you a long list of supporting evidence and all the sort of papers that they've come from. They propose experiments for you uh to think about. Um, some experiments are sort of easy experiments that are, you know, if this were true, what would you expect to also be true? So those are sort of confirmatory experiments. For me, the ones I get the most utility out of are the ones that we call high entropy experiments. And those are things where it's experiments where if this theory kind of held or if you really wanted to sort of test this theory to see if it broke, you would give it this research question. and it's sort of an edge case and you really don't know what would happen uh when you uh if you were to test it in that way and those are super
2:17:03exciting. So yeah, it's a it's definitely an effortful process though and I don't think we've ever had a um a machine a system that could produce theories and so learning how to best present them to users and humans in a way that they can digest as an open research question. » [snorts] » Yeah, there was a really interesting experiment that you're reminding me of around AI's ability to come up with new ML research direction ideas. And I think this came out of Stanford. I I did an episode of the podcast on the first leg of it. Um, and basically the finding was like the AIs could come up with better with Changlay C from Stanford. Can AIs generate novel research ideas? And the first evaluation was having humans go through and look at all these different research directions and evaluate them.
And the headline was the AI ideas were judged to be » more promising than the human ideas. » To their credit, they followed up on that and actually ran a bunch of these experimental directions and then went back to the human expert panel with results. And then it was like okay well now which ones have actually yielded better results and the human ones though they were judged less promising up front did in fact uh generate more interesting results later. So it is it is clear that we still have some um both sneaky advantage I guess in that space at least for now and also a very difficult time it seems separating the quality of the core ideas from the surrounding presentation trappings all that sort of stuff. Um the AI is obviously really good at all that so it's like um they even took steps I believe to correct that by like having AI kind of polish up the human writing.
So they they even tried to control that effect, but it seems like maybe didn't fully manage to uh control for that effect. I'm not sure exactly what the what the mechanism is to explain why that reversal happened, but there there's definitely something really interesting there. When you think about kind of moving to paradigm shifts, on the one hand, yes, it's like very effortful to wade through all these new things. On the other hand, you can tell maybe a story that's like we need all these complications
2:19:33of the current paradigm in order to build up enough uh you know what do they call them the uh the little epo epicycles epicycles right was the um gradually acrewed in the in the motion of the planets that finally led people to think there's got to be a cleaner explanation for this. So I I wonder if we're in a phase of, you know, we're just fundamentally labor limited and we're also maybe maybe incentive limited in some ways because as you said, you know, scientists don't want to get that 1% improvement. But first of all, there's just probably a lot of value in grinding out the epicycles that we can grind out within the current paradigm. And then second, I wonder if that is a really kind of necessary step to create the data to then feed into the big brains, whether they be human, AI or some sort of hybrid at that point to have that really, you know, kind of uh paradigm shifting insight. How how would you react to that story?
» Oh, wow. There's a lot there. Um, so yes, uh is the short answer. Um, the longer answer is, you know, back when I worked in fourth grade science, my life was a lot easier because you put the water on the stove and what happens? It boils, right? There's a clear answer to it. The most exciting I love heart problems. The most exciting part about my day as a scientist is when I get to work on something where I don't know the answer and it's going to take me a while to figure it out. And I know that there's things that are just unknown about the space that I get to discover.
Every theory is wrong. And that's one of the first things they teach you in grad school, right? Every theory you've ever heard is wrong. But a lot of them are useful. And as long as we can create better, more useful abstractions that help us really work through the problems better, I think you get utility out of them. Another favorite historical example of this is that for a really long time, we thought that the reason why people got sick was this theory called miasma theory that it was bad air, right? And it actually explains lots of really wonderful things about sickness. So, oh, there were a bunch of people who were together in a room and then they all got sick a couple of days later. Oh, what
2:22:03happened? It was bad air, right? And you can see a bunch of reasons why that is explanatory and it's actually predictive, right? It it explains a subset of why people get sick, the idea of airbased transmission of uh viruses, but it doesn't explain a lot of it, right? And eventually after people chugged away on that theory for a while and iterated on it, they came up with a germ theory of disease, particularly when better instrumentation was available like a microscope and you could see these little things crawling around. So certainly um certainly chugging away on theories um sort of banging on them as these sort of abstractions of problems that you know I think the lay person might expect that a theory is this like immovable object that we sort of believe just as is.
That's not how scientists use theories. we use theories as this sort of living abstraction that we sort of beat against um in the hopes that it leads us to a better abstraction uh tomorrow and or a year down the road that helps us ask better questions and make better progress on the problem. And so in that way certainly certainly given how effortful that process is uh how effortful the theory building processes if we had a way of mechanizing it I think that it allows you to do two things number one make faster progress and then number two um certainly on something like um medical uh domain research where you're really in like a resource constrained environment right needs of the many outweigh the needs of the few. If 90% of people suffer from, you know, X manifestation of the disease and 10% suffer from Y, you probably got to focus your money on the X rather than the Y.
But if it now is faster, cheaper, uh, to be able to do underexplored research, I think we can gain a lot from that. » Yeah, that reminds me of that story of the guy who used Chad GPT to find a cancer treatment that saved his dog's life. uh when obviously that was not something that was going to mobilize great scientific resources if it wasn't for the AI to make it accessible. Um, I guess another story I'd be interested in your take on is integrating other modalities with reasoning models, which is something that, you know, kind of one Eureka moment arguably that we've had in AI for science is like the alphafold line of work where we used to have to do a whole PhD's worth of tedious crystalallography
2:24:35to get a protein structure and now we just guess them and it seems to work like roughly as well. Amazing. Um, obviously models already can use those things as armlength tools, but we're also seeing with and I think Google's Omni model and Luma has, you know, there's there's various examples of this. Runway I think is kind of doing this too, but you've got these deeper integrations of reasoning and so far pixelbased modalities, image and video. But you can see clearly that there is this ability to reason and ability to operate in pixel space h uh that is connected at a very deep level right so it's not a lossy armslength tool call anymore but really some sort of joint understanding and I'm expecting that this is going to be a huge deal for a ton of different areas right because what we don't have as humans is the intuition for how this protein is going to fold or what the band gap is going to look like in this you know new semiconductor doping recipe or you know any of any of a million other you know super esoteric things. Um but the models models generally do seem to be able to form that and we do have enough proof point that it seems like those intuitions can be integrated with reasoning that I guess my default vision of super intelligence is a reasoning model like what we have today that has an a Gemini omni sort of integration with all these other physical science domains such that it can kind of run thought experiment experiments that are actually really well grounded in a sort of intuitive physics and then also reason about them in a a kind of, you know, higher order language uh deductive sort of way uh like the best human scientists do.
What's your take? You know, um you could call call me out for AI snake oil if you want to, but I I my bet is that that's coming. What do you think? » Yeah, I don't think that you're wrong there. Um and I think that that's a very reasonable way to expect that the that the near future might pan out. Certainly historically language models have lived mostly in text and then in the past few years there have been just wild advances in making the multimodal um and a bunch of that training um has happened at AI2 and we certainly have um just wildly impressive uh open and fully
2:27:07open um multimodal models. I think that there's a few there's certainly ways that make this promising and there's ways that also make this challenging. Um right now for most of the work for example that I do a lot of the content of a scientific paper is just unavailable to me. So if I wanted to look at the figures of a paper, those are especially for like complicated material science figures that I as a human have a great deal of uh trouble figuring out. It's very very hard to sort of interpret those and figure interpretation is um is a very open problem. It's been an open problem for a long time. I think we're making pretty good progress on it, but certainly it's going to take a while. I don't think that's going to be solved next week or anything. um to the to the sort of alphafold example that you gave where the models have to work in these really highdimensional spaces. I think that I think that what you're actually getting at there is a deeper intuition of what's going on in the field right now which is this transition from sort of problemsp specific methods is what we call them to sort of domain general or problem general methods and so alphafold is a great example of this. So it it was tremendous work. Uh I think it won a Nobel Prize recently and they basically built a machine that could fold proteins and predict protein structures with wildly better accuracy than sort of anybody had ever done before. And this is a wildly useful thing that people researchers in biology use every day.
But the way that they solved it is that they solved it by building a machine that could fold proteins. It could represent those proteins and only represent those proteins. Right? if you asked it a question about, you know, geology or anything else, right? It's just not going to work. It's like a It's like a a chess playing, you know, deep blue. If you asked it how to play chess, it would be fantastic at it, but if you asked it at any other thing, then it wouldn't be good. certainly these sort of multimodal spaces if you want to think about it that way. Um representing a protein as this sort of highdimensional space and you got to figure out how to represent it in the best way that that helps solve whatever problem you're trying to solve. Um but then you also sort of have to work through it. That's certainly a uh a big win, right? If you could imagine a language model, you know, Elmo 10.0 know or something that could dynamically represent any kind of highdimensional problem and work through that space. Um, you would probably it would probably be
2:29:37a big win. I will say that right now what we do is almost an intermediate of that which is instead of the model thinking in that high dimensional space or at least I'm making I'm making that assertion or that assumption right now I can't reach in and look at what it's thinking. Um what we do is we use code as an intermediate and so you'll the model will output Python code for example that represents the problem in some computable form and then either it or uh it'll make an algorithm or it'll call some external library for protein folding or materials this or that that solves it. And so that's probably a pretty good um intermediate form and that's certainly where people are making a lot of progress these days. we get we get a lot of utility by by sort of having an interface to these domain specific simulators and solvers for proteins and whatnot. Um and sort of the the human interpretation of those.
[gasps] So just today while we've been live on the show, Anthropic put out a statement saying that they are starting to see recursive self-improvement happening. um exactly what form I have to read, you know, deeper into their um into their statement to see what they've disclosed, but you can sort of imagine all of these things happening there. You know, one big challenge of course is you got to have the raw data to to build some of these u other modality models. But it does strike me that the frontier companies are going to be in a pretty interesting space when it comes to maybe even teaching models to sort of natively understand ML architecture space, right? They they could you could imagine them training on like experimental setup predict loss something along those uh lines. And if they get good at that and that becomes sort of an armslength tool, you could really imagine things accelerating at a few companies that have the the depth of the experimental record in like pretty interesting ways and potentially pretty exciting ways, potentially pretty scary ways. I'm always reminded too of the just how small the data set was that was used for alpha fold like tens of thousands of proteins not you know not millions not billions not trillions but like just tens of thousands maybe upwards upwards of a hundred thousand but like a a small number by you know any sort of big data uh definition. So
2:32:09plausibly in the ML experiment space, you could imagine a similar thing where you may not need a zillion experiments to get your models to, you know, kind of I I sometimes visualize like uh vacuum pulling uh a wrapper around some, you know, uh conceptual space until you just get tighter and tighter tighter down to you really, you know, you really understand the shape of that space. I could see that happening in ML. Um, this obviously takes us to, you know, strange places pretty quickly. How how fast do you think this is going to go? Are you um buying into the country of geniuses in a data center 2027 2028 time frame? Uh, and if not, like what would be the what would be the core reason not?
» That's a great question. So, um I think that sort of like a lot of other AI terms, um the idea of recursive self-improvement is potentially like a repackaged version of fairly old ideas that aren't necessarily um super shocking to uh AI researchers. So, one of the oldest uh programming languages that was used for generating AI code for decades, which is called Lisp, actually lets the code itself sort of run code that it it is itself generated. Um and so people have been exploring sort of recursive code generation. Um the code the AI model generating its own code for for longer than I've been alive. Um more recent stuff um in auto machine learning is what it's usually called or autoML has models that sort of try and try and make better models themselves. Um and this has been around it's been explored.
Usually it looks like a search space problem where you know there's 20 different knobs you can turn and they each have a large number of possible values. And so you just got to hope that you turn all the knobs in the right direction enough times that it gets you to something that's better than before. I can certainly imagine that with language models people have been doing this work for at least a few years. Um my colleague Kyle Richardson uh has been doing a lot of sort of automated um you know he calls it language modeling for language models um that sort of recursively does this and tries to get them better. So that in a way isn't surprising to me. I think that the improvements are largely fairly incremental. Certainly the models have been um primed with synthetic data that they
2:34:40themselves had generated for a long time. So well it well it sounds sort of super exciting and you can sort of imagine turning on the Sci-Fi channel when you were a kid and you know watching the outer limits and that somebody would plug in this model that itself sort of gained intelligence over a day or something like that. I think that that's maybe a a little bit distant from from the reality and that there's sort of an established uh an established trajectory over the past couple of decades that you know we're comfortable doing this and um I don't think it I don't think it's going to be a huge gamecher but I've been wrong about a lot of things before.
» Were you um were you on the Curtzwhile uh prediction path? Because I I do feel like I am looking at the sci-fi future right now. I don't I don't uh and I was, you know, Curtzwhile aware 20 years ago, so I wasn't it's not like I wasn't even exposed to those ideas and I was open-minded enough to take them somewhat seriously. But even having seen 20 years ago the Curtzsw Wild graph that puts, you know, human level intelligence right around now, I still feel like I am kind of living in a surprising reality.
Has 2026 uh AI progress not surprised you? » Oh, it surprises me and doesn't surprise me all at the same time. Uh some days I wake up and I feel like I'm living in the future and other days I wake up and I feel like I'm, you know, living in this sort of strange reality with all these agents that can't do the things that I want them to do. I'll give an example that's really grounded in um the in AI and scientific discovery. So we have this project code scientist uh which looks very similar to a lot of the projects that sort of you pull up on Twitter every day u where people say you know I made this AI agent and um which is like a thin wrapper on some openai or cloud or whatnot and it generates code automatically and it generates ideas automatically and runs in a loop and away we go writes papers and so we gave it um 50 research ideas and let it chug away for a couple of days and after uh a few days it came back and it said, "Well, I've discovered 19 new things." And we were very excited, right? Wow, 19 new things. We live in the future. Life is great. All that jazz. And so it wrote papers on those 19 new things. And we gave those 19 papers uh to uh three
2:37:11colleagues at AI2 who hadn't seen it before and said, "Tell me if this is a real discovery. You know, look through these papers. Here we go." And they went through them. And I think it was 70 or 80% of the papers. I said, "Yeah, it's probably at least incrementally novel and minimally scientifically sound and whatnot." And then we convinced uh somebody me um to go through and spend days and days and days looking at the thousands upon thousands of lines of code that these models were generating uh to support their discoveries. and it went down to like 30% of the discoveries were probably real. And the things that you see are absolutely all over the place. One fun example is uh the AI came up with some fancy idea for making a new neural network architecture with some fancy new kind of attention. I don't know. It wrote hundreds of lines of Python code with all this neural network code that I have absolutely no idea what it was doing or um can't understand any of it. And so I'm going through and I'm like, how on earth am I going to review this? This is in my domain area. And then I get to like the end of a couple hundred lines of code and there's just this comment that says comment insert rest of neural network code here. And then it picked a random number and returned a random number from that function. And so this model, this paper, this entire paper was analyzing the values of a random number generator. um and that you know isn't shown to you know nobody knows that if you're reading the paper but the science itself you know it's hard to evaluate it. it's hard to be um sound. And so a lot of the, you know, when you see it do something amazing, it's easy to be very impressed.
But then when you use a standard benchmark like we have science world and discovery world, these um sort of virtual environment benchmarks, science world does fourth grade science. Um discovery world does uh sort of like a masters or PhD level science. The best models right now are getting something like 80% on the fourth grade science. So you ask them to go in this environment and boil water and they can't do it 20% of the time, right? That's wild. Or you ask them to go in and give them a toy task. You know, the colonists on planet X are getting sick. Figure out why and solve it. Um they're really terrible at that, right? They can't solve most of those. Whereas you give those to real human scientists and they're get most of
2:39:41them. So, so it's the summary of that is it's really easy to be excited when they work well, but you got to pay attention to all the really simple ways that they break um before you get too excited. I think that's not to say they don't have utility, and there's lots of places that they do have um lots of very near-term utility, but I think my job is safe for a little while. Yeah, that that again reminds us of the importance of verifiability. When you have to spend days and days going through all the papers, it becomes a real challenge. But if you can find ways to surface the moments of of brilliance without getting uh sucked into the um the fakery, then you can really get some some cool stuff out. I don't know if there's anything else you want to talk about on the science front. I'd welcome um anything else you think I should have asked about or you know things that you think are kind of underappreciated. I also wanted to ask one question at least about um AI2 and just kind of get your uh take on like what the future is for it. I saw a friend of the show Nathan Lambert just uh announced his departure the other day and so you know people are kind of wondering what the future of the institute is going to be. Are we going to still see large language models open sourced? Is it going to go in a more AI for science direction? Um, you know, just kind of curious about the overall trajectory of AI2 and the role that it will play in the ecosystem as you see it going forward.
» Oh, gotcha. So, the first half of your question, which is, you know, what haven't we talked about that um that's relevant for AI for science? I think that probably um not that we haven't talked about it, but a summary would be that right now there's a huge amount of hype um in AI for science. Um again, I I open up my Twitter feed every morning and I just I see more things that I'm excited to read than I can read. Um at the same time, it's never been easier to get started in it. um in that um just for fun the other day I sat down and I tried to write sort of a minimal AI for science bots sort of something that generates code and runs it based on an experiment and it took me less than an hour without any coding agents at all just sort of bare Python bare sort of notebook editor to write a minimal science agent right and so anybody can
2:42:11do this you know undergrads can do this it's not that hard uh to get started um at the same time um you Twitter isn't peer-reviewed science. Um, and even peer-reviewed science, right? You you put this paper through um three very tired reviewers who have 10 other papers that they have to review. They do their best job, but reviewers are looking at papers. They're not sort of looking at code. They usually don't look at sort of detailed data analysis to make sure that you you did everything right. Um, and so we're really suffering from this evaluation problem right now where it's really easy to make a claim, but it's really hard to verify the claim. And previously, the claims were all made by people who, you know, you could they'd gone through 30 years of school and were really heavily invested in this and, you know, if you make a bad claim, your reputation kind of takes a pretty uh big hit, right? whereas the barrier to entry right now is so low and the the normal institutional things like u peer review to verify these discoveries um are sometimes not there certainly if it's you know just a hyped result on Twitter.
So I would be wary of that. I would be um I would be as a research method scientist um preaching the virtues of how science is not just about ideas and it's not about implementations. It's about doing the science right. Uh Richard Feman, a Nobel laureate physicist um who invented quantum electronamics and was also a drum player um had all these wonderful quotes and one of the quotes that he said that's one of my favorites is science is about not fooling yourself when you're the easiest one to fool. And so, you know, having wonderful research methods and really excellent confidence uh in the results that you uh that you output is is super important.
» Great. Uh tell us about the future of AI2. » So, you picked the lowest guy on the food chain to ask this question. [laughter] Um and so I'll just give you the high level which is that um you know AI2 uh is one of the most wonderful places that I've ever worked. It is just everybody there is the brightest person you've ever met and the nicest person you've ever met. They do amazing work and you just sort of wake up in the morning and you're super excited to be there and work with those folks um working on impactful problems. Um at the
2:44:41same time AI2 is a technology company and there's regularly you know like every tech company people uh come and go all the time. Um I for I do know what you're talking about that there was a few people on another team the the open language model team um that I think moved uh to another company. Um things like that happen all the time at tech companies. Uh I know that there was some speculation on Twitter um that people thought that we were no longer doing language models. I have no idea where that came from. Um AI2 is big in language models. We have a a just incredible people uh working on that team and um I think just recently had a massive uh investment from uh National Science Foundation to work in part on AI models or language models for science.
So uh it's sort of never been healthier, never more resources poured into it and um I think that there's just fantastic stuff ahead. » Cool. That's great. Thank you for the update. Um » thanks for having me. Peter Jansen, sorry we started late, but appreciate you staying with us and uh great to get your overview of all things AI for science. We will absolutely be watching this space. » Great. Thanks. » Thank you very much. » Can you hear me? » Yes, I can. » Oh my gosh, [laughter] » we're all debugged. I didn't want to test in the middle. So I was like, you know, I'll let you take the uh take the brunt of it. So I I found I found uh Peter Peter didn't want to comment too much about the AI2 thing, right? um which which I found interesting because that's obviously you know as Nathan Lambert leaves um you know the the the obvious question is you know what what happens next uh to the firm um it must be tough because they they must have like every researcher must be getting like million-doll offers right and the the biggest game right now is uh How do you retain, you know, your talent? Uh because it's not it's not easy to retain talent um right now. And um there are very few like um tasks that you could do that could justify paying a researcher a million dollars a year because if you
2:47:11have a if you're going to, you know, pay someone a million dollars a year, you need that person to produce like $10 million of output. The way VCs work is like they they say like okay does this person have a one a 10% chance of producing like hund00 million of output right and if it's a if there's a 1% chance uh if there's a you know 10% chance of $100 million of output that's $10 million like okay then you can afford to pay them a million right so um I think it's it's tough um and uh Peter didn't really want to talk about that I So, you know, I I I respect that. Nathan was Nathan Lambert was the one who was uh very prominent, I think, both on Substack and on Twitter. So, um yeah, let's see. Let's let's see what happens with that. And uh what did you think about the overall thrust of AI for science? It's obviously I mean AI tools you know Peter's Peter's particular thing because a lot of what Peter's done has been that you know these models are not that competent at science and it's quite a quite a like opposing thrust to what um open AI and deep mind have been doing not not so much uh anthropic but yeah openai and deep mind what they've been doing of saying is is is quite different I » [sighs] » Yeah, I think I have to take the bitter pill perspective on this. As always, probably it's a it's it is hard to reconcile as always the flagrant mistakes and you know failure to even complete the task as he described you the one example of returning a random number. So, I hear that for sure as a, you know, all the the scientists aren't immediately at risk, but I don't know. I kind of still feel like all these things are headed in the same direction. Um, I need to get deeper.
Apparently, um, Ben Todd from 80,000 hours, they just put out the 80,000 hours, uh, 10 year anniversary edition of the book, you know, all rewritten for the AI era. And he has an analysis, which I need to revisit on why the prediction that you shouldn't be a
2:49:41radiologist, which was made some years ago, hasn't come true. And in the end, I'm just kind of like, I think it just hasn't come true yet. I I don't really see it for all that much longer. So maybe I'll end up looking stupid. Um » but I I think the the original analysis really seems like it holds to me. And in science it's it I do think it remains to be seen whether we will get paradigm changing discoveries out of models. That's a, you know, arguably the quadrillion dollar question, I suppose. Um, pretty soon you're talking real money.
» But yeah, I don't know. A lot of the stuff that scientists do, I don't think is really all that, you know, it is a grind. A lot of it is a grind. » Yeah. the the idea that like you know I think what he expressed was an aspirational uh perspective on what scientists want to do you know they don't want to just grind out a little improvement they want to do these sort of conceptual things true for sure um at least early in their careers you know when they're um you know when they're dreaming big dreams in reality there's a ton of incremental work and I think a lot of it's probably really valuable and what little contribution I made to science was over the course of a year as an undergrad chemistry research assistant. I don't know if I've ever told you this story.
I've told it other times on the podcast, but I worked for a full year, not full-time, but you know, probably 20 hours a week as a research assistant. We were developing one reaction type » and it was already, you know, very esoteric thing that, you know, uh, hydrocarbon oxidation under mild conditions. Um, you know, in theory it had value, but it was like the kind of thing that, you know, is is already deep down a particular rabbit hole. And my one discovery was based on an observation that an experiment that we thought was going to work one way seemed to be working the other way. We thought by adding more acid to this one reaction that we would drive it forward and get more of the product out. And I just ran a, you know, little basically a parameter sweep, right? It's the same thing that that people do in ML all the time, but in my case, it was like just literally a row of vials with everything the same except for more acid in each vial going down the row. Put them all into the same hot bath. Measure them all
2:52:12at the same time stamp. And it was going the opposite way. We were getting less product for every thing that we every, you know, incremental bit of acid that we added. And so I was like, well, maybe we should try less. That was my I had one actual insight that led to new knowledge and a new way of doing things over a year. » Um that was enough to get me, you know, an authorship uh on that paper. And you know, it's not nothing. » But I just don't really believe that things that we call science are going to be beyond the reach of AIS for all that much longer. Um, and then truly elite stuff, you know, we'll see. But I do think we're going to see a paradigm shift in the way science gets done, if only at the grind it out level » pretty soon. So, um, let let me give you like a framework and see whether um, you know, it makes sense. Um a lot of what we used to do for science was we used to get kind of a hypothesis like you know uh this is going to happen because of this and then we would kind of create experiments to test that hypothesis and then after that we'd say the hypothesis is either the null hypothesis is either proven or you know you know disproven basically and what AI has done is that instead of doing that you go ahead And you collect the data first and then you kind of run the data through this analysis machine which then kind of packs its way into you know what is the finding what is the hypothesis at the end and then that is the cause. So the the causality is discovered basically rather than you know proposed and then tested. the causality is kind of discovered. And so instead humans become these kind of like you know uh meat robots like doing doing the experiments getting getting the data and then feeding it into this machine and kind of asking the machine to you know find find the kind of like you know hypothesis in in in effect. Um, which is also why it feels a little bit like illegal because [laughter] it, you know, it's that's that's why a lot of people kind of say like, you know, ML like shouldn't work, but it does work. It's not supposed to, I guess. Um, so I guess that that that's
2:54:42that, you know, tell me what you think about that that way of, you know, thinking about things, right? » Yeah, I think that's going to be huge. I'm doing an episode of the cognitive revolution coming up with Alex Rivas who's the um the head of the Chen Zuckerberg Initiative, Biohub, whatever that's officially called now. And they seem to be investing a ton in just scaling out data sets with this basic approach in mind. Yeah. » Um, and I think with interpretability techniques, that's also going to be a real way to close the loop and look at, well, what has this thing actually learned? Can we make sense of it? And can we, you know, add our own layer of conceptual refinement to it.
» We've only seen, I'd say, like a modest number of examples of that really working so far, but we have at least seen enough proof points that I'm pretty convinced that that paradigm will work. » Mh. And it seems like it's probably right now just limited by data sets. If you imagine scaling out with Zuckerberg scale resources the biodata sets that we have training models to make these predictions » and then interrogating model internals for what is driving these predictions. I think we will see especially in biology where we we're you know still in so many ways in the dark. I think we will really start to illuminate a lot of that space. Um I would expect with a phase change rate of speed. I mean it just seems like it's just the the power of the intuition. I mean I think in a way it's sort of p hacking but in a way it's like I really I really come back to this visualization of shrink wrapping reality » and the the tighter you know the bigger the data set the tighter the wrap and the better that intuition is and when that intuition really starts to work it's just something that we cannot match » I see that being a a pretty key threshold and we've crossed it a little bit in a couple domains but I think the you know a whole bunch of companies right have been started to do this kind of stuff in the last » Yeah. Yeah.
» one to two years and they're they're
2:57:12really just kind of getting started. » Yeah. What one one thing that I think a lot of a lot of people do not understand like a lot of people even in ML don't understand really is that um for biological data it is very expensive uh and I'll give you an example I have I have friends doing um cancer imunotherapy kind of models and in order to get let's say uh 20 samples of a certain kind of pancreatic cancer and you ask like let's Say you go and ask like Stanford Health um and Stanford Stanford Health will tell you okay in order to on board you have to pay 100 grand and then for each sample you're going to have to pay like five grand per sample.
So your uh if you want a 100 samples that's 600 grand and for a particular kind of cancer and it's not even they're not making a profit off it. It's just, you know, it ends up in the NIH kind of funding bucket and it goes to the research organization. It's not but, you know, because it costs money to like and they and they're not trading it. They're not like none of the researchers are like, you know, making money off selling cancer samples. It's not it's not happening, right? But in order to get all of these samples, like that's the cost that you have to go through.
There's the HIPPA and then like there's making sure that it was properly collected, storing it, refrigerating it, like it has to have like um a a chain of custody from the point it was collected to the point it's been stored for how long has it been stored. The whole se you know cycle of things that has to happen. Um, and let's say, you know, if a ML researcher comes out and says, "Look, if you give us like, you know, 100,000 samples of this particular kind of cancer, we can cure it." It's not happening, right? It's it's just not happening. Like, you're you're you're fighting over like a 100 samples here or 500 samples there. And it's it's when you see people especially when you see people like from Google who enter like from the web space and they go into bio bioml and then they're just shocked because you go from like billions of data points you know a day to like here you're going to get 200 samples this year right like that's it you're going to have to cure cancer with that right um and and that's that's the sad part right like we don't we don't have we just simply don't have enough data in the biospace for a lot of things that I I always say like look the algorithms that we're going to use to cure cancer
2:59:43already exist but the data that we need to cure that is not there yet right and and the real hope is that we get to some level of sample efficiency u that that 100 samples is enough for us
