Videos

AI Engineer World’s Fair 2025 - Reasoning + RL
234:57

AI Engineer World's Fair 2025 - Evals
239:11

Retrieval + Search
240:28

AI Engineer World’s Fair 2025 - LLM RECSYS
246:10

AI Engineer World’s Fair 2025 - Tiny Teams
217:24

GPU-Less, Trust-Less, Limit-Less: Reimagining the Confidential AI Cloud
43:40

Why the Best AI Agents Are Built Without Frameworks
27:05

Are MCPs Overhyped?
7:27

Agent Continuations for Resumable AI Workflows
14:08

The 4 Patterns of AI Native Development
14:08
Transcript
Duration: 234:57
Created: Jun 9, 2025, 09:41 PM
Video Content:
Reasoning + RL is a track at the AI Engineer World's Fair 2025. The track features four sessions, each with a title and speaker. Here's a breakdown of the sessions: **Session 1: Training Agentic Reasoners** - **Speaker:** Will Brown - **Description:** This session focuses on training agentic reasoners: models trained to reason and act over multiple steps. **Session 2: Measuring AGI: Interactive Reasoning Benchmarks** - **Speaker:** Greg Kamradt - **Description:** This session discusses measuring progress toward artificial general intelligence (AGI) through interactive reasoning benchmarks, which test a system's ability to reason in scenarios it has not prepared for. **Session 3: Post-Training: Open Models with RL for Autonomous Coding** - **Speaker:** Mikaela Leukin - **Description:** This session explores how reinforcement learning (RL) can be used to post-train open models for autonomous coding tasks, focusing on practical applications and challenges. **Session 4: Unreasonably Effective Reasoning Distillation at Scale** - **Speaker:** Ryan Martin - **Description:** This final session covers reasoning distillation at scale: how reasoning from large models can be distilled into smaller, more efficient reasoning systems.
Audio Transcript:
Thank you. A few of them that are very important. All of them have different implementation details. But in general, the idea is you have a bunch of tasks, like versions of your problem, which are essentially prompts. You have rollouts, which are just completions, potentially involving many steps of interactions, but like one sequence of stuff happening. And then you have evaluation, potentially interleaved throughout or at the end of the sequence. And what you're estimating is the advantage.
Video Content:
Reasoning + RL is a track at the AI Engineer World's Fair 2025. It featured sessions including Training Agentic Reasoners, Measuring AGI: Interactive Reasoning Benchmarks, and Post-Training: Open Models with RL for Autonomous Coding, with speakers including Will Brown, Greg Kamradt, and Ryan Martin. Sponsor logos for Microsoft Azure and AWS appear on the slides.
Audio Transcript:
The advantage here is the idea that sometimes your model will be better than others. Like these LMs are all nondeterministic. You have temperature above zero, you have different things happen in different rolls of the dice. And this forking process of saying, like, OK, this time it did better than that time, why was it different?
Video Content:
RL algorithms in one slide • get a bunch of tasks (prompts) • do a bunch of rollouts (completions) • eval your rollouts (reward functions/models) • estimate advantages • PPO: train a token-level "critic" • GRPO: many forking rollouts • DPO: ??? • maximize advantages (without changing too much)
Audio Transcript:
RL is really about saying like, okay, this is the actual thing that changed, that resulted in the reward being better, the eval being better. This is the token at which I went down the good path versus the bad path. And whether you're doing PPO or GRPO like this is the mechanism by which
Audio Transcript:
you get the signal of like, you have something that sometimes went better, sometimes went worse. Now you can kind of very surgically have the model learn to do more of the good stuff without changing too much overall. I think this is also kind of maybe a reason why DPO, I think people were hoping DPO would like really work well. In my view, DPO does not necessarily have
Audio Transcript:
this like fine-grained advantage estimate. Like it's not really clear just from like a full good completion and a full bad completion where you're really getting the signal about these complex branching processes. PPO has this, but it's also very expensive. GRPO, I think, has taken a lot of people kind of by storm in terms of being a very nice, like, middle ground where it's more computationally efficient.
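To make the advantage idea above concrete, here is a minimal Python sketch (not from the talk or any specific library; the function name and numbers are illustrative) of the GRPO-style trick: sample several rollouts for the same prompt, score each one, and treat each rollout's reward relative to its group as the advantage that gets reinforced.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: how much better each rollout did than the
    other rollouts sampled for the same prompt (mean-centered, std-scaled)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One "group": several sampled completions for the same prompt, each scored by a reward function.
rewards = [0.0, 1.0, 0.0, 0.5]   # e.g. failed, solved, failed, partially solved
advantages = group_relative_advantages(rewards)
# Positive advantage -> reinforce that rollout's tokens; negative -> push away from them.
print(advantages)
```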
Audio Transcript:
It's like simple to implement, but also it does have this kind of forking process that comes just from sampling. There's also just too many papers. So I think a lot of people just see a new paper every day and are like, do I have to read this one? And I feel that too. Like, I think it's difficult to know up front, like, which of these are going to be important, which of them are just going to be
Audio Transcript:
like noise, especially because lots of them have very sensationalist titles, like, oh, Qwen doesn't work, or, everything only works with Qwen — which is, like, kind of true, but there's also more to the story than that. And I think there's like different implementation details of like, oh, if you change the loss function like this and this experiment then it works. And I think for most people, it is best to just, like, kind of set this aside and to not get too caught up
Audio Transcript:
in the individual details of individual experiments and individual papers and kind of think more holistically about what is the process of reinforcement learning doing? What implementation details am I willing to kind of leave to other people to figure out and eventually come to me with like software that like has the knob set correctly? And which pieces are actually important for solving the problems I care about.
Audio Transcript:
And so for a lot of people, I think the things that are going to be really interesting are things that are relating to actual software, to actual problems that they want to solve in the world. And agents, I think, are kind of the instantiation of that where this makes sense. And the thing that makes an agent an agent is tools: the ability to interact with an environment, with a system. A lot of people here are very excited about MCP at the conference.
Video Content:
A speaker stands at a podium, delivering a presentation. The background is a slide with text and logos related to AI and cloud computing. The speaker gestures with their hands as they speak, emphasizing points. The slide includes the title "too many papers" alongside the AIE logo and sponsor logos such as Microsoft Azure and AWS. The speaker appears to be discussing topics related to artificial intelligence, cloud services, and academic research.
Audio Transcript:
MCP is the conference. Like, MCP is just tools. MCP is about giving your LM the ability to, like, interact with stuff, to go solve problems that involve changing files, making requests, editing code, running code. And so I think these are the papers that I get excited about because they feel like, like there's parts of the puzzle that are not fully solved yet, like what's the right way to do all of this?
Video Content:
A slide titled "too many papers" (AIE), with Microsoft Azure and AWS sponsor logos.
Audio Transcript:
Like, there's still some open questions. But I think those are getting kind of refined. We're starting to see more and more, but a lot of the code, the tools we have out in the wild, they're like, go do this. Like if you want to go play around with RL, most code bases are like very set up for like either code and math tasks or things that are quite similar to that. That's kind of my fault.
Video Content:
A speaker is presenting at an event titled "too many papers". The presentation includes slides with text and images related to AI and machine learning. The speaker discusses open-source reinforcement learning (RL) and mentions that it is stuck in a code + math Q&A land. The slides also feature graphs and diagrams illustrating the progress and challenges in RL research.
Audio Transcript:
I had a snippet go viral that was like, here's how you do RL on GSM8K, which is like a kind of easy math data set. And then I think I've seen a lot of people like stick with this as like, oh, we're going to RL on math. And like, this is also just like, math is easy to evaluate. And I think people are — writing evals is hard. There's a whole track going on
Video Content:
A speaker is presenting at an event, likely a conference or seminar, discussing open-source Reinforcement Learning (RL) research. The presentation is titled "open-source RL is stuck in code + math Q&A land." The speaker is standing behind a podium with a microphone, gesturing with his hands as he speaks. The background includes a large screen displaying slides with text and graphs related to the topic, including phrases like "AIME," "open-source RL," and "Q&A land," along with sponsor logos. The speaker appears to be explaining the challenges faced by open-source RL research, such as the lack of real-world applications and practical use cases.
Audio Transcript:
in parallel of this about like how to build a good eval. And so I think a lot of researchers gravitate towards things that look like the benchmarks that are also really easy to eval because there's like a very clear signal of like, okay, this thing is like right, this thing is wrong, good, okay, we're doing RL. But like real world tasks are messier than that. We are not going to like get great software systems
Video Content:
A speaker is presenting at an event, likely a conference or seminar, discussing the challenges faced by open-source Reinforcement Learning (RL) research. The presentation includes slides with text and graphs, highlighting issues such as code complexity and mathematical Q&A. The speaker is standing behind a podium, gesturing towards the slides, and appears to be explaining the current state of RL research and its limitations.
Audio Transcript:
just by like hill climbing on whatever question answer benchmark is popular today. What we're going to have to do is start thinking about the actual systems at hand and the challenges that emerge when we're trying to design these rewards. And so like reward hacking is like a real thing. I think this is one of the lessons that like, RL works, but also it's not always going to work.
Video Content:
Slides: "open-source RL is stuck in code + math Q&A land" and "reward hacking is a thing" (AIE, with sponsor logos).
Audio Transcript:
There are things that can go wrong. And to me, reward hacking is really a message about the difficulty of building good evals. Like, what you really want with an eval is for it to be easier for your model to do the task than to hack the eval. You want to build a reward signal that actually captures what you care about, where gaming it is like more difficult than not gaming it.
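A minimal, hypothetical sketch of that point about reward design (the answer format, thresholds, and checks are invented for illustration, not taken from the talk): pay mostly for the outcome you actually care about, and make the obvious ways of gaming the signal worth less than just doing the task.

```python
import re

def reward(completion: str, expected_answer: str) -> float:
    """Illustrative reward: score the outcome you care about, and make
    known shortcuts (reward hacks) cost more than they gain."""
    score = 0.0

    # The thing we actually care about: did the final answer match?
    match = re.search(r"ANSWER:\s*(.+)", completion)
    if match and match.group(1).strip() == expected_answer:
        score += 1.0

    # Cheap guards against known hacks:
    if completion.count("ANSWER:") > 1:   # spamming many answers hoping one matches
        score -= 0.5
    if len(completion) > 8000:            # padding to game a verbosity-loving judge
        score -= 0.25

    return score
```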
Video Content:
Slide: "reward hacking is a thing" (AIE). The speaker is discussing reward hacking in the RL sense: a model exploiting gaps in the reward signal to score well without doing the intended task.
Audio Transcript:
If you can, if the model can learn to do the task directly just by doing what you want it to do in the spirit of the task, then that is what will happen. It will follow the path of least resistance. This is like, models just want to learn, but they want to learn to do better on reward signals. And so your reward signals have to point in the direction of the thing you actually care about. Otherwise, like, models will find cheats. And I think thinking about these things
Video Content:
Slide: "reward hacking is a thing" (AIE).
Audio Transcript:
in combination kind of points a little bit towards a direction that I think is going to be very promising. And there's some very early signs that like this actually can work, which is like, when R1 came out, I was kind of like speculating, like, what's next? What are the things that are going to unlock this sort of technique being used more generally? And people talk a lot about generator-verifier gaps.
Video Content:
Beyond Math and Code: AIE
Audio Transcript:
Like, what are the differences between, like, solving a problem versus checking if you have a solution? And a lot of problems are much easier to check than solve, but this isn't like a binary thing. This is a spectrum of how difficult is it to verify a thing. But there's some kind of signs that you kind of can do evaluations on more ambiguous tasks by just breaking them down in smaller pieces and by using
Audio Transcript:
LLMs as subroutines in your evaluations, like LLM-as-judge on steroids, or maybe you want to actually train a specialized LLM that is really good at doing these fine-grained evaluations. I like using the term rubric as a conceptual general umbrella around reward models, reward functions, LLM-as-judge setups: the criteria on which you are evaluating a thing.
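As a rough sketch of the "rubric" idea (this is not the speaker's code; the client usage follows the standard OpenAI-compatible chat API, and the model name, rubric items, and prompt format are assumptions): score a completion against a list of criteria with an LLM judge and average the per-criterion judgments into a fine-grained scalar reward.

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint; assumes OPENAI_API_KEY is set

RUBRIC = [
    "The response answers the question that was actually asked.",
    "Every factual claim is supported by the provided context.",
    "The response is concise and does not repeat itself.",
]

def rubric_reward(prompt: str, completion: str, model: str = "gpt-4o-mini") -> float:
    """Average of per-criterion YES/NO judgments -> a fine-grained scalar reward."""
    scores = []
    for criterion in RUBRIC:
        judgment = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (
                    f"Task: {prompt}\n\nResponse: {completion}\n\n"
                    f"Criterion: {criterion}\nAnswer strictly YES or NO."
                ),
            }],
        )
        verdict = judgment.choices[0].message.content or ""
        scores.append(1.0 if "YES" in verdict.upper() else 0.0)
    return sum(scores) / len(scores)
```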
Audio Transcript:
There's a cool paper from DeepSeek that I found very exciting when it came out a couple months ago about like how to train reward models that generate these rubrics on the fly. There was a paper very recently that does this for creative writing and kind of found that, like, yes, you actually can train reward models that will come up with nuanced, fine-grained evaluation criteria for a task on the fly given the actual problem.
Audio Transcript:
And this gives you something that results in a very fine-grained score that allows you to actually do RL and keep getting better. And I think, like, this is an area that I'm really excited to keep watching, but also like multi-turn. Multi-turn is probably where we're headed. We want to do agentic search. We want to do tool calls, software, games, long-horizon planning, computer
Audio Transcript:
use, memory — scaling on tool calls lets you solve harder problems. And so how do we actually do this? What's the way to go about building multi-turn agentic systems that we can use RL with? And I think the conceptual pieces here are: environments are basically harnesses, rewards are basically evals, tasks are prompts, and your policy, in the RL sense,
Video Content:
A speaker is presenting at an event, likely a conference or seminar, discussing advancements in artificial intelligence (AI) and machine learning. The slide behind the speaker highlights the future of multi-turn agentic RL (Reinforcement Learning), emphasizing key areas such as agentic search, tool-call scaling, software agents, self-play in games, long-horizon planning, computer use, and chat memory. The presentation also touches on the relationship between environments, rewards, tasks, and policies, using a diagram to illustrate these concepts. The slide includes sponsor logos for Microsoft and AWS.
Audio Transcript:
hopefully should just be as simple as like an LLM API. I think the programming interface that makes sense for a lot of people is to have an API that you're writing code as if it's just a normal agent in a loop, but then this is a thing that you can use to go do RL. And so that's what I've been building over the past couple months. I maintain a repo called Verifiers.
Video Content:
A speaker is presenting at an event, likely a conference or seminar, discussing artificial intelligence (AI) and machine learning topics. The presentation includes slides with text and diagrams related to AI environments, rewards, tasks, and policies. The speaker is standing behind a podium with a microphone, gesturing as they speak. The background features sponsor logos for Microsoft and AWS. The slides include terms such as 'environments = harnesses,' 'rewards = evals,' 'tasks = prompts,' and 'policy = LLM API.' There is also a diagram showing a cycle involving 'Input,' 'Model,' 'Output,' and 'Feedback,' with arrows pointing in both directions.
Audio Transcript:
It's finally on pip, out in the world. You can just install it, but it's been a long time coming. And what it really is, is a toolkit of these pieces to make it so that building an agent that you can actually train with RL feels just like building an agent. So the interaction protocol here is quite simple. This is the entire rollout function on the left, of like what happens in the code when
Video Content:
Slides: "verifiers: what if agentic RL was easy?" and "it should feel like building an agent + hitting 'run'" (AIE, with Microsoft and AWS sponsor logos).
Audio Transcript:
you're running an agent to do RL, which is that you kind of set up some initial state stuff, have a while loop for: is it done yet? If it's not done, do a turn. And the thing you're passing here is a client object that's just an OpenAI-compatible API. And I think this is the kind of interface that you really want, if you want people to be able to go from their agent applications to something that's trainable, something they can use with
Video Content:
it should feel like building an agent + hitting 'run'
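The protocol just described, sketched in Python. This is not the actual Verifiers API — the environment object, its methods, and the message handling are hypothetical — it just shows the shape of the loop: set up state, loop until done, do each turn through an OpenAI-compatible client, and score at the end.

```python
def rollout(client, model: str, env, prompt: str, max_turns: int = 10):
    """Hypothetical multi-turn rollout: an agent loop that is also trainable with RL."""
    messages = [{"role": "user", "content": prompt}]
    state = env.reset(prompt)                      # hypothetical environment object

    for _ in range(max_turns):
        if env.is_done(state):
            break
        # One turn: any OpenAI-compatible client works here (vLLM, hosted APIs, ...).
        response = client.chat.completions.create(model=model, messages=messages)
        assistant = response.choices[0].message
        messages.append({"role": "assistant", "content": assistant.content})

        # The environment runs tool calls / updates game state and replies if needed.
        env_reply, state = env.step(state, assistant)
        if env_reply is not None:
            messages.append({"role": "user", "content": env_reply})

    reward = env.score(state, messages)            # rewards are just evals
    return messages, reward
```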
Audio Transcript:
RL. It's been a lot of fun thinking about, like, what are the abstractions, what are the pieces here? And so there's things like parsers and rubrics that I think are like nice building blocks that you sometimes want to use. You can also like not use them if you don't want to, but I've tried to make it fun and user-friendly. The other day I was like, let's train a Wordle agent. I think this was like a fun little toy problem where it's like, it's not that hard of like a game for us as humans, but like,
Video Content:
Slides: "it should feel like building an agent + hitting 'run'" and "RL from wordle feedback" (AIE).
Audio Transcript:
it's actually like kind of tricky to get your code to be this sort of thing where you have this like multi-turn interaction protocol that you actually can do learning with. But now it's like much easier. Like the code to do these things is quite simple. And the reward functions can kind of be relatively simple for this sort of setup where it's like, okay, you want to reward it for like solving the thing eventually, but also like give it more rewards for doing it in less turns,
Video Content:
A speaker is presenting under the AIE banner. The slide behind him displays the title "RL from wordle feedback" along with some text and code snippets. The speaker gestures towards the screen while explaining the topic. The background includes sponsor logos for Microsoft and AWS.
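A tiny sketch of the reward shaping just described (illustrative numbers, not the repo's actual reward functions): most of the reward for solving at all, plus a bonus that shrinks with every extra turn used.

```python
def wordle_reward(solved: bool, turns_used: int, max_turns: int = 6) -> float:
    """Reward solving eventually, with extra credit for using fewer turns."""
    if not solved:
        return 0.0
    efficiency_bonus = (max_turns - turns_used) / max_turns   # 0.0 when it takes all turns
    return 1.0 + efficiency_bonus

# solved in 3 of 6 turns -> 1.5; solved on the last turn -> 1.0; unsolved -> 0.0
```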
Audio Transcript:
and like, this is a 7B model. It works reasonably well. But one of the reasons it works, which I'll talk about in a sec, is SFT warm-up as a way of kind of lowering the barrier of entry. Like, the code as it is, is very much set up so that your environments for RL are also just synthetic data loops or evals where you can plug in Claude or DeepSeek or OpenAI and test.
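A hypothetical sketch of that workflow (every function and model name here is a placeholder, not a real API; it reuses the hypothetical rollout sketch above): use the environment as an eval with a hosted model first, keep the high-reward traces as synthetic SFT data, warm up a small model on them, and only then start RL.

```python
def warm_up_then_rl(env, tasks, frontier_client, small_model):
    # 1. Debug the eval: run a strong hosted model through the environment. No RL yet.
    traces = [rollout(frontier_client, "some-frontier-model", env, t) for t in tasks]

    # 2. Keep only high-reward traces as synthetic SFT data.
    sft_data = [msgs for msgs, reward in traces if reward > 0.8]

    # 3. SFT warm-up lowers the barrier for a small model to get nonzero reward at all.
    small_model = supervised_finetune(small_model, sft_data)   # placeholder

    # 4. Only now start RL on the same environment and the same reward.
    return rl_train(small_model, env, tasks)                   # placeholder
```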
Video Content:
A speaker is presenting at an event titled "RL from wordle feedback". The slide behind the speaker displays code snippets and a graph. The code includes functions for generating random words, processing wordle feedback, and calculating scores. The graph shows the performance of the RL algorithm over time. The speaker discusses the implementation details and the results of the experiment.
Audio Transcript:
So you don't have to do RL to debug. You can debug with an API in terms of seeing, is this a good eval? Is this a good reward? Once you're kind of comfortable with it, you can use whatever API you like that you are allowed to use and make synthetic data, do some SFT on it, and now you can start doing RL, and this like helps a lot with small models. I think there's a lot of efficiency challenges that are, like,
Video Content:
A speaker is presenting under the AIE banner. The presentation focuses on reinforcement learning (RL) from Wordle feedback. The slides display code snippets and training curves related to the setup. The background includes sponsor logos for Microsoft and AWS.
Audio Transcript:
I've been kind of hard at work trying to solve in terms of having all of your computation be utilized effectively, having everything be fully async, so you don't have to worry about batching and that your trainer and your inference can kind of go at the same time. You can be like a little bit off policy. A lot of engineering that I'm hoping, like, if you want to worry about that, great, dig into it, fork the repo, mess with things.
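A rough asyncio sketch of the "everything fully async, slightly off-policy" idea (hypothetical; the policy object, group size, and queue bound are invented for illustration, and it reuses the rollout sketch above): rollout workers keep generating with whatever slightly stale policy is being served, while the trainer consumes finished groups as they arrive instead of waiting on a global batch.

```python
import asyncio

async def rollout_worker(queue: asyncio.Queue, env, client, tasks):
    """Keeps generating groups of rollouts with the currently served (slightly stale) policy."""
    for task in tasks:
        group = [await asyncio.to_thread(rollout, client, "current-policy", env, task)
                 for _ in range(8)]              # one group of rollouts per task
        await queue.put(group)
    await queue.put(None)                        # signal: no more work

async def trainer(queue: asyncio.Queue, policy):
    """Consumes finished groups as they arrive -- no global batching barrier."""
    while (group := await queue.get()) is not None:
        policy.update(group)                     # hypothetical gradient step

async def train(env, client, tasks, policy):
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)   # bounded queue = only slightly off-policy
    await asyncio.gather(rollout_worker(queue, env, client, tasks),
                         trainer(queue, policy))
```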
Audio Transcript:
If you don't want to, you shouldn't have to. And like, the idea here is that this should become something that more people are trying out, more people are having fun with, exploring and getting a feel for it, because if it's going to be a thing we have to worry about, if this is the future of building better agent models for your applications, like, now's a good time to start.
Video Content:
A speaker is presenting a slide titled "Towards Everything Async." The presentation includes slides and a live demo on a screen behind the speaker. The slides discuss SFT (supervised fine-tuning) warm-up and small models on a few GPUs, and how this approach lowers the cost of training and experimenting on just a handful of GPUs.
Audio Transcript:
And so this stuff is set up so you can, like, on a couple of GPUs, like, do a lot of interesting research. The barrier of entry is much lower now than it used to be. I have a lot of fun doing this on like a couple of GPUs. We sell GPUs, by the way. Thanks, everybody. I don't think we have time for questions, but yeah.
Video Content:
A speaker is standing at a podium, presenting on a topic related to AI and machine learning. The background features a large screen displaying a presentation slide with the title "SFT warmup + small models = fun on just a few GPUs," along with AIE and sponsor logos for Microsoft and AWS. The speaker appears to be explaining the benefits of this approach.
Audio Transcript:
Thank you. All right. Thank you so much, Will. Next up on the stage we're going to have Greg Kamradt. He is one of the co-founders and president of the ARC Prize Foundation. ARC is fantastic.
Video Content:
A speaker is presenting at an event, likely a conference or seminar, discussing SFT (supervised fine-tuning) warm-up and its application with small models. The presentation includes slides and a live demonstration on a screen behind the speaker. The speaker explains how SFT warm-up combined with small models can be used effectively on just a few GPUs, highlighting the efficiency and benefits of this approach. The audience appears engaged, and sponsor logos for Microsoft and AWS are visible.
Audio Transcript:
They've been creating some of the highest-signal evaluations as we're on our path towards AGI. They're basically helping us measure where we're going. I'm very excited to hear. I think we might even get a sneak preview today of the next version of what they're doing. So anyway, welcome to the stage, Greg Kamradt. Thank you. Beautiful.
Video Content:
A speaker is presenting at the "Reasoning + RL" track. The track features sessions on training agentic reasoners, measuring AGI with interactive reasoning benchmarks, and reasoning distillation at scale. The speaker is discussing artificial intelligence and its applications in various fields.
Audio Transcript:
Today, we are going to talk about why AI benchmarking is about to get a lot more fun. But before we do, we need to go over some cool demos here. So I love the Claude Plays Pokemon demo. There's something really special about seeing this little embodied agent make its own decisions and go play Pokemon for us, from our childhood here.
Video Content:
A man is standing at a podium, speaking into a microphone. He is wearing a black shirt and appears to be presenting something. The background is dark with a grid pattern. On the left side of the screen, there is a logo for AIE (AI Engineer), with sponsor logos for Microsoft and AWS below it. The title of the presentation is "Measuring Artificial Intelligence: Interactive Reasoning Benchmarks." The presenter is discussing measuring artificial intelligence, specifically focusing on interactive reasoning benchmarks, which he describes as a new area of research and development in AI.
Audio Transcript:
Now, OpenAI got in on the same game. I thought this was awesome. And then just the other day, Gemini beat Pokemon. I like seeing the little robots with AGI on top of their head. That must mean that we're already there, right? Well, if we have these agents playing Pokemon and it's already doing it and beating it, that means the game's over, right? We're all done. Well, not quite, because with Claude plays Pokemon,
Video Content:
A speaker is presenting at a conference or event, standing behind a podium with a laptop in front of them. The screen behind the speaker displays slides related to AI and machine learning, including screenshots and diagrams, which the speaker uses as visual aids.
Audio Transcript:
we saw that it would get stuck in the same place for three days. It would need interventions. It would hallucinate different actions. And not only that, there was a ton of Pokemon training data that was within the model itself. So although this is a really cool example about an agent exploring a world, there are a lot of things that we can go and improve on. So as Kyle was saying, my name is Greg Kamradt, of ARC Prize.
Video Content:
A speaker is presenting at a podium during an event. The screen behind the speaker displays a slide, with sponsor logos for Microsoft and AWS.
Audio Transcript:
We are a non-profit with the mission to be a North Star guide towards open AGI. We were founded last year by Francois Chollet and Mike Knoop. And just last December, we were invited by OpenAI to join them on their live stream to co-announce the o3-preview results on ARC-AGI. Now, there's a lot of AI benchmarking companies out there,
Video Content:
A speaker is standing at a podium, addressing an audience. The background features a large screen displaying information about the ARC Prize Foundation, along with sponsor logos for Microsoft and AWS. The speaker appears to be discussing the ARC Prize Foundation and its work.
Audio Transcript:
but we take a very opinionated approach as to how we should do this. And our opinion is that the best target we should be aiming for is actually humans. And the reason why we think that is because we see that humans are the one proof point of general intelligence that we know about. And if we use humans as the target, that does two things for us, because what we do is we come up
Video Content:
A speaker is giving a presentation at an event. The background features sponsor logos for Microsoft and AWS. The speaker is standing behind a podium, gesturing with his hands as he speaks. The screen behind him displays slides with text and images related to artificial intelligence (AI) and general intelligence. The audience is visible in the foreground, attentively listening to the presentation.
Audio Transcript:
with problems that are feasible for humans but hard for AI. Now, while we do that, that does two things. Number one is it creates a gap. And when you have that gap, you can start to measure: well, how many problems can we come up with that humans can still do, but AI can't? And then number two is it guides research. So you can quantify that class of problems and then go tell researchers, hey, there's
Video Content:
THE TARGET: HUMANS (Proof of General Intelligence) — AIE, with Microsoft and AWS sponsor logos
Audio Transcript:
something really interesting going on on this side of the problem. That's something that we need to go check out from there. All right? So if we're going to measure artificial general intelligence based off of humans, we need to actually define, well, what is general intelligence? And there's two definitions that I love to quote. The first one was by John McCarthy. And he says that AI is the science and engineering of making machines do tasks,
Video Content:
THE TARGET: HUMANS (Proof of General Intelligence) — AIE, with Microsoft and AWS sponsor logos
Audio Transcript:
and this is the important part, that they have never seen beforehand, and they have not prepared for beforehand. This is very important because if you've seen a class of problem beforehand, if it's already in your training data, then you're simply just repeating memorization. You're not actually learning anything new on the fly, right? The second person I like to quote on this is actually
Video Content:
A slide accompanies the speaker's discussion of definitions of AI and general intelligence.
Audio Transcript:
Francois himself, and he put it very eloquently within just three words. And he calls intelligence "skill-acquisition efficiency." And this is really beautiful here, because skill acquisition: can you learn new things? And not only that, but how efficiently can you learn those new things? And humans are — spoiler — extremely efficient at learning these new
Video Content:
A slide accompanies the speaker's discussion of skill-acquisition efficiency as a definition of intelligence.
Audio Transcript:
things. So Francois proposed this definition in his 2019 paper, "On the Measure of Intelligence," but he went further than that. He didn't just define it. He actually proposed a benchmark to see: can a human or an AI learn something new, and then go repeat what it learned? And this is where the ARC-AGI version 1 benchmark came out. So over here on the left-hand side, this is the learn-the-skill portion.
Video Content:
A speaker presents in front of a screen displaying slides related to skill-acquisition efficiency, with AIE and sponsor logos for Microsoft and AWS.
Audio Transcript:
This is what we call the training portion. And what we show you is a transformation from an input to an output grid. And then the goal for the human or the AI is to look at it and say, hmm, what's going on here? And then on the right, we actually ask you to demonstrate that skill. So it's a little mini skill you learn on the left, and we ask you to demonstrate it on the right. And if you can successfully do it — and this is what it looks like, it's just the grid editor here —
Video Content:
A speaker is standing at a podium, presenting a slide titled "ARC-AGI: Measuring the ability to adapt to tasks you cannot prepare for." The slide features example grid tasks. The speaker discusses ARC-AGI as a benchmark designed to measure the ability to adapt to tasks you cannot prepare for, with sponsor logos for Microsoft and AWS visible.
Audio Transcript:
then yes, you've learned what the transformation is, and you've actually applied it. And so you're showing a non-zero level of generalization as you go through this. So our benchmark, ARC-AGI-2 — this is the most recent one — has over a thousand tasks in it. And the important part here is each one of these tasks is novel and unique. And what I mean by that is the skills required for one of them,
Video Content:
A speaker is presenting on a stage, standing in front of a large screen displaying slides related to AI and machine learning. The slides include example grid tasks, text about ARC-AGI, and the AIE logo with sponsor logos for Microsoft and AWS. The speaker gestures towards the screen as they explain the concepts being presented.
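For reference, a minimal sketch of how an ARC-AGI task is structured and graded (the public tasks are distributed as JSON with "train" and "test" lists of input/output grids; the `predict` function here is a placeholder — it is the hard part):

```python
import json

def grade_arc_task(task_path: str, predict) -> bool:
    """Load one ARC-AGI task and check a predicted test output by exact match."""
    with open(task_path) as f:
        task = json.load(f)

    train_pairs = task["train"]          # demonstration input/output grid pairs: learn the rule here
    test_case = task["test"][0]          # apply the learned rule here

    predicted = predict(train_pairs, test_case["input"])   # grids are lists of lists of ints
    return predicted == test_case["output"]                # exact cell-for-cell match
```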
Audio Transcript:
we will never ask you to apply that same skill to another task. This is very important because we're not testing whether or not you can just repeat the skill you've already learned; we want to test all the little mini skills that you can do over time and see if you can actually demonstrate those. And if we're going to back up that humans can actually do this, well, we need to go get first-party data. So our group, as a nonprofit, we went down to San Diego
Video Content:
A presentation is being given in front of a large screen displaying various images and text related to AI and machine learning. The presenter, dressed in a suit, stands at the bottom right corner of the screen, gesturing towards the content on the screen. The screen shows multiple frames of a conference room setup, with tables arranged in rows and chairs placed around them. The room appears to be a modern, well-lit space with large windows and a high ceiling. The presenter seems to be explaining the setup and features of the conference room, possibly discussing the integration of AI and machine learning technologies.
Audio Transcript:
and we tested over 400 people. So we rented a bunch of computers and we did this in person to preserve data privacy. And we made sure that every single task that was included in ARC-AGI was solvable by people. So this isn't just an aim here. We're actually doing the work to go and do that. But if we think about it, there's actually quite a bit of human-like intelligence
Video Content:
A video presentation is being delivered by a speaker in front of an audience in a conference room. The speaker is using a laptop and a microphone, and there are several people seated at tables in the background. The room has large windows, and the lighting is bright. The presentation includes slides with text and images, and the speaker is gesturing with his hands as he speaks.
Audio Transcript:
that's out of scope from what we call a single-turn type of benchmark. With ARC-AGI, you have all the information presented that you need right at test time. You don't need to do any exploring or anything, and it's all through single turn. So if we're going to be measuring human-like intelligence — and I would argue that if you are going to measure human-like intelligence, it needs to be interactive by design.
Video Content:
A speaker is presenting at an event, likely a conference or seminar, discussing advanced AI and machine learning technologies. The background features a large screen displaying various charts and graphs, indicating data analysis or performance metrics. The speaker is standing behind a podium with a laptop, gesturing towards the screen as they explain the content. Logos of Microsoft, AWS, and SMOJO are visible on the screen, suggesting a collaboration or sponsorship. The overall setting is professional and informative, aimed at educating or informing an audience about AI advancements.
Audio Transcript:
And what you need to have is you need to be able to test the ability of an agent, whether that be biological or artificial, to explore an open world, understand what goals it needs to do, and ultimately look at the rewards and go from there. So this is actually very in line with what Rich Sutton had just published within his paper,
Video Content:
A slide shows the paper "Welcome to the Era of Experience" by David Silver and Richard S. Sutton, with excerpts about the era of human data and learning from experience.
Audio Transcript:
Welcome to the Era of Experience. And he argues that if we want agents that will be readily adaptable to the human world, they need to engage with the open world. They need to collect observational data. And they need to be able to take that data to build a world model and make their own rules and really understand what it is, or else you're just going to have the human data ceiling going forward from here.
Video Content:
Another view of the "Welcome to the Era of Experience" slide (David Silver, Richard S. Sutton).
Audio Transcript:
If we're going to be able to build this, we're going to need a new type of benchmark that gets out of the single-turn realm. And this is where interactive reasoning benchmarks are going to come in. Now, an interactive reasoning benchmark is going to be a benchmark where you have a controlled environment, you have defined rules, and you may have sparse rewards, where an agent needs to navigate to understand what is going on
Video Content:
Another view of the "Welcome to the Era of Experience" slide (David Silver, Richard S. Sutton).
Audio Transcript:
in order to explore and complete the objective from here. Now, there's an open question as to, all right, if our aim is interactive reasoning benchmarks, what is the medium in which we're going to actually execute these benchmarks in. And it turns out that actually games are quite suitable for interactive reasoning benchmarks.
Video Content:
Slide: "Interactive Reasoning Benchmarks — Games as the Medium" (AIE, with Microsoft and AWS sponsor logos)
Audio Transcript:
The reason for this is games are a very unique intersection of complex rules and defined scope, and you have large flexibility in creating these types of environments that you can then go put different artificial systems, or biological systems, into. Now, I know what you may be asking here: wait, Greg, didn't we already do games?
Video Content:
Slide: "Interactive Reasoning Benchmarks — Games as the Medium" (AIE, with Microsoft and AWS sponsor logos)
Audio Transcript:
Didn't we do this 10 years ago? We already went through the Atari phase. Well, yes, we did, but there's actually a huge amount of issues with what was going on during that era. Not even just starting with all the dense rewards that come with the Atari games. There was a ton of irregular reporting, so everybody would report their own performance on these different scales and it was tough to compare these models. There was no hidden test set.
Video Content:
A video game screen displaying a classic arcade-style shooter game. The game features a player character at the bottom of the screen, shooting at enemies and obstacles. The screen is divided into multiple sections, each representing different aspects of the game. The top section shows the player's score and lives remaining. The middle section displays the player's ammunition count and the number of enemies defeated. The bottom section shows the player's current level and the time remaining. The game has a retro aesthetic with pixelated graphics and simple animations.
Audio Transcript:
And then one of my biggest gripes with the Atari phase was that all the developers, they already knew what the Atari games were. So they were able to inject their own developer intelligence into their models themselves, and then all of a sudden the intelligence of the performance, well, that's getting borrowed from the developer. That's not actually getting done by the model itself. So if we were able to create a benchmark that overcame these shortcomings,
Video Content:
A speaker stands in front of a large screen displaying an animated game map divided into labeled sections. The speaker appears to be explaining how the different components of the map interact. Sponsor logos for Microsoft and AWS are visible.
Audio Transcript:
well, then we'd be able to make a capabilities assertion about the model that beat it that we've never been able to make beforehand. And to put it another way that's a bit more visual: we know that AI can beat one game. This is proved. AI can beat chess, AI can beat Go. We've seen this many, many, many times here. And we know that AI can beat 50 games,
Video Content:
A slide illustrates that AI can beat one game, featuring a chess match between a human and a computer, with Microsoft and AWS sponsor logos.
Audio Transcript:
with 50 known games, with unlimited compute and unlimited training data. We've seen this happen with Agent57 and MuZero. But the assertion that we want to make is: well, what if AI beat 100 games that the system has never seen beforehand and the developer has never actually seen beforehand either?
Video Content:
A presentation slide is displayed on a screen, featuring the text "Beat 50 games" with the AIE logo. Below this, there is an illustration of two robots sitting at a desk with a computer monitor in front of them, labeled "AGENT57" and "MUZERO." The background of the slide is black, and there are Microsoft and AWS logos in the bottom right corner.
Audio Transcript:
If we were able to successfully put a test or put AI to this test, then we could make the capabilities assertion about that AI that we don't currently have in the market right now. And I'm excited to say that that's exactly what ARC is going to go build. So this is going to be our version 3 benchmark. Today is a sneak preview about what that's going to look like.
Video Content:
A presentation is being given at an AIE event. The speaker is standing in front of a large screen displaying a retro video game scene featuring a robot and a question mark. The text on the screen reads "Beat 100 games" — games that neither the system nor the developer has seen before. The speaker is discussing AI and its applications in gaming. The screen then transitions to show the text "ARC-AGI-3" along with Microsoft and AWS sponsor logos. The speaker continues to discuss the topic while the screen displays these additional texts.
Audio Transcript:
And this is going to be our first interactive reasoning benchmark that is going to come from ARC, and I want to jump into three reasons why it's very unique here. So the first one is, much like our current benchmark, we're going to have a public training and a public evaluation set. So the reason why this is important: with our public training set — call it on the order of about 40 different novel games —
Video Content:
A speaker is presenting at a conference or event, standing behind a podium with a laptop in front of them. The background features a large screen displaying slides related to artificial intelligence (AI) and machine learning. The slides include text such as "AIE," "ARC-AGI-3," "PUBLIC TRAINING," "PRIVATE EVALUATION," and "NOVEL ENVIRONMENTS," along with Microsoft and AWS sponsor logos. The speaker appears to be explaining public training sets, private evaluation, and novel environments.
Audio Transcript:
This will be where the developer and the AI can understand the interface and understand kind of what's going on here. But all performance reporting will happen on the private evaluation set. And this is very important because on this private evaluation set, there's no internet access allowed, so no data is getting out about this. The scores that come out of private evaluation set will have been done by an AI that has never seen these games
Video Content:
A speaker is presenting at a conference or seminar about artificial intelligence (AI) and machine learning. The presentation includes slides with various images and text related to AI training and evaluation. The speaker is standing behind a podium, gesturing towards the screen as they explain the content. The slides display different types of data and algorithms used in AI, along with logos for Microsoft, AWS, and SMOO. The overall theme of the presentation is focused on showcasing advancements in AI technology and its applications.
Audio Transcript:
before, and neither has the developer seen them. So we can authoritatively say that this AI has generalized to these open domains. Now, the second important point about what ARC-AGI-3 is going to have is it's going to force understanding through exploration. One of my other gripes with current game benchmarks out there is you give a lot of instruction to the actual
Video Content:
A presentation is being given on a screen displaying images and text related to artificial intelligence (AI) and machine learning. The presenter stands at the bottom left corner of the screen, gesturing towards the slides as he speaks. The slides show different stages of AI training and evaluation, including public training, private evaluation, and novel environments. The presenter discusses the importance of understanding through exploration and measuring the ability to adapt to tasks you cannot prepare for. Sponsor logos for Microsoft and AWS are visible.
Audio Transcript:
AI itself. Hey, you're in a racing game. Or hey, you're in an FPS, go control the mouse and do all these things. We're going to drop AI and humans into this world and they won't know what's going on until they start exploring. So even as I look at this screenshot, this is actually one of our first games. We call it locksmith. We give all of our games a cool little name like that. As I look at this, I don't know what's going on, right? But I start to explore
Video Content:
Understanding through exploration. Measuring the ability to adapt to tasks you cannot prepare for.
Audio Transcript:
and I start to understand, oh, there's certain things I need to pick up. There may be walls, there may be goals and objectives. I'm not sure what those goals and objectives are right when I first start, but that's the point. So not only are we're going to ask humans to explore and make up their own rules as to understand how to do the game, but we're going to require the same thing for AI as well. And that's something that we're not currently seeing
Video Content:
Understanding through exploration. Measuring the ability to adapt to tasks you cannot prepare for.
Audio Transcript:
from the reasoning models that we have from there. Now the third key point is that we're only going to require core knowledge priors only. This is something that we carry from ARC AGI 1 and 2 as well. But what this means is, you'll notice the ARC tasks, there's no language, there's no text that's being involved here. There's no symbols, and we're not asking you any trivia. So I call these other benchmarks that rely on these, sometimes we try to make the hardest problems
Video Content:
Understanding through exploration. Measuring the ability to adapt to tasks you cannot prepare for. Core Knowledge Priors (AIE): Objectness, Agentness, Basic Math, Basic Geometry.
Audio Transcript:
possible. We go hire the best people in the world, and I call them PhD-plus-plus problems, right? And that's great, but AI is already superhuman. It's way smarter than me in a lot of different domains. We take the alternative approach, which is: let's look more at the floor and look at the reliability side. Let's take anything outside of core knowledge and strip it away. So the core knowledge priors, the four of them that there are,
Video Content:
A person is standing at a podium, presenting in front of a large screen displaying a slide titled "Core Knowledge Priors." The slide lists the four priors: objectness, agentness, basic math, and basic geometry. The presenter appears to be discussing these priors in the context of the ARC-AGI benchmarks. The background also features sponsor logos for Microsoft and AWS.
Audio Transcript:
are, are basic math, and these are things that we as humans were either born with or are hardwired to acquire right immediately after birth. So basic math, meaning counting up to 10; basic geometry, so understanding different shapes and topology. And then agentness, which is understanding theory of mind, that there are other types of agents out there in the world that I know are interacting. And then the fourth one is objectness.
Video Content:
A person is standing at a podium, speaking into a microphone. The background features a large screen displaying a presentation slide titled "Core Knowledge Prerequisites". The slide lists three categories: "Engineering", "Programming", and "Geospatial Knowledge". Under each category, there are subcategories such as "Basic Math", "Basic Geometriy", "Agent Metrics", and "Object-Oriented". The slide also includes logos for Microsoft, AWS, and SMOO. The speaker appears to be explaining the prerequisites for a course or program.
Audio Transcript:
So as we create our benchmark, these are the four principles that we like to go after when we try to test the abstraction and reasoning piece. Now, I was reading the recent Dwarkesh essay, and he actually put it really well in one of his paragraphs. He was talking about one of the reasons why humans are great, and he says it's their ability to build up context, interrogate their own failures,
Video Content:
A presentation is being given on a screen displaying various slides related to AI and machine learning. The slides include topics such as Core Knowledge Prerequisites, AIE (Artificial Intelligence Engineering), Basic Math, Basic Geometry, Agent Metrics, Objectives, and Microsoft, AWS, and SMOO logos. The presenter is standing in front of a black background, holding a microphone and gesturing towards the screen. The slides also feature text discussing the importance of building context, integrating their own skills, and improving efficiency in practice.
Audio Transcript:
and pick up small improvements and efficiencies as they practice a task. We don't yet have this type of environment that can go and test this from a benchmark perspective for AI. And this is exactly what Arc AGI is going to go build. So before we wrap it up here, I want to talk about how we're going to evaluate AI because it's like, okay, cool, they go play the game. Well, what does it mean?
Video Content:
The video features a presentation by a speaker who is standing in front of a large screen displaying slides. The slides contain text discussing the capabilities of AI (Artificial Intelligence) in various contexts, such as building content, interacting with users, and improving efficiency in practice. The speaker appears to be explaining these points, likely providing examples or case studies to support their discussion. The background includes logos for Microsoft, AWS, and Smartsheet, indicating the involvement of these companies in the topic.
Audio Transcript:
How do you know if it's doing well or it's not? And we're going to bring it back to Francois' definition. So we're going to bring it back to skill acquisition efficiency. And we're going to use humans, which again is our only proof point of general intelligence, we're going to use humans as the baseline. So we're going to go and test hundreds of humans on these exact arc tasks, and we're going to measure how long does it take them,
Video Content:
The video features a presentation by a speaker at a podium, discussing the concept of Skill Acquisition Efficiency (SAE) as a proof of General Intelligence (G.I.) using the baseline of AIE. The speaker explains that SAE is not just about building content but also about interacting with others, collaborating, and improving efficiency through practice. The video includes a slide with the title "Skill Acquisition Efficiency" and logos for Microsoft, AWS, and SMOO. The speaker emphasizes the importance of adapting strategies based on feedback and adjusting them accordingly.
Audio Transcript:
how many actions does it take them to complete the game, and then we're going to get a human baseline, and we're going to be able to measure AI in the same exact way. So can the AI explore the environment, intuit about it, create its own goals and complete the objectives faster than humans? Well, if it cannot, I would go as far as assert that we do not yet have AGI. And as long as we can come up with problems
Video Content:
Skill Acquisition Efficiency: humans, our only proof point of general intelligence, as the baseline.
Audio Transcript:
that humans can still do but machines cannot, I would again assert that we do not have AGI. So we're going to be looking at skill acquisition efficiency as our main output metric here. Today, we're giving a sneak peek at what this looks like. This is World's Fair. Actually, next month in San Francisco, we're going to give a sandbox preview, so we're going to release five games. We know better than to try to wait till the end.
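To make the evaluation idea above concrete, here is a minimal sketch of how an agent's action count might be compared against a human baseline per game, roughly in the spirit of skill acquisition efficiency as described in the talk. The function names, the median baseline, and the aggregation are illustrative assumptions, not ARC Prize's actual scoring code.

```python
# Illustrative sketch only: compare an agent's action count against a human
# baseline per game, in the spirit of "skill acquisition efficiency".
from statistics import median

def efficiency_ratio(human_action_counts, agent_action_count):
    """How efficient the agent was relative to the median human.

    > 1.0 means the agent needed fewer actions than the typical human;
    < 1.0 means it needed more.
    """
    human_baseline = median(human_action_counts)
    return human_baseline / agent_action_count

def aggregate_score(per_game_results):
    """per_game_results: list of dicts with 'humans' (list of human action
    counts), 'agent' (agent action count), and 'solved' (bool)."""
    ratios = [
        efficiency_ratio(g["humans"], g["agent"]) if g["solved"] else 0.0
        for g in per_game_results
    ]
    return sum(ratios) / len(ratios)

# Toy example: the agent solves one game with fewer actions than most humans
# and fails the other.
games = [
    {"humans": [120, 95, 140], "agent": 80, "solved": True},
    {"humans": [60, 75, 90], "agent": 300, "solved": False},
]
print(aggregate_score(games))  # 0.75 for this toy example
```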
Video Content:
Skill Acquisition Efficiency: humans, our only proof point of general intelligence, as the baseline.
Audio Transcript:
We're going to make contact with reality. We're going to put out these five. We're actually going to host a mini-agent competition, too. So we want to see what is the best possible agent that people can build. We'll put up a little prize money. And then we're going to look forward to launching about 120 games. That's the goal by Q1 of 2026. Now, that sounds like it's not that many games, and you might think that's not that many data points, but the richness of each one of these games goes really, really deep.
Video Content:
A speaker is standing at a podium, presenting information on a large screen behind them. The screen displays a timeline with three stages: 'Modelled Fair,' 'Previewed,' and 'Launch.' Each stage has a corresponding date range. The speaker appears to be discussing the progression of a project or initiative, likely related to artificial intelligence (AI) and machine learning (ML). The screen also shows logos for Microsoft, AWS, and SMOJO, indicating potential partners or sponsors for the event.
Audio Transcript:
There are multiple levels. It goes deep with each one of them. And it's quite the operational challenge to make all of these. And that's a whole other side of the benchmarking process, which I'm happy to talk about later. If this mission resonates with you, again, ARC Prize, we are a nonprofit. One of the best ways to get involved is by making a direct, tax-deductible donation. If anybody in the room knows any philanthropic donors, whether it be LPs or individuals,
Video Content:
A man is standing on a stage, speaking to an audience. He is wearing a black shirt and is gesturing with his hands as he talks. The background is a dark screen with text and logos on it. The text includes "AIE", "Microsoft", "AWS", and "SMOJO". There is also a QR code on the screen. The man appears to be giving a presentation or speech.
Audio Transcript:
I'd absolutely love to talk to them. But then also, we're looking for adversarial testers. We want to pressure-test ARC-AGI-3 as best as we can. So if there's anybody who's interested in participating in the agent competition, whether it's online or offline, let me know. Happy to chat. And then also, kind of a cool story: we originally started with Unity to try to make these games,
Video Content:
Get Involved: ARC Prize Foundation (slide listing ways to get involved with the project).
Audio Transcript:
and we quickly found out that Unity was way overkill for what we needed to do if you're just doing 2D, 64-by-64 games here. So we're actually making a very lightweight Python engine ourselves. So if there are any game developers out there, anybody who wants to get involved with this and knows Python well, we're looking for game developers and game designers as well. That is, do we have
Video Content:
Get Involved with ARC Prize: a worthy North Star for AGI. ARC Prize Foundation.
Audio Transcript:
Yeah, I think we have time in this case for a couple of questions. If anyone wants to come up, there's microphones, one, two, three of them, maybe a couple of questions. I'm going to kick that off. Sure, yeah, yeah. Yes, yes. Question for you. So, I don't know, I can repeat. Okay. All right. It's very hard to make estimates about timelines famously, but if you had to guess, how long do you think this new version of the benchmark you're making will take before it gets saturated?
Video Content:
A man is standing on a stage, speaking into a microphone. He is wearing a black jacket and white pants. The background is black with a grid pattern. There is text on the screen that reads "A WORTH STAR FOR AGE" and "ARC PRIZE FOUNDATION." There are logos for Microsoft, AWS, and SMOJO at the bottom of the screen.
Audio Transcript:
Well, the way I think about that is, I would say I'm counting in years, not decades. We'll put it that way. Okay. Interesting. All right. Yeah, we'll take one at each mic. Looks like it's well distributed, so starting over here. Sure. Hi, you mentioned efficiency as part
Video Content:
A man is standing on a stage, wearing a black hoodie and white pants. He is speaking into a microphone. The background is black with a grid pattern. There are logos for Microsoft, AWS, and SMOJO at the bottom of the screen. The text "A WORTH STAR FOR AGE" and "ARC PRIZE FOUNDATION" is displayed prominently in the center.
Audio Transcript:
of the requirements, and so I'm wondering, for the benchmarks, if you're considering things like wattage or time or other ways of using that as one of the criteria. Yeah, I love that question, and I would have put it in if I had more time, but I'm very opinionated about efficiency for measuring AI systems. If I could have two denominators for intelligence on the output, number one would be energy,
Video Content:
A man stands at a podium on a stage, speaking into a microphone. The background is dark with a grid pattern, and there is text on the screen above him. The text reads "A WORTHY STAR FOR AGE: ARC PRIZE FOUNDATION" along with logos for Microsoft, AWS, and SMOJO. The man appears to be giving a speech or presentation.
Audio Transcript:
because you know how much energy the human brain takes, and that's our proof point of general intelligence. So you can take how many calories the human brain takes. So I'd love to do energy. But the number two denominator is the amount of training data that you need. Neither of which is very accessible for closed models today. So we use proxies, and the proxy is cost. But then with interactive evals like this, you get another proxy, which is action count,
Video Content:
A man is standing on a stage, holding a laptop. He is wearing a black jacket and white pants. The background is black with a grid pattern. There is text on the screen that reads "A WORTH STAR FOR A GE" and "ARC PRIZE FOUNDATION." There are logos for Microsoft, AWS, and SMOJO at the bottom of the screen.
Audio Transcript:
and how long does it take you to actually do it? We're not going to have a wall clock within these games. It's going to be turn-based, so we won't have a clock to deal with. Awesome. All right. Question two, and then we'll do three. Please keep them both very short. Yeah, very quick question. Could you define more what you mean by objectness? Yes. That one's actually quite simple. It's just understanding that when you look out into the world, there are masses of things that may act together.
Video Content:
A man stands at a podium, presenting in front of a large screen displaying the text "A WORTH STAR FOR AI: ARC PRIZE FOUNDATION". The screen also features logos for Microsoft, AWS, and SMOJO. The man gestures with his hands as he speaks, likely discussing topics related to artificial intelligence and the ARC Prize Foundation.
Audio Transcript:
So the crude way to put it would be: you have one pixel, but it's surrounded by a whole bunch of other pixels and they all move together. You understand all those pixels as one, acting as one body rather than as individuals. And really, evolution-wise, that's the same: that's a tree over there, all of this is part of the same tree, that kind of thing. Final question, I'll keep this one super short.
Video Content:
A man stands at a podium, delivering a speech. The background is dark with a grid pattern, and there is text on the screen reading "A WORTH STAR FOR A GE" and "ARC PRIZE FOUNDATION." Logos for Microsoft, AWS, and SMOJO are displayed at the bottom of the screen.
Audio Transcript:
How do you distinguish between tasks that in the games that you guys are developing, how do you distinguish between tasks that humans cannot do and an AGI also cannot do? Like what is the North Star there? It's a good question. Tasks that humans cannot do are a bit out of scope for our thesis on how we want to drive towards AGI.
Video Content:
A man is standing on a stage, giving a presentation. He is wearing a black shirt and white pants. The background is black with a grid pattern. There is text on the screen that reads "A WORTH STAR FOR A GE" and "ARC PRIZE FOUNDATION." There are logos for Microsoft, AWS, and SMOJO at the bottom of the screen.
Audio Transcript:
So I would say that's not really the aim that we're looking for there. That's a whole different conversation around superintelligence; that's for another time. Thank you. All right. Thank you everyone, and let's hear it for Greg. Awesome. Okay, so our next speaker, it is my pleasure to bring to the stage Aakanksha Chowdhery. So while she gets up and is getting plugged in, I'll give a bit of an introduction.
Video Content:
A man stands at a podium, speaking into a microphone. The background is black with white text that reads "A WORTH STAR FOR AI26 AIE PRIZE FOUNDATION". There are logos for Microsoft, AWS, and SMOJO at the bottom of the screen.
Audio Transcript:
So she is working at Reflection AI. She's going to talk to us about using reinforcement learning for autonomous coding. This is a huge area; this is what all the big labs are doing. I'm very excited to hear more about how that works and how she's getting it to work. This is the perfect person to talk to us about this. Aakanksha was the first author on the PaLM paper, which was an absolute breakthrough
Video Content:
Reasoning + RL track agenda: Training Agentic Reasoners; Measuring AGI: Interactive Reasoning Benchmarks; Post-Training Open Models with RL for Autonomous Coding; Effective Reasoning Distillation at Scale.
Audio Transcript:
LLM produced by Google a few years ago. She was a lead researcher on Gemini early on, and now, of course, she's at Reflection AI, where she spends all her time thinking about this. So welcome to the stage. If the presentation works, that's the AGI-complete problem, right? Thank you. Okay. I'm just trying to mirror it, and that's the excitement here. Okay.
Video Content:
A video presentation titled "Reasoning + RL" is being delivered by Aakanksha Chowdhery. The presentation covers topics related to AI and machine learning, including training agentic reasoners, interactive reasoning benchmarks, and effective reasoning distillation at scale. The speaker discusses the use of reinforcement learning (RL) for autonomous coding and provides insights into the challenges and opportunities in this field.
Audio Transcript:
You can see it, but I can't see it, so that's what I'm fixing right now. Let me solve it one more time. I'm just getting that ready. Good. Good.
Video Content:
A presentation slide titled "RL for Autonomous Coding" by Akankshsa Chowdhery is displayed on a screen. The slide includes logos for AIE, Microsoft, AWS, and SMOB. The speaker, wearing a black shirt, stands at a podium with a laptop in front of them. The background is dark, and the speaker appears to be giving a presentation.
Audio Transcript:
You're good to go? Yes. Awesome. Okay. Hi everyone. I'm Aakanksha. I was at Google for more than six years, where I led the research for PaLM and was a lead researcher on Gemini. These days I'm working on pushing the frontier of autonomous coding with reinforcement learning.
Video Content:
A presentation slide titled "RL for Autonomous Coding" by Akankshita Chowdhury is displayed on a screen. The slide features logos of Microsoft, AWS, and SMOJ at the bottom. The background shows a serene forest scene with tall trees and a clear sky.
Audio Transcript:
So just to recap the arc of how we have progressed in large language models and why autonomous coding and why now. So I think everyone here or those of you who don't remember in 2020 there was this breakthrough paper that came out,
Video Content:
Recapping the era of LLMs... (Pre-Training) Scaling Laws for Large Language Models (LLMs) There is a power law relationship between LLMs' test loss and: • Amount of data used • Number of parameters used in the model Source: Luong et al. (2023)
Audio Transcript:
which talked about scaling laws for large language models. And if you were to take a 30-second recap, the main thing it said was that there's a power-law relationship between the test loss of large language models and the scale of training. So if you use more compute, more data, and put more parameters in your machine learning model, which is a transformer model, you will get a more
Video Content:
There is a power law relationship between LLMs' test loss and the amount of data used. The number of parameters used in the model also affects the test loss. The test loss decreases as the amount of data increases, but the rate of decrease slows down. The number of parameters used in the model also affects the test loss, but the effect is not as strong as the effect of the amount of data.
Audio Transcript:
performant model, and it will not be performant just in the domain in which you are training the model. It will actually be performant and it will generalize to many other domains. And the generalization was pretty much a feature in this particular case. So as the large language models got bigger, we saw continuous improvement across benchmarks
Video Content:
There is a power relationship between LLMs test loss and the amount of examples used. The number of parameters in the model also affects the test loss. This relationship is shown in the graph.
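As a rough illustration of the power-law relationship described above, here is a small sketch using the commonly cited parametric form L(N, D) = E + A / N^alpha + B / D^beta. The constants below are invented for illustration and are not fitted values from any paper.

```python
# A minimal sketch of a scaling-law fit: test loss falls as a power law in
# parameter count N and dataset size D, with diminishing returns.
# Constants are illustrative placeholders, not published fits.

def predicted_loss(n_params, n_tokens,
                   E=1.7, A=4.0e2, alpha=0.34, B=4.1e3, beta=0.28):
    return E + A / (n_params ** alpha) + B / (n_tokens ** beta)

for n in (1e9, 1e10, 1e11):  # 1B, 10B, 100B parameters, fixed token budget
    print(f"{n:.0e} params -> predicted loss ~ {predicted_loss(n, 1e12):.3f}")
# Loss keeps decreasing with scale, but each 10x buys less and less.
```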
Audio Transcript:
to the point that they're starting to get saturated now. And the other interesting thing was that we saw emergent behavior, where capabilities were emerging in large language models that were not present in smaller models. And this is a classic slide that I show for the work that we did in Palm. So typically when you go about trying to solve math problems
Video Content:
Why did LLMs get bigger? Continuous improvement with scaling. Emergent behavior: capabilities that only emerge in larger models.
Audio Transcript:
and you give the model some examples. On the left, you have a math problem around tennis balls, and then you give a second problem. The model output looks wrong, but what Palm and the subsequent set of papers showed was that if you ask the model to output its reasoning chains, which has become a very common concept now, but this is,
Video Content:
Emergent Behavior: Chains of Thought. A slide comparing standard prompting with chain-of-thought prompting on a worked example.
Audio Transcript:
remember, this was 2021, so four years ago. If you ask the model to show its reasoning chains, then the answer actually comes out correct. So basically, by getting the model to output a chain of thought, or reasoning chains, the model's performance improves. And this capability particularly emerged in large language models.
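A minimal sketch of the difference between standard prompting and chain-of-thought prompting described here. The prompts are what you would send to whatever completion API you use; the worked example and the question are invented for illustration.

```python
# Toy illustration of standard vs. chain-of-thought prompting (prompt strings
# only; send each one to your model of choice).

question = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Standard prompting: ask for the answer directly.
standard_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompting: include a worked example whose answer shows its
# reasoning (or simply append "Let's think step by step"), so the model emits
# a reasoning chain before the final answer.
cot_prompt = (
    "Q: A juggler has 16 balls. Half of them are golf balls. "
    "How many golf balls are there?\n"
    "A: There are 16 balls. Half of 16 is 8. So there are 8 golf balls. "
    "The answer is 8.\n"
    f"Q: {question}\n"
    "A: Let's think step by step."
)

print(cot_prompt)
```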
Audio Transcript:
These are all the models. So LaMDA and PaLM were state-of-the-art models about three years ago, and what I'm showing on the x-axis is the increasing number of parameters. PaLM was scaled all the way up to 540 billion parameters. No one actually publishes the number of parameters these days, so you have to live with the graphs from three years ago or the open source stuff that's coming out with the DeepSeek and Qwen models.
Video Content:
Chain of thought emerges in large models: solve rate on middle-school math word problems vs. model scale (billions of parameters), comparing standard prompting and chain-of-thought prompting for LaMDA, GPT, and PaLM.
Audio Transcript:
But what the y-axis is showing is that the solve rate on middle-school math word problems was increasing with the number of parameters in the models, and it was mainly increasing when you were prompting the models and asking them to show a chain of thought. And this led to all kinds of prompting techniques where you ask the model to think step by step.
Video Content:
Chain of thought emerges in large models (the same chain-of-thought scaling chart as the previous slide).
Audio Transcript:
You even go and bribe the model and such, and you ask the model nicely or not. So this was all kinds of fun stuff, and I think the thing that really stood out from this generation of models three years ago was that this capability was not just limited to math problems. It was basically generalizing across a whole bunch of domains, anywhere from question answering in other languages
Video Content:
Emergent behavior: ability to solve entirely new tasks (chain-of-thought scaling chart across tasks).
Audio Transcript:
to puzzle problems to multitask natural language understanding problems. And what this led to next was that now that these models could reason, we could get them to follow instructions. So the first set of applications that became possible with these large language models were chatbot applications.
Video Content:
RL with human preference data enables chat applications: human raters compare pairs of model outputs based on certain criteria, and reinforcement learning optimizes the model toward the preferred responses.
Audio Transcript:
So everyone remembers that ChatGPT, and now Gemini and various other chatbots, have become extremely popular. All of us use them all the time, but what made them really possible was that when you give instructions to the model to go do something, it's actually able to do it. And the way it learns that is based on reinforcement learning. And the reinforcement learning data that we're giving to the model in this particular case is essentially data based on human feedback.
Video Content:
A speaker is presenting at a conference or seminar, standing behind a podium with a microphone. The background includes a large screen displaying slides related to AI and machine learning. The slides contain text about using human preferences data to enable chat applications, along with graphs and diagrams illustrating the process. The presenter gestures towards the screen as they speak, emphasizing key points. The setting appears to be a professional environment, likely a conference room or auditorium.
Audio Transcript:
So you're basically saying: okay, here is a set of questions, and if I were to give one to a human along with two answers, which one would the human prefer? And if you have enough of this data and you train your model, you would actually end up with better performance, because you taught the model which set of responses to prefer.
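A minimal sketch of the pairwise-preference objective commonly used to train a reward model from this kind of human feedback data. The Bradley-Terry style loss below is a standard choice, but the tensors and scores are toy values, not anything from the talk.

```python
# Sketch of the pairwise preference loss for reward-model training:
# push the score of the human-preferred response above the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy batch of three comparisons with made-up scalar scores.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.1, 0.5, -1.0])
print(preference_loss(r_chosen, r_rejected))  # lower when chosen > rejected
```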
Video Content:
A speaker is presenting at a conference or seminar, standing behind a podium with a laptop in front of them. The screen behind the speaker displays a slide titled "RL with Human Preferences data enables Chat Applications." The slide includes bullet points about using RL algorithms with human preferences data to improve chat applications, such as reducing the number of interactions needed to achieve a goal. There is also a graph showing the relationship between the number of interactions and the time taken to achieve a goal. The slide also mentions Microsoft, AWS, and SMOOT as companies involved in this technology. The speaker appears to be explaining the benefits and potential applications of this technology.
Audio Transcript:
And this actually doesn't only work in chat applications. It also works on code. So on the bottom right, I'm showing that even if you were to do this for applications in code, you start to see some performance improvements. Now, of course, last year there was a whole bunch of debate as to whether we are hitting a wall in terms of the performance
Video Content:
A speaker is presenting at a podium in front of an audience. The presentation slide on the screen behind the speaker reads: "RL with Human Preferences data enables Chat Applications." It includes bullet points about approximating human preferences with reinforcement learning (RL) and how humans are selected to trade off the outputs of RL algorithms. There is also a graph showing the relationship between policy diversity and reward accuracy. The slide also mentions Microsoft, AWS, and SMOOT as sponsors or partners.
Audio Transcript:
of large language models, and whether pre-training is not giving any gains; all of these questions were on the horizon. So what is next? And one of the key things to remember in all of this is that when you go and pre-train these models, you end up spending a lot of money on training them. It could be tens of millions of dollars.
Video Content:
AI training and inference are both costly: training a large LLM can run to tens of millions of dollars, while individual inference calls are comparatively cheap.
Audio Transcript:
And when you do inference on the models, it's extremely cheap. These numbers are not endorsed by any of the companies I worked at, but these are public numbers from public sources. So going back to the main point that I want to make here is that training is extremely costly. So if you constantly try to scale up the model size, you end up in this regime of like,
Video Content:
AI training and inference: for large LLMs, training costs dominate; inference consumes resources per request but is far cheaper.
Audio Transcript:
if it's not giving performance gains, then can we get performance gains at inference time because inference calls are so cheap? And a key idea that was extremely useful here was that if you could get the models to generate multiple responses and then do majority voting.
Video Content:
AI training and inference: for large LLMs, training costs dominate compared to inference, which is incurred per request.
Audio Transcript:
So in the example here, I'm showing that, even if the prompt doesn't make much sense, you've given a mathematical problem to a large language model and you're asking it to generate three answers independently, and then you do some voting on top of those answers. And if two answers match, then that's a majority vote.
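A minimal sketch of the majority-voting idea just described. `sample_answer` is a hypothetical callable that queries a model once at nonzero temperature and returns only the final answer string; the canned sampler below is just for demonstration.

```python
# Sketch of self-consistency / majority voting over k independent samples.
from collections import Counter
import random

def majority_vote(prompt: str, sample_answer, k: int = 16) -> str:
    """Sample k independent answers and return the most common one."""
    answers = [sample_answer(prompt) for _ in range(k)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Toy usage with a fake sampler that "agrees" two times out of three.
fake_sampler = lambda _prompt: random.choice(["42", "42", "41"])
print(majority_vote("What is 6 * 7?", fake_sampler, k=9))  # usually "42"
```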
Video Content:
Scaling compute at inference time. Can the model improve consistently as it spends more time thinking? Generate many samples with majority voting; sequentially revise the previous response.
Audio Transcript:
Or, like, if in this room I were to ask a question and all of you said yes, then that is a majority vote. So similarly, in large language models, if you can get the model to generate many, many samples and then consistently get many of those answers to agree. This notion of majority voting, or self-consistency, had shown gains. So this kind of scaling compute at inference time
Video Content:
Scaling compute at inference time Can the model improve consistently as it spends more time thinking? Generate many samples - majority voting Gradually revise the previous response
Audio Transcript:
was clearly one avenue to go push on. Another avenue that emerged and showed substantial value was that you could sequentially revise your previous response. As humans, oftentimes we write a first answer and then we go evaluate it, and we're like, oh, there's some mistake here, it doesn't quite match, and then you go fix it.
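A minimal sketch of the sequential-revision idea applied to an LLM: generate, critique, and regenerate, optionally stopping early if a verifier passes. `generate` and `verify` are hypothetical callables, not a specific API.

```python
# Sketch of a sequential revision loop: the model repeatedly checks and
# rewrites its own previous answer.

def revise_loop(question: str, generate, verify=None, max_rounds: int = 3) -> str:
    answer = generate(f"Question: {question}\nAnswer:")
    for _ in range(max_rounds):
        if verify is not None and verify(question, answer):
            break  # a checkable answer passed, stop revising
        answer = generate(
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            "Check the previous answer for mistakes and write a corrected answer:"
        )
    return answer
```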
Video Content:
Scaling compute at inference time Can the model improve consistently as it spends more time thinking? • Generate many samples - majority voting • Sequentially revise the previous response
Audio Transcript:
So basically, can we get LLMs to do the same kind of revision, looking at the previous set of revisions? So that was the second avenue: having longer chains of thought and getting the model to improve consistently at inference time based on that. And these kinds of techniques, where you can verify the correct answer, so in math, or in programming where you have unit tests, showed
Video Content:
Scaling compute at inference time. Can the model improve consistently as it spends more time thinking? Generate many samples with majority voting; sequentially revise the previous response.
Audio Transcript:
very clear gains. So what I'm showing you here is an example from one of my colleagues' work at Stanford, which is a publicly published paper. On the y-axis we have pass@k, or coverage score, and on the x-axis we have the number of samples. So as you increase the number of samples
Video Content:
Scaling inference works for coding benchmarks like SWE-Bench Verified.
Audio Transcript:
on the x-axis, your accuracy improves with an open-source DeepSeek model just by taking more samples. So you're getting a very high score on SWE-Bench Verified compared even to the state of the art back at the end of 2024. Of course, now all of these scores have pushed up, and we are roughly somewhere around 80% already. But what we want to take away here is the fact that these lines of work,
Video Content:
Scaling inference works for coding benchmarks like SWE-Bench Verified.
Audio Transcript:
they showed that inference time compute predictably gives us gains, especially in domains where we can verify. If we know how to verify the answers, then we actually know how to translate that into intelligence. And going back to my talk title, coding is one of those domains where we do have the capability to verify.
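The coverage curve discussed above is usually reported as pass@k. Here is a minimal sketch using the standard unbiased estimator popularized by the Codex/HumanEval paper, given n samples per problem of which c are correct; the example numbers are invented.

```python
# Sketch of the pass@k ("coverage") metric: the probability that at least one
# of k sampled solutions is correct, estimated without bias from n samples
# per problem of which c passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # cannot draw k samples that all fail
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples drawn for one problem, 5 of them correct.
for k in (1, 10, 100):
    print(k, round(pass_at_k(100, 5, k), 3))
# pass@k climbs toward 1.0 as k grows, which is the scaling effect in the plot.
```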
Video Content:
Inference-time compute predictably translates into gains. This results in a shift in compute allocations, especially in domains with automated verifiers; the alternative paradigm could be more efficient.
Audio Transcript:
And that gives us a tremendous advantage in terms of building superintelligence on top of autonomous coding. Of course, now you ask the question of what automated verification means here. So for inference-time scaling to work, you basically need some way to say that this output
Video Content:
Inference-time compute predictably translates into gains, which shifts compute allocation, especially for teams with automated verifiers; the alternative paradigm could be more efficient.
Audio Transcript:
is correct. Now, in math, this is a very simple example: if you were given the input to solve this mathematical equation, and you were to do the same calculation on a calculator, you could actually verify that that solution is correct. And similarly in math you have formal proofs,
Video Content:
Inference-time scaling requires robust verification: a toy example where the LLM output "5 x 6 + 7 = 37" is checked by a verifier and marked correct.
Audio Transcript:
so you can actually verify whether things are correct. In coding, you have unit tests. In compilers, you can actually generate the code and then use the PyTorch compiler as a verifier. And in fact, in domains where you don't have this kind of verification, there is a large gap: if you were to generate a lot of solutions and then do majority voting, you actually don't get as many gains.
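A minimal sketch of the two kinds of automated verifiers mentioned here: checking a math answer against an exact computation, and checking generated code by executing it against unit tests. Illustrative only; a real verifier would sandbox the execution.

```python
# Sketch of simple automated verifiers for math answers and generated code.

def verify_math(llm_answer: str, expected: float) -> bool:
    try:
        return abs(float(llm_answer.strip()) - expected) < 1e-9
    except ValueError:
        return False

def verify_code(generated_source: str, test_source: str) -> bool:
    """Run the model's code plus unit tests in a shared namespace."""
    namespace = {}
    try:
        exec(generated_source, namespace)   # define the candidate function
        exec(test_source, namespace)        # asserts raise if behavior is wrong
        return True
    except Exception:
        return False

# 5 * 6 + 7 = 37, echoing the toy example on the slide.
print(verify_math("37", 5 * 6 + 7))
print(verify_code("def add(a, b):\n    return a + b",
                  "assert add(2, 3) == 5"))
```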
Video Content:
In domains with automated verification (formal proofs for math, unit tests for coding), repeated sampling directly leads to gains. For domains without verification, there is a large gap between an oracle verifier and methods such as majority voting.
Audio Transcript:
So what this roughly meant was: okay, inference-time scaling would work in scenarios where I have automated verification, but that doesn't quite solve the problem of having real-world impact. And the reason for that is shown in this graph: typically, if you do majority voting, and this is across multiple
Video Content:
A speaker is presenting at a podium during an AI event. The background includes slides with text discussing inference time requirements, robust verification, and the challenges of majority voting in verification across different domains. The slides show graphs comparing different methods and highlight the importance of formal proofs and unit tests for coding. The presentation also mentions the gap between formal verification and model-based methods like neural networks.
Audio Transcript:
different models on GSM8K, which is middle-school math problems, and another math benchmark, if you were to sort by correct fraction, you would have to sample a lot; the correct generations can be very rare. Who has time to sample 10,000 times to get a correct solution? You would be sitting there waiting, just searching for the correct solution,
Video Content:
But the correct generations could be very rare, so majority voting doesn't work across the board.
Audio Transcript:
unless you can actually figure out where the correct generation is. So basically, scaling inference-time compute with just majority voting or longer reasoning chains is great in the sense that there is some correct solution somewhere in there, but it doesn't work well across the board. So what will get these models to learn to generate correctly during training?
Video Content:
But the Correct Generations Could Be Very Rare So, it makes sense that majority voting doesn't work across the board? Scaling inference-time compute with majority voting or longer reasoning chains doesn't work well across the board! Doesn't reinforcement learning teach the model to generate correctly with training?
Audio Transcript:
Well, in the chatbot application scenario, we saw that RL with human feedback did work. So can we apply the same principle here and get the model to generate correctly in domains where we can automatically verify the outputs? Our belief at Reflection is that the next frontier for scaling is reinforcement learning,
Video Content:
Scaling inference-time compute with majority voting or longer reasoning chains doesn't work well across the board. Can reinforcement learning teach the model to generate correctly during training? A new frontier for scaling is RL.
Audio Transcript:
and we already have proof points from some of the frontier labs as well. And as David Silver and Sutton published recently, they agree, or rather they are pioneers in reinforcement learning, and they say that we are entering the era of experience. Starting from AlphaGo and AlphaZero,
Video Content:
What's next? A new frontier for scaling is RL: reinforcement learning as the path to general superintelligence ("Welcome to the Era of Experience," Silver and Sutton).
Audio Transcript:
where you had an era of simulation. And the next era of large language models was where you scaled up with RL using human data. But the next era, starting this year, is really the era of experience, which will lead us to superintelligence. So reinforcement learning will be a fundamental component in building superintelligent systems,
Video Content:
A speaker stands at a podium, presenting on a topic related to reinforcement learning and its path towards general superintelligence. The presentation includes a slide titled "Reinforcement Learning - the path to general superintelligence." The slide features a graph with two axes: "Start of Experiment" and "End of Experiment." The x-axis is labeled "Time" and ranges from 0 to 100, while the y-axis is labeled "Level of Intelligence." The graph shows two paths: one labeled "End-to-End" and another labeled "Deep Reinforcement Learning." The "End-to-End" path starts at the origin (0,0) and ends at (100,100), indicating a linear increase in intelligence over time. The "Deep Reinforcement Learning" path also starts at the origin but then deviates, showing a more complex trajectory with fluctuations before reaching a higher level of intelligence at the end. The presenter discusses the differences between these two approaches, highlighting the potential benefits and challenges of each method. The presentation is part of an event called "The Fire of Experience," which took place in June 2027, organized by AIE, Microsoft, AWS, and SMOFT. In the bottom left corner of the screen, there is a small image of a candle burning, adding a visual element to the presentation.
Audio Transcript:
especially in areas where we have automated verification. And some proof point for why this makes sense is that in math, over several papers, and these are results from o1, we have already seen examples that if you give the model, on the right side, test-time compute on the y-axis,
Video Content:
A speaker is presenting at an event titled "AI Expo Europe 2021." The presentation focuses on Reinforcement Learning (RL) as a path towards general superintelligence. The slide shows a graph comparing the performance of RL agents in different domains over time, highlighting significant gains in performance as the number of training steps increases. The speaker discusses how RL has been applied to various domains, such as computer vision, natural language processing, and robotics, and how these applications have led to improvements in performance metrics like accuracy and efficiency.
Audio Transcript:
test-time compute is the same as inference-time scaling, and you measure accuracy on the x-axis, it should go up. But you can repeat this process with reinforcement learning, and then the training-time compute going up on the x-axis also improves the accuracy on the y-axis for a challenging benchmark in math called AIME. Most of these benchmarks saturate within a year,
Video Content:
A person is standing at a podium, speaking into a microphone. The background features a large screen displaying two graphs and some text. The graphs show the performance of different algorithms on various datasets, with one graph showing a positive correlation between dataset size and algorithm performance, while the other shows a negative correlation. The text on the screen mentions significant gains achieved by scaling up RL (reinforcement learning) in MATH domains. The logos of Microsoft, AWS, and SMOOT are visible at the bottom of the screen.
Audio Transcript:
as you probably have learned by now. So this benchmark is already saturated. So now that I've hopefully convinced you that reinforcement learning and scaling reinforcement learning is the next frontier, you'd be like, okay, so why are, why is not everyone doing it? What's so challenging about it?
Video Content:
A presentation slide is displayed on a screen, featuring a graph titled "Scaling up RL in MATH domains shows significant gains...". The graph plots the number of episodes against the average reward per episode, showing a positive correlation between the two variables. The slide also includes logos for Microsoft, AWS, and SMOOT. The presenter stands at a podium, gesturing towards the screen as they discuss the topic.
Audio Transcript:
So, as I have built large language models before, a big part of building these systems ends up being that the machine learning plus systems stack for them is very challenging. So here is an example of why scaling up reinforcement learning is challenging. If you are trying to do reinforcement learning with PPO, which is one of the algorithms used for RL with human feedback,
Video Content:
What makes scaling RL challenging? The ML system for RLHF with PPO is challenging: it requires keeping four copies of the model.
Audio Transcript:
before the field moved to DPO, you have to keep four copies of different models. So if you imagine a really large model, and then you have to keep four copies and arrange them somewhere on GPUs in your large cluster, you can have some fun figuring out the exact layout. It's a fun and interesting problem, but it's a hard problem in the sense that making maximum utilization of these systems and arranging them in the right way,
Video Content:
What makes scaling RL challenging? The ML system for RLHF with PPO is challenging (diagram of the multiple model copies involved).
Audio Transcript:
just building that system is extremely hard. And DeepSeek actually showed, with DeepSeekMath, that GRPO gets rid of the value model and only keeps three copies of the model, but that's still a very challenging problem. So scaling up RL is even more challenging than scaling up LLMs, because you have multiple copies of the model, and you have a training loop and an inference loop.
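A minimal sketch of the group-relative advantage idea behind GRPO as described in the DeepSeekMath work: instead of a learned value model, the baseline is the mean reward of a group of rollouts for the same prompt. This omits the clipped policy-gradient loss and the KL penalty against a reference model that a full GRPO step also uses.

```python
# Sketch of GRPO-style group-relative advantages: normalize each rollout's
# reward against the mean and std of its own prompt's group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8):
    """rewards: shape (num_prompts, group_size), one scalar reward per rollout."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four rollouts each; a verifiable reward of 1.0 means the
# rollout passed verification, 0.0 means it failed.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
# Rollouts that beat their group's average get a positive advantage.
```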
Video Content:
What makes scaling RL challenging? The ML system for RLHF with GRPO still has a challenging memory footprint.
Audio Transcript:
And then on the reinforcement learning side, you also suffer a lot from reward hacking if the model deciding whether this is the correct answer is a neural reward model. But, as we discussed before, in autonomous coding applications you do have the ability to verify your output,
Video Content:
What makes scaling RL challenging? Neural reward models may suffer from reward hacking in the large-scale RL. However, autonomous coding applications have access to verifiable rewards Correct final answer for given inputs Execution Feedback Call To Action
Audio Transcript:
which roughly means that you can decide whether this is the correct answer or not. That's how SWE-Bench Verified scores work today. You have execution feedback, you have unit tests. So all of these possibilities, and of course this is an ongoing list, mean that you can design better reward functions.
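A minimal sketch of how the verifiable signals listed here (does the code run, do the unit tests pass) might be composed into a scalar reward for a coding rollout. The weights and structure are illustrative assumptions, not any lab's actual reward design.

```python
# Sketch of a verifiable reward for a coding rollout, built from execution
# feedback and unit-test results rather than a learned reward model.

def coding_reward(compiles: bool, tests_passed: int, tests_total: int) -> float:
    if not compiles:
        return 0.0
    # Partial credit for passing tests, full bonus only when all of them pass.
    frac = tests_passed / max(tests_total, 1)
    return 0.1 + 0.8 * frac + (0.1 if tests_passed == tests_total else 0.0)

print(coding_reward(True, 5, 5))   # 1.0
print(coding_reward(True, 3, 5))   # 0.58
print(coding_reward(False, 0, 5))  # 0.0
```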
Video Content:
What makes scaling RL challenging? Neural reward models may suffer from reward hacking in large-scale RL. However, autonomous scaling applications have access to verifiable rewards Correct final answer for given inputs Execution Feedback Call To Action
Audio Transcript:
Okay, so this means that autonomous coding is a great domain for scaling up RL. Then the question becomes: how does this have real-world impact? In software engineering applications, generating code is only one part of the system. If you look at end-to-end workflows for software engineering, there are many more parts to that system. How do you scale up your system to generalize across all of those domains?
Video Content:
What makes scaling RL challenging? Neural reward models may suffer from reward hacking in large-scale RL. However, autonomous coding applications have access to verifiable rewards Correct final answer for given inputs Execution Feedback Call To Action
Audio Transcript:
So that's the problem we are trying to solve at Reflection. Our mission is to build superintelligence, and we are starting with autonomous coding as the root-node problem for this mission. And we have a team of about 35 pioneers who have pioneered various legendary works in LLMs and reinforcement learning.
Video Content:
What makes scaling RL challenging? Neural reward models may suffer from reward hacking in large-scale RL. However, autonomous coding applications have access to verifiable rewards Correct final answer for given inputs Execution Feedback Call To Action Reflection team Team of 57 pointers in LLMs and RL with 20k+ v-tokens Alpaca/Alpaca++/AlpacaGen/ChatGPT/ChatGPT-3/Uninteract Microsoft AWS Smoq
Audio Transcript:
So if you're excited about this mission, you can reach out to one of us. My email is my last name at reflection.ai, and we would love to work with you. And with that, I can take questions. All right, same protocol as last time.
Video Content:
Reflection team Team of 57 pointers in LLMs and RL, with 2084+ citations. Acknowledgements: contributed by the session Presented by: Dr. Andrew Brown Sponsors: Microsoft AWS Smarter
Audio Transcript:
If you have a question, please come up to one of these three microphones we have distributed throughout. We can probably take one or two questions, so if you want to ask something, feel free. I'll do the first one while people are coming up. So I'm curious: it seems like the foundation model labs are trying to build one model and deploy it across everything.
Video Content:
Reflection team Team of 55 pointers in LLMs and RL, with 2084+ veterans. A reflection-based approach to the system. Developed by: Jiajiao Supported by: Microsoft, AWS, and Snoop Subscribe: YouTube
Audio Transcript:
Do you have an opinion with the work you're doing right now if you think that's the right approach or if you think there'll be more specialization on different languages or even like individual code bases, or do you feel like the best approach is just to have like one model that's trained across the greatest diversity of tasks possible? I think I will answer your question in terms of building coding agents does require
Video Content:
Reflection team Team of 55 partners in LLMs and RL, with 2084+ veterans. Aim: To develop AI models that are trustworthy, accountable, and transparent.
Audio Transcript:
multiple capabilities, and to get there you will definitely need multiple LLM calls. And then whether that's one model or multiple models, I think that's the secret sauce right now for most people. Fair enough. All right, please. Hi. I'm wondering, in the slide with the chart of the era of simulation, the era of something else, and the era of experience.
Video Content:
Reflection team Team of 57 partners in LLMs and RL, with 2084+ veterans. A neuroethics-consulted by the team. Presented by: Dr. Andrew Dyer Subscribe: YouTube Follow: Twitter LinkedIn
Audio Transcript:
They had put in AlphaGo, and the previous one where they also played StarCraft or something. Those all used MCTS, which, maybe it's my unfamiliarity with them, but that's also simulated data. So we're using synthetic data for the era of experience as well.
Video Content:
Reflection team Team of 57 partners in LLMs and RL, with 2084+ veterans. Aim: To develop a framework for evaluating AI models.
Audio Transcript:
So why is that called simulation, and why is what we're doing right now not called simulation? What's the overlap between simulation and experience? How do you think about that? I could ask Dave that question, you know, but going back to the point, I think the better way to answer it is roughly what Greg covered in the last talk, where his comment was that you can envision what scenarios might happen next and you're
Video Content:
Reflection team Team of 55 partners in LLMs and RL, with 2084+ veterans. Able to communicate with the system Microsoft, AWS, and SMOOT
Audio Transcript:
basically using that to build your reinforcement learning. So you're doing rollouts and you're basically building based on that. In real world, in most scenarios, you have an imperfect rollout. So you don't have full knowledge of how the system might work. Simulation is possible in certain domains
Video Content:
Reflection team Team of 57 partners in LLMs and RL, with 2084+ veterans. Aim: To develop AI models that are more human-like and ethical.
Audio Transcript:
where you do build a world model, which is closer to robotics and all the work that's happening in the physical AI space, right? But in the real world applications, which is what we're targeting, you will have imperfect things. So you have to actually experience the real world and you have to collect some data and that data is not gonna be in any way complete,
Video Content:
Reflection team Team of 55 pointers in LLMs and RL, with 2084+ citations. Acknowledgements: contributed by the speaker Presented by: Jiajiao Supported by: Microsoft, AWS, and Snoop References: PPT
Audio Transcript:
nor will it completely cover the exponential search space that could exist. Awesome. Thank you so much, Aakanksha. We'll have to stop it there, but I assume you'll be around afterwards to answer questions? Awesome. If you have more questions, please do. Next, we're going to welcome to the stage Ryan Marten. So Ryan is one of the founding engineers at Bespoke Labs.
Video Content:
Reflection team Team of 55 pointers in LLMs and RL, with 2084+ veterans. AIAE Microsoft AWS Smoq
Audio Transcript:
Sounds like he's got a friend. Anyway, we are very excited. So Ryan's going to be talking about a project that he worked on, that they worked on: building a reasoning model called OpenThinker that is actually able to outperform the distilled versions of R1. So he's going to talk through what they built there and what we can learn from it.
Video Content:
A speaker is presenting at an event titled "Reasoning + RL". The event features sessions on training agentic reasoners, measuring AGI with interactive reasoning benchmarks, and effective reasoning distillation at scale. This session covers data recipes for reasoning models, presented by Ryan Marten of Bespoke Labs.
Audio Transcript:
So welcome. Thank you. Thank you. Thank you. So yeah, I'm Ryan. I'm a founding engineer at Bespoke Labs, and today I'm going to talk to you about OpenThoughts, which is our project to create the best open-source reasoning datasets. And I'll be switching tack a little bit from our earlier discussions on reasoning and RL and focusing on the reasoning part,
Video Content:
OpenThoughts: Data Recipes for Reasoning Models. Ryan Marten, Bespoke Labs.
Audio Transcript:
and you'll see why. So just so we're on the same page: we've talked a lot about reasoning, but what's actually going on here? I like this graph from Jason, which shows the incredible performance gains that have happened in the last several months, where models are getting much, much better on certain benchmarks. And if you look at that, this is reasoning, this is test-time scaling.
Video Content:
A speaker is presenting at an event called OpenThoughts, discussing progress on AI benchmarks over the past five years. The presentation includes slides with graphs showing the performance of three different models: Three questions (ThreeQ), Contextualized (Contextual), and Predominant (Predom). The graphs illustrate the accuracy of these models over time, with the Contextual model showing significant improvement compared to the other two. The speaker highlights the importance of these advancements in AI research and development.
Audio Transcript:
I think everyone here is quite familiar with this, and it seems that certain tasks like AIME, which are competitive math problems, really respond when models are able to think step by step and do these long chains of thought. So let's go back to DeepSeek R1. Now, DeepSeek R1 was really impressive to a lot of people for a lot of reasons, and RL was a big part of that.
Video Content:
A speaker is presenting on a stage, discussing advancements in AI reasoning over the past five years. The presentation includes slides with graphs showing the progress of AI benchmarks, highlighting three quadrants: Theory, Practice, and Prediction. The speaker explains how these advancements have led to significant improvements in AI capabilities, particularly in areas like understanding context and making predictions based on data. The presentation also mentions Microsoft, AWS, and SMOQ as key players in this field.
Audio Transcript:
But I was also particularly interested because DeepSeek R1, at the end of the day, is an SFT model. The final weights that they released are actually from the DeepSeek V3 base, fine-tuned on 800K SFT examples, 600K of which are reasoning. Of course, you can see here that RL was a big part of it, and RL was used heavily to create
Video Content:
Strong Reasoning Through SFT
Audio Transcript:
that model which generated this data. But at the end, it was SFT and a little bit of RL for alignment. So this is really interesting and surprising. And the other thing that was really interesting and surprising to us was these small reasoning models that DeepSeek released, which were incredibly strong. And this for us was
Audio Transcript:
a huge motivation to try to do this ourselves. And why is that interesting? Because if we go back to here, no additional detail was really given on these datasets. So if you want to create strong reasoning models, we now sort of have a training recipe, but we don't have the data recipe. That's the missing link. Okay, I also want to include a slide here on why it is interesting to train your own reasoning
Audio Transcript:
models? I'm partially taking this from Amir's talk yesterday on open source in the enterprise, which I really liked. There are these main points: performance, privacy, latency and cost, and then ownership and destiny. I think reasoning is a great tool to solve a problem, and you shouldn't limit yourself in your toolbox if you're trying to solve a specific domain task.
Video Content:
Training your own Reasoning Models AIE Microsoft AWS SMOB Performance Privacy Latency / Cost Ownership / Custody
Audio Transcript:
So as we talked about before, RL is a great tool in this toolbox to tackle reasoning tasks, but we're going to see here that SFT is, as Nathan put it this morning, extremely easy and extremely effective. Okay, great. Now, the missing link: how do we actually solve for this reasoning data recipe?
Video Content:
Training your own Reasoning Models Performance Privacy Latency / Cost Ownership / Custody
Audio Transcript:
There's all these questions that we had when we started. How much data do you really need? What data creation steps are necessary? What are the optimal choices for each step in that data creation pipeline? And then how do you even go about figuring all this out? And this is the meat of the Open Thoughts project. So today we're excited to announce
Video Content:
Solving for the Reasoning Data Recipe • How much data do you need? • Which data curation steps are necessary? • What are the best choices for each curation step? • How do you figure this out?
Audio Transcript:
Open Thoughts 3, which is hot off the presses; it just came out two hours ago, and it's our latest and greatest version of our reasoning datasets. Yeah. Thank you. And now this is the state-of-the-art reasoning dataset recipe. So you can see here, these graphs are showing accuracy on three of these reasoning benchmarks,
Audio Transcript:
AIME, which is competitive math, LiveCodeBench, which is competitive code, and GPQA Diamond, which is our science questions. On the y-axis, you see accuracy going up. On the x-axis, you see the data scale going up. So we heard before that scaling is difficult, particularly difficult with RL. The good news is that for SFT, scaling is quite a bit easier. You can see here we compare to other open reasoning datasets.
Video Content:
A speaker presents a slide titled "OpenThoughts3 is the SOTA reasoning dataset recipe," showing accuracy on AIME 2025, LiveCodeBench, and GPQA Diamond across dataset sizes ranging from roughly 1K to 100K+ examples, with OpenThoughts compared against other open reasoning datasets. The background features logos of Microsoft and AWS.
Audio Transcript:
So Nemotron Nano: Nvidia released this great model, Nemotron Nano, and they also released the dataset used to train it. So we compare directly, by training on the same base model, between our dataset, which is our dataset recipe, and the Nemotron Nano data, which is the Nvidia recipe. And you can see here there's a significant gap, so we've shifted this scaling curve upwards.
Video Content:
Slide: "OpenThoughts3 is the SOTA reasoning dataset recipe." Graphs of accuracy on AIME, LiveCodeBench, and GPQA Diamond versus dataset size (10K to 1000K examples).
Audio Transcript:
Great. So yeah, this is the state-of-the-art 7B open-data reasoning model. You can see we have measured across the domains of interest, science, code, and math, and then a couple of held-out benchmarks. So our original goal was to reproduce, to find the missing link for, the DeepSeek distill
Video Content:
OpenThoughts3-7B is the SOTA 7B open-data reasoning model. AIE Microsoft AWS SMOKE
Audio Transcript:
models. You can see here we've crushed that goal. So we're significantly outperforming the DeepSeek-R1-Distill-Qwen-7B model, which we started off trying to reproduce. And then compared to the Nemotron Nano model, which is trained on a different base model, we are also outperforming on some benchmarks and similarly competitive on others.
Audio Transcript:
So, okay, let's actually talk about how we achieved this. This is the interesting part for you. If we go back to the scaling graph, you can see, once again, on the x-axis we're scaling dataset size. This is a huge lever to increase accuracy, and the thing here is it gets exponentially more expensive
Video Content:
Slide: "How we achieved this." AIME 2025 accuracy (%) versus dataset size.
Audio Transcript:
as you keep going and going. And then, vertically, you can see that we've shifted the scaling curve up. So this is what I was talking about before: this is improving the dataset recipe. Given a fixed dataset recipe, you can always scale it larger and you can always get higher performance. But if you want to push your performance to the absolute maximum, the real question is,
Video Content:
Slide: "How we achieved this." Improving the dataset recipe shifts the AIME 2025 scaling curve up.
Audio Transcript:
how do I create the best dataset? And therefore, what is the best recipe for the dataset? OK, so enough teasing here. Let's go into the meat of it. This is how we approached this problem. We broke down the dataset pipeline into sourcing questions, mixing different sources of questions,
Video Content:
Slide: "How we achieved this." AIME 2025 accuracy versus dataset size, alongside a diagram of the curation pipeline from source datasets to the final dataset.
Audio Transcript:
filtering down to the highest-quality questions, generating answers with a teacher model, so that's distillation, and then filtering out bad answers. And lastly, at the end of this entire experimentation, we looked at what the best teacher models are: which teacher models should we select? So through this entire pipeline, we've come down to this final dataset recipe. Now, this was a ton of work. This is a
Audio Transcript:
screenshot of our hugging face page. So you can see created over 5,000 datasets and almost 3,000 models. For this project, it was only around 1,000 experiments, but just to give you an idea of how rigorously we looked at the different decisions in each of these steps of the pipeline. And also I think this is interesting because it peels back the curtain a little bit on
Audio Transcript:
maybe what the frontier labs are doing: finding signal at the smallest scale possible, trying out as many things as possible, empirically choosing the best, and then scaling. And often, when you scale, you see that what was best at the small scale doesn't actually work. But if you're lucky and you've done good science, then your YOLO run will be the best possible, right?
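To make the recipe concrete, here is a minimal sketch of the curation pipeline described above; the function names and callables are hypothetical placeholders, not the actual OpenThoughts code.

```python
from typing import Callable, Iterable

def build_reasoning_dataset(
    questions: Iterable[str],
    generate_trace: Callable[[str], str],   # teacher model call (the distillation step)
    keep_question: Callable[[str], bool],   # question filter (difficulty / quality)
    keep_answer: Callable[[dict], bool],    # optional answer filter
    n_samples_per_question: int = 16,
) -> list[dict]:
    """Hypothetical sketch: filter questions, then sample multiple reasoning
    traces per question from a teacher model, then filter the answers."""
    dataset = []
    for question in questions:
        if not keep_question(question):
            continue
        for _ in range(n_samples_per_question):
            example = {"question": question, "response": generate_trace(question)}
            if keep_answer(example):
                dataset.append(example)
    return dataset
```

Passing the filters and the teacher call in as callables keeps the skeleton domain-agnostic, which matches the talk's point that the best choice for each step varies by domain.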
Audio Transcript:
Okay, so these are the key learnings that we had from our dataset recipe, and this is what you can take away. The first thing, which is pretty surprising, is that sampling multiple answers, so multiple reasoning traces per question, in your dataset works really, really well.
Video Content:
Slide: "Improving the Dataset Recipe." Bullet: sampling multiple answers per question from a teacher model.
Audio Transcript:
The performance does not go down at a fixed scale. If you take a fixed budget of, say, 30K examples, sampling once per question for 30K unique questions performs pretty similarly to taking one sixteenth of the questions, so 30K over 16, and sampling each of those 16 times, which is quite cool.
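As an illustration of the two sampling strategies being compared, here is a rough sketch assuming an OpenAI-compatible endpoint; the model name and the questions_30k list are placeholders.

```python
from openai import OpenAI

client = OpenAI()
TEACHER = "teacher-reasoning-model"  # placeholder model name

def sample_traces(questions: list[str], samples_per_question: int) -> list[dict]:
    dataset = []
    for question in questions:
        resp = client.chat.completions.create(
            model=TEACHER,
            messages=[{"role": "user", "content": question}],
            n=samples_per_question,   # multiple reasoning traces per question
            temperature=1.0,
        )
        dataset.extend(
            {"question": question, "response": choice.message.content}
            for choice in resp.choices
        )
    return dataset

# Strategy A: 30K unique questions, one trace each.
# dataset_a = sample_traces(questions_30k, samples_per_question=1)
# Strategy B: 30K/16 unique questions, 16 traces each -- similar accuracy, per the talk.
# dataset_b = sample_traces(questions_30k[: len(questions_30k) // 16], samples_per_question=16)
```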
Audio Transcript:
So this is really cool because this allows you to scale by 16x, which is more than an order of magnitude. And if you remember the graph from before, that corresponds to a pretty large increase in accuracy. The other surprising thing we found was that a better model, in terms of its own performance on evaluation benchmarks, is not necessarily a better teacher model.
Audio Transcript:
I think a good way to think about this is a brilliant researcher who's maybe a terrible lecturer, right? We found specifically that QwQ-32B was a stronger teacher model than DeepSeek R1, so we switched to that in our recipe, even though previously everyone had been using R1. We also found that the sources of data that had synthetic questions were actually quite good. Some of the top sources that we selected
Video Content:
Slide (updated): "Improving the Dataset Recipe." Added bullets: QwQ-32B is a stronger teacher than DeepSeek R1, although it scores lower on reasoning benchmarks; synthetic question generation scales well and can beat manually written questions.
Audio Transcript:
were entirely synthetic, and better than sources that, say, were scraped from forums or had humans manually write things. And this is also really good news, because synthetic question generation is scalable. So once again, we go back to the x-axis and we can push even further, which is an accuracy boost. Question filtering also works well.
Audio Transcript:
Here we filtered questions by asking a language model, how difficult is this question, and then taking only the hardest questions. We also had a language model try to answer that question and looked at the length of that answer. So these are sort of proxies for the same thing. You can imagine that if a problem is a lot harder, then a language model will think more and it will produce more text,
Video Content:
Slide (updated): "Improving the Dataset Recipe." Added bullet: question filtering with LLM difficulty labels and response length works better than embedding-based filters.
Audio Transcript:
so its answer will be longer. And these things worked better than embedding-based approaches or FastText classifiers, which is interesting inasmuch as those approaches were typical for pre-training. So it seems that data filtering for post-training is quite different than for pre-training.
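A rough sketch of the two filtering proxies just described, assuming an OpenAI-compatible client; the judge model name, prompt wording, and thresholds are all made up for illustration.

```python
from openai import OpenAI

client = OpenAI()
JUDGE = "judge-model"  # placeholder

def difficulty_label(question: str) -> int:
    """Proxy 1: ask an LLM to rate difficulty (prompt wording is illustrative)."""
    resp = client.chat.completions.create(
        model=JUDGE,
        messages=[{
            "role": "user",
            "content": "Rate the difficulty of this problem from 1 to 10. "
                       "Reply with a single integer.\n\n" + question,
        }],
    )
    return int(resp.choices[0].message.content.strip())

def attempt_length(question: str) -> int:
    """Proxy 2: harder questions tend to elicit longer attempted answers."""
    resp = client.chat.completions.create(
        model=JUDGE,
        messages=[{"role": "user", "content": question}],
    )
    return len(resp.choices[0].message.content)

# Keep only the hardest questions, using whichever proxy suits the domain
# (difficulty labels for code, response length for math/science, per the talk):
# hard_code_qs = [q for q in code_questions if difficulty_label(q) >= 8]
# hard_math_qs = [q for q in math_questions if attempt_length(q) > 2000]
```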
Audio Transcript:
Some things that didn't work that were also quite interesting. Through our experiments, you saw that choosing a smaller number of high quality sources was much better than trying to optimize for diversity by going for a larger number of sources. It's very counterintuitive, right? You'd think, okay, I'm always going to go for higher diversity, but this is actually not what we saw. The last thing was interesting is that people talk a lot about verification, which is obviously
Audio Transcript:
very important for RL. And we actually see that for SFT and distillation, it didn't seem that filtering based off of the answer, or verifying the answer, really helped at all. This is quite surprising. And I think there's some good research in the literature about why this might be, because if you have the hardest problems, it might still be helpful,
Video Content:
Slide (updated): "Improving the Dataset Recipe." Added bullet: a small number of high-quality question sources works better than optimizing for diversity across many sources.
Audio Transcript:
even if you have an incorrect answer to that hardest problem, keeping it in and seeing how the teacher model attempts it can still be useful; it's not just the final output that matters. Okay, great. So those are all the learnings we had for Open Thoughts 3, which we're super excited to share. But now you're probably thinking, okay, they've done a thousand experiments. I don't want to do a
Audio Transcript:
thousand experiments. I still want to create reasoning models. How do I adapt this if I want to create specialized reasoning models? So I guess the first thing I would say is be aware that based off of your domain, these exact choices might be a little bit different. I would suggest, okay, start with our recipe and then iterate on it. If you have capacity and compute, try a couple different choices for each step in the pipeline.
Video Content:
Adapting the Dataset Recipe to your Domain. Different choices are better for different domains. Question filtering: LLM difficulty labels for code, LLM response length for math and science.
Audio Transcript:
And I think a good example of this is we studied each step in the pipeline differently by domain. So we studied it distinctly for code, science, and math. And we saw, for example, in the question filtering, which I talked about before, using difficulty labels worked well for code questions, but for math and science, it was a response length.
Audio Transcript:
And if you think about that for a second, it makes sense, because the response lengths for coding questions are very different, right? For AIME math, the answer is literally just a number between zero and a thousand, so the answer doesn't account for a large portion of the length. But you can imagine there are very simple coding questions for which the answer is still a lot of lines of code.
Audio Transcript:
So yeah, this is one thing to be aware of. The other thing, which I talked about previously, is synthetic question generation. It works so well that if you're in a specialized domain and you don't have a lot of data for your particular problem, then go ahead: transform that existing data into questions, expand it, throw those in as in-context examples,
Video Content:
Adapting the Dataset Recipe to your Domain. Question filtering: LLM difficulty labels for code, LLM response length for math and science. Synthetic question generation is your friend: expand and transform your existing data.
Audio Transcript:
and generate more data. So yeah, we built an open-source library for this called Curator, and you can try that out. And then lastly, I feel like everyone says this, but it can't be said enough: evaluation is paramount. If you don't know how well your models are doing or improving, then you cannot make good, principled decisions about your dataset recipe. We spent a lot of time on this. We also have an open-source library on GitHub called Evalchemy,
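For illustration, here is a generic sketch of synthetic question generation from existing data used as in-context examples. This is not the Curator API, just the general pattern, with placeholder names.

```python
import random
from openai import OpenAI

client = OpenAI()
GENERATOR = "question-generator-model"  # placeholder

def generate_synthetic_questions(seed_questions: list[str], n_new: int, k_shot: int = 3) -> list[str]:
    """Use a few existing questions as in-context examples to write new ones."""
    new_questions = []
    for _ in range(n_new):
        examples = random.sample(seed_questions, k=min(k_shot, len(seed_questions)))
        prompt = (
            "Here are example questions from my domain:\n\n"
            + "\n\n".join(f"- {q}" for q in examples)
            + "\n\nWrite one new question in the same style and domain. "
              "Reply with only the question."
        )
        resp = client.chat.completions.create(
            model=GENERATOR,
            messages=[{"role": "user", "content": prompt}],
        )
        new_questions.append(resp.choices[0].message.content.strip())
    return new_questions
```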
Video Content:
Adapting the Dataset Recipe to your Domain. Question filtering: LLM difficulty labels for code, LLM response length for math and science. Synthetic question generation is your friend. Rigorous evaluation is paramount: we built Evalchemy for this; for small eval sets, run many times and average.
Audio Transcript:
which takes care of this, and also takes care of the sharding and parallelism. And the key thing here is for very small evaluation sets: if you only have a handful of questions, you should run your model on those evaluation sets many times and average. So going back again to AIME, the competitive math questions, there are only 30 per year.
Audio Transcript:
So for our evaluations, we gave the model those 30 questions 10 times, and then we averaged to get the final signal to determine which data strategies were working better than others, because otherwise there's too much noise. Okay, this is also very, very interesting and surprising, and promising for you if you're specializing.
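A minimal sketch of that evaluation strategy: run a small benchmark many times and average, so the comparison signal between recipes isn't drowned out by sampling noise. The answer_question callable and the problems format are placeholders.

```python
from statistics import mean

def repeated_accuracy(problems: list[dict], answer_question, n_runs: int = 10) -> float:
    """Average accuracy over repeated runs on a small eval set (e.g., ~30 AIME problems)."""
    run_scores = []
    for _ in range(n_runs):
        correct = sum(
            1 for p in problems
            if answer_question(p["question"]).strip() == p["answer"].strip()
        )
        run_scores.append(correct / len(problems))
    return mean(run_scores)  # the averaged signal used to compare dataset recipes
```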
Video Content:
Adapting the Dataset Recipe to your Domain. Question filtering, synthetic question generation, and rigorous evaluation as above. Added bullet: you can surpass the teacher in some domains; verify answers and keep only high-scoring data points.
Audio Transcript:
It seems that you can actually surpass the teacher in some domains with distillation. This is super cool. Usually you think that only RL can push the frontier, and distillation is just about catching up to the teacher. But no, that's not the case. So we have an example, it's in our paper, where we looked at the legal reasoning domain: the problem of classifying Supreme Court decisions.
Audio Transcript:
And what we did is we took 2K unique questions, we sampled five answers per question, and then we did do verification here, which did matter. So we threw away any answers that were incorrect. And when you fine-tune the 7B model, it surpasses R1,
Audio Transcript:
which is a very strong reasoning model and also a very huge reasoning model. So this is very exciting, and there's a lot more research and also application to be done here. Okay, cool. So everything's open. It's Open Thoughts, and Open Thoughts means open. Go out and build. We've got our detailed paper.
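A rough sketch of the legal-domain recipe just described: sample several answers per labeled question, keep only traces whose final answer matches the label, and fine-tune on what survives. The helper callables are hypothetical placeholders, not the actual code from the paper.

```python
def build_verified_dataset(
    labeled_questions: list[dict],          # e.g., {"question": ..., "label": ...}
    generate_trace,                         # teacher model call returning a reasoning trace
    extract_answer,                         # parse the final answer out of a trace
    samples_per_question: int = 5,
) -> list[dict]:
    kept = []
    for item in labeled_questions:
        for _ in range(samples_per_question):
            trace = generate_trace(item["question"])
            if extract_answer(trace) == item["label"]:   # verification filter
                kept.append({"question": item["question"], "response": trace})
    return kept
```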
Video Content:
Slide: "Go forth and build!" Links to the paper (all the details), the model weights, the OpenThoughts repo, the evaluation library (Evalchemy), and the data generation library (Curator).
Audio Transcript:
It's just out this morning. We've got the weights and the dataset. We have a ton of repos: for code, for data generation, for evaluation, and for synthetic data. So check those out. This is the team. It was a huge group of people and a lot of work over many months. I think we're all very proud of what we did, but there are lots of people to recognize here.
Audio Transcript:
If you scan that QR code, it goes to the tweet, and everything about the Open Thoughts project is linked from there. Thank you so much, Ryan. That was fascinating. Looks like we already have at least one question lined up.
Video Content:
Slide: The OpenThoughts Team, with a list of team members and the URL OpenThoughts.ai. Sponsor logos: Microsoft, AWS.
Audio Transcript:
Again, we have time for maybe a couple of questions. So if you have questions, please line up and we'll do it. Actually, before we get to those questions, I will say, as people are leaving, we are going to be back here at 2 o'clock. We've got an excellent afternoon planned on this track. We've got Nathan Lambert. We've got Christian Szegedy,
Audio Transcript:
who's a co-founder of xAI, and it's going to be a really great track at 2 o'clock back in this room. Also, one more thing: if you do have questions for any of the speakers from this morning, hopefully they're going to be able to stick around. Don't let them go to lunch. They're sitting up here at the front, so swarm them as soon as we're done. But for now, let's get a couple of questions. Go ahead, yes, over there.
Audio Transcript:
Great talk. So, two questions. One is, if you're just using SFT on this data, what's the difference between this and regular SFT? This is just regular SFT. Oh, okay. So then how is regular SFT able to make the models think longer? Because I thought for the reasoning models, they have a thinking block and they think for minutes or hours.
Audio Transcript:
Exactly. So how does SFT make it think for hours? You're doing supervised fine-tuning on the questions, and the answers also contain the thinking. So the model learns to use its context window and produce these long thinking traces. People call SFT imitation, but it can learn this format all the same.
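To illustrate the answer, here is a minimal sketch of what such an SFT example can look like: the assistant turn contains the full reasoning trace plus the final answer, so the model learns to produce long chains of thought. The <think> delimiter follows the DeepSeek-R1 convention; the exact chat template is an assumption and varies by model.

```python
def to_sft_example(question: str, reasoning_trace: str, final_answer: str) -> dict:
    """Pack a question, its reasoning trace, and the final answer into one SFT example."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {
                "role": "assistant",
                # The thinking is part of the supervised target, not stripped out.
                "content": f"<think>\n{reasoning_trace}\n</think>\n\n{final_answer}",
            },
        ]
    }
```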
Audio Transcript:
Thanks. All right, we'll take one from this side. Great presentation, Ryan. One question: why do you think a smaller model like QwQ-32B was a better teacher than DeepSeek R1? What was your insight in figuring out that a good professor can make a bad lecturer? Yeah, that's a great question.
Audio Transcript:
I think this is something we need to investigate more, but you can see that when you look at charts of the length of reasoning traces, you can see the distributions are different. So it might be the case that you're using more of your context window, using more tokens, more steps. It also might be the case that you just have a better formatted response, better output.
Audio Transcript:
This is another great open research question. Interestingly, I'll also say on this point, we also tried Claude as a teacher, which is a very good, strong model, and it was just a terrible teacher. So yeah, it's interesting what actually makes a good teacher. All right, we'll take one more very brief question from this side, and then those of you still waiting on questions
Audio Transcript:
after we have closed this up, swarm them. So, great talk, Ryan. We're doing a similar kind of thing, but I just had a question: do you have any pattern map as to where, in the reasoning chain of thought, things don't work? At what level, in the eval, do you find out that things are not working or it's not reasoning correctly? Is there a pattern map or something
Audio Transcript:
that you have in your open source? Sorry, I didn't catch that. Is there a... So if there are five steps of reasoning to reach a final conclusion, at what step does the reasoning go awry? Yeah, this is a great question. We don't do this fine-grained analysis, but there is a ton in the literature about this, where, yeah, there's a sort of critical step where it gets things wrong.
Audio Transcript:
We did the simplest thing possible, right? You could also go in and try to do more complicated things, at evaluation time, where you're doing interventions to maybe detect steps that have gone awry and change them. Or you can do this when you're creating the dataset, so you could potentially rewrite things.
Audio Transcript:
But everything that we tried in terms of messing with the reasoning trace wasn't helpful, so yeah, I think there's still more to explore there. This is really just the start of everything in reasoning. Awesome. Thank you everyone for your questions, and one more time, thank you to Ryan for everything you shared. All right, and again, we'll be back in this room at 2 o'clock sharp. We have three more presentations; looking forward to seeing you then. Thank you.
Video Content:
Afternoon schedule slide for the Reasoning + RL track: "How to Train Your Agent: Building Reliable Agents with RL" (Kyle Corbitt, OpenPipe), "What Reinforcement Learning with Verifiable Rewards Changed" (Nathan Lambert, AI2), and "Towards Verified Superintelligence" (Christian Szegedy). Sponsored by Microsoft and AWS.
Audio Transcript:
Hey everyone. Glad you're all here. This is the reasoning and reinforcement learning track on the afternoon of the last day of the AI Engineer World's Fair. Glad you're all here, glad you're sharing it with us. To let you know what the schedule is going to be: I'm going to be talking right now, and I'll give myself an introduction in a moment. Directly after this, we're going to have Nathan Lambert from AI2. He'll be coming up.
Video Content:
A speaker is presenting at the "Reasoning + RL" track, sponsored by OpenPipe, Microsoft, and AWS. The speaker, Kyle Corbitt from OpenPipe, introduces a talk on training agents with reinforcement learning, with slide titles including "How to Train Your Agent: Building Reliable Agents with RL" and "What Reinforcement Learning with Verifiable Rewards Changed."
Audio Transcript:
Very excited to see that. And then Christian Szegedy, formerly of xAI, one of the co-founders there, will be speaking last. So hopefully you're all able to stay for those sessions as well; I think they'll be excellent. To introduce myself, my name is Kyle Corbitt. I am one of the co-founders of a company called OpenPipe. What we do is reinforcement learning to make agents more reliable.
Video Content:
Building Reliable Agents with RL Kyle Corbitt | OpenPipe @corbitt Microsoft | AWS | SMO
Audio Transcript:
We work with large companies, usually Fortune 500s, things like that, and help them make their agents more reliable so they can deploy them in production. If that describes any of you and you want to talk to me after, I'm happy to chat later. But today what I'm going to talk about is a very specific case study that we did. This case study, I'm going to talk about lessons learned very concretely, what did and didn't work, how are we able to build an agent that worked well with reinforcement learning.
Audio Transcript:
All of this, everything that I'm talking about in this presentation, is an open-source code base that we built. We wanted to share these learnings, and I'll share that link with you at the end as well, for those of you who want to replicate what we did. So what is the project we're going to be talking about? It's a project called ART-E. It is a natural language assistant that helps you
Audio Transcript:
answer questions from your email inbox. So I'll give you an example of what we're talking about here. Let's say you want to ask, in this case our example question is, "When is Sherry's move to Portland targeted for?" So you would ask this question to the assistant. It then goes and searches your inbox. It's got several tools: it has a search tool, it has a read-email tool, and then it can actually answer the final question. You can kind of see, if you look here,
Video Content:
Case Study: ART-E Search your inbox with natural language AIE Microsoft AWS SMOOT
Audio Transcript:
what's going on behind the scenes. This is important to get a sense of kind of how this agent works. And as we're talking through how we built it, how we made it work, hopefully that helps make the conversation very grounded in a specific task. So anyway, you see the agent, it's, you know, searching for certain keywords, it gets those messages back, it's in reading one of them, and then answering the question. That's what it does.
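To ground the discussion, here is an illustrative sketch of the kind of tool loop such an agent runs, assuming an OpenAI-compatible chat API with function calling. The tool names, model name, and execute_tool callback are hypothetical, not the actual ART-E code (which is in the open-source repo Kyle links at the end).

```python
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {"type": "function", "function": {
        "name": "search_inbox",
        "description": "Search emails by keyword; returns message ids and snippets.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "read_email",
        "description": "Return the full body of an email by id.",
        "parameters": {"type": "object",
                       "properties": {"message_id": {"type": "string"}},
                       "required": ["message_id"]}}},
]

def run_agent(question: str, execute_tool, model: str = "agent-model", max_turns: int = 6) -> str:
    """Loop: let the model call tools until it produces a final answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content                     # final answer, no more tool calls
        messages.append(msg)
        for call in msg.tool_calls:
            # execute_tool should return a string result for the given tool name and JSON args
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Could not answer within the turn limit."
```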
Audio Transcript:
Okay, so a question: once we've decided this is the task we're trying to solve, why would you use reinforcement learning for this specifically? And the answer is: to start with, you shouldn't. In fact, to start off with, we did not. So for the first version of this agent, once we decided we wanted to build this, we didn't use any reinforcement learning at all.
Video Content:
Step One: Build Prompted Agent
Audio Transcript:
We purely built this on prompted models. And this is the first lesson from this talk that I want to share is I would generally always recommend starting with getting the best performance you can with a prompted model before going to any training, including reinforcement learning. There's a few different reasons to do that, three specifically. The first one is just like working out the bugs in your environment, right? You know, maybe your tools aren't implemented properly,
Video Content:
Step One: Build Prompted Agent 1. Ensure tools work
Audio Transcript:
maybe they don't have access to the data you think they do. We find this happens a lot, and it's a lot less frustrating to debug that separately from debugging your training loops. So you want to make sure that you can get at least some kind of performance before you start training. And then second of all, you may find, as you're trying to improve the performance on using these prompted models
Video Content:
Step One: Build Prompted Agent 1. Ensure tools work 2. RL maybe not needed
Audio Transcript:
that you can get it working really well, and that's great. So that means you don't need to train anything, and that saves you a lot of time. There's a third reason as well that I'll share, which is basically once you've gone to that effort, you've done your best to get the best quality prompted baselines you possibly can. Then if you find that those baselines are not able to get you where you need to go, and you're able to surpass them
Audio Transcript:
with reinforcement learning, it feels great. You get to gloat and be like, yes, I was able to beat the frontier models on my task. I recommend it. Feels good. You can post on X about it. There are nice graphs and stuff. So this is what it looks like when everything goes right. This is an example of a training run for this ART-E model that I'm going to be talking about.
Audio Transcript:
You can see that there are these lines for each of the prompted model baselines that we've got. So we've got o3, o4-mini, and then Gemini and GPT-4.1. And you can see those ones have a certain level of performance. And then you can see this sort of moving line. This is the model that we trained. And you can see it actually starts out significantly worse than these other models from the start.
Video Content:
A speaker is presenting in front of a large screen displaying a graph titled "Fraction of Questions Answered Correctly." The graph compares the trained ART-E model (Qwen 2.5 14B) against prompted baselines such as o3, o4-mini, Gemini, and GPT-4.1 over training steps, with the fraction of questions answered correctly on the y-axis.
Audio Transcript:
That's because we started from Qwen 2.5, the 14-billion-parameter one. It's a relatively small, relatively weak model, and so it was doing much worse than these initially. But you can see, as training progresses, initially at the beginning it's maybe learning the right way to do tool calls, there's a very sharp bump as it figures out the basic stuff, and then a more gradual climb until eventually it's
Audio Transcript:
able to significantly outperform any of the prompted models on this task. And this is sort of what you're, you know, in the ideal case, when everything works, this is what you're looking for. This is what you're hoping to achieve. This is another view actually of that same data we were just looking at. I wanted to highlight it in this way because it's important to realize, so on the last graph it looked like the lines sort of asymptote out pretty close together.
Audio Transcript:
That's because they're getting near 100%. But you can see, for example, with our best prompted model here, o3, it's at 90% accuracy, and with our RL model we're able to get up to 96%, so the error rate drops from 10% to 4%. And so one way to think about that is that 60% of the errors o3 was making are actually solved with our model,
Video Content:
A speaker stands in front of a large screen displaying a bar graph titled "Percentage of Questions Answered Correctly," comparing ART-E against the prompted baseline models.
Audio Transcript:
which is quite a large improvement. We find that can actually be very, very important for the user experience of someone using one of these: if you're getting half as many errors, that can make the product much stronger. So this is where we got to on accuracy. There are a couple of other metrics that we find are often very important, and the trade-off between them is very task-dependent, but they matter in many cases.
Audio Transcript:
Cost obviously is a big one. So for this email agentic harness that we had, we benchmarked the cost on o3, o4-mini, and our model. If you wanted to do 1,000 searches using o3, that's going to cost $55, which is a lot; I think for most use cases that would probably be cost-prohibitive just from a unit-economics point of view. On o4-mini we're down to $8, but that's still quite
Video Content:
A presenter is giving a presentation about the cost of running the agent. The screen behind him displays a bar graph comparing the cost per 1K runs for the different models: roughly $55 for o3, around $8 for o4-mini, and far less for the RL-trained model. The presenter is standing in front of a black background with sponsor logos displayed at the bottom of the screen.
Audio Transcript:
expensive. And then we drop another order of magnitude by moving to this smaller Qwen 2.5 14B. Again, this is just driven by it being a much smaller model, so it's much cheaper to run, but we're still able to get very good performance because we've specialized it on our task. Beyond cost and the accuracy, the third metric that often comes up is latency, particularly if you're doing, I mean, certainly anything with voice,
Video Content:
A presenter is giving a presentation about AI and machine learning. The screen behind him displays bar graphs comparing cost per 1K runs and full-run latency for different models, including o3, o4-mini, and Qwen 2.5 14B. The presenter explains the data, highlighting the cost differences between these models. He also mentions the full-run latency for each model, showing that the Qwen 2.5 14B model has the lowest latency at 1.96 seconds. The presenter emphasizes the importance of considering both cost and performance when choosing an AI model.
Audio Transcript:
but if there's any real-time human interaction with the task, latency is going to matter a lot. And we were able to find on this task, we were able to get significantly better latency. There's a number of different ways, which I'll go into in more detail later, that we were able to achieve this. One was just, again, moving to a smaller model helps. There's just less loading from memory, fewer matrix multiplies. It's just you're able to get tokens out faster.
Video Content:
A presenter is giving a presentation about AI and machine learning technologies. The slide behind him shows a bar graph comparing the full-run latency, in seconds, of different models, labeled o3, o4-mini, and Qwen 2.5 14B. The presenter is standing in front of the slide, gesturing towards it while speaking. The background includes logos for Microsoft, AWS, and other sponsors.
Audio Transcript:
We were also able to train this model to have fewer turns going back and forth with the database, with the actual email, the list of emails. We were able to train it to be more efficient with its queries, and I'll get to that in a moment, and so that lowers our latency. There's actually a third thing, which we didn't apply here, but can help a lot with these smaller models, which is called speculative decoding. That's something you can do on large or small models. It generally works better
Video Content:
A presenter is giving a presentation about AI and machine learning technologies. The screen behind him displays a bar graph comparing the full-run latency, in seconds, of different models, labeled o3, o4-mini, and Qwen 2.5 14B. The presenter explains the differences between these models, highlighting the latency benefits of the smaller specialized model. The presentation screen shows sponsor logos for Microsoft, AWS, and others.
Audio Transcript:
on smaller task-specific models, because you get higher acceptance rates on your speculator. But basically, there's lots of reasons why smaller models work better. Okay, so then the next question, for those of you who haven't done this yet, is like, okay, what is the effort required to actually achieve these results? If you'd asked me this question a year ago, I would say, hey, you should really only be doing this
Video Content:
A presenter stands at a podium, addressing an audience. The background features a large screen displaying a bar graph titled "Full-Run Latency (Seconds)". The graph compares three different configurations: g3, p4-min, and Dev2 2.5 168 + ML. The presenter discusses the performance differences between these configurations, emphasizing the impact of machine learning on latency. The screen also displays the logos of Microsoft, AWS, and SMOO, indicating the collaboration or sponsorship of the event.
Audio Transcript:
if you're a big company and willing to put, you know, months of work into a project. I think that's changing, I honestly do. In this case, so this training run, it cost us about $80 in GPU time. It did take about a week of engineering time to build this, and the caveat is that was with an engineer who is familiar with this domain and had quite a lot of experience with machine learning and RL.
Video Content:
How Hard Is This? Cost: $80 Time: 1 week (and dropping) AIE Microsoft AWS SMOJ
Audio Transcript:
But I actually expect, as we figure out the right patterns here collectively as an industry, this will keep dropping. And I expect that the sort of payback period to get a return on investment from these specialized models is actually going to continue falling as well. And, you know, part of the reason I wanted to give this talk is to sort of distribute that knowledge we learned
Video Content:
How Hard Is This? Cost: $80 Time: 1 week (and dropping) AIE ...and I'll show you how to do it Microsoft AWS SMOOT
Audio Transcript:
and hopefully move faster towards that world where this is just sort of like a thing everyone knows how to do and it's very easy and very fast. So that's what we'll be talking about for the rest of the time: some more of the lessons we learned. Okay, so when you are using RL to train an agent, or really using RL for anything else, I find that consistently with different problems we look at,
Video Content:
and I'll show you how to do it The two hard problems in modern RL 1. Realistic Environment 2. Reward Function
Audio Transcript:
there are sort of two hard problems that come up every single time. All right? And the two hard problems are, first of all, figuring out a realistic environment, right? So if you're training an agent, you need to be training it with realistic data, with realistic inputs and outputs, tools available, everything like that, to how it's going to be used in production. Because if you don't, then it's going to be optimizing for the wrong thing,
Video Content:
The two hard problems in modern RL 1. Realistic Environment 2. Reward Function
Audio Transcript:
and you won't get the results you want when you deploy it. And then the second thing, which sometimes is hard, sometimes isn't, this one is a little bit task-dependent, is getting the right reward function. So reward function, that just means when your agent has gone through and, say, in this case, given an answer to my email, you have to have some way of knowing, did it do a good job or a bad job. All right, that's the reward function. It's how you decide if it's good or it's bad.
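To make that idea concrete in code, here is a minimal sketch of what a reward function's interface typically looks like in this kind of setup. The dataclass fields and the golden-answer lookup below are illustrative, not taken from the project itself.

```python
from dataclasses import dataclass, field

@dataclass
class Rollout:
    """One complete attempt by the agent: the question it was given,
    the full trajectory of messages/tool calls, and its final answer."""
    question: str
    final_answer: str
    messages: list = field(default_factory=list)

# Illustrative golden answers for a verifiable task.
EXPECTED = {"When is the Q3 board meeting?": "October 14"}

def reward(rollout: Rollout) -> float:
    """Higher is better. For a verifiable task this can be as simple as an
    exact-match check; deciding what 'good' means is the task-dependent part."""
    expected = EXPECTED.get(rollout.question, "")
    return 1.0 if rollout.final_answer.strip().lower() == expected.strip().lower() else 0.0
```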
Video Content:
The two hard problems in modern RL 1. Realistic Environment 2. Reward Function
Audio Transcript:
Depending on the domain, sometimes that's really easy. We have, I don't know if Nathan's here, he's going to be talking next, but he and his team put together this thing called RLVR, which in some verifiable domains makes it actually very easy to do a reward. But not all domains are like that. Oftentimes it is kind of hard, and so it's somewhat task-dependent. I'm going to go through how we solved these problems
Video Content:
The two hard problems in modern RL 1. Realistic Environment 2. Reward Function
Audio Transcript:
specifically with ART·E. OK, first one, realistic environment. So for our ART·E task, what is the environment we need? What is the environment this agent is going to be operating in? Well, it needs these tools available. It needs to be able to go in and query an email inbox. It needs to be able to get emails back that look realistic. These emails, you know, the inbox should be large, because that's what most email inboxes are like.
Video Content:
The two hard problems in modern RL 1. Realistic Environment 2. Reward Function We need email inboxes that are Large Diverse Realistic
Audio Transcript:
The emails in it should be diverse, and they have to look kind of like real emails. So this could be kind of hard because you can't just go ask like a thousand people to give you their personal emails to train on. Luckily, in this case, we were able to solve this with the help of a company that has contributed a lot to just the open data ecosystem generally. It's like quite an iconic company.
Video Content:
We need email inboxes that are Large Diverse Realistic
Audio Transcript:
Perhaps I would call it a historic company. I'm of course talking about Enron. I'm hearing some laughter. So anyway, Enron was a financialized energy company in the '90s and 2000s that committed massive fraud and ended up getting shut down by the Department of Justice. As part of this process, the court cases they were going through,
Video Content:
We need email inboxes that are Large Diverse Realistic
Audio Transcript:
a dump of like 500,000 of their emails was released to the public as part of the discovery process. So that's great for things like this, and that's what we used as our environment for the email inboxes. All right, so now we've got realistic email inboxes with tens of thousands of emails that are real emails back and forth. Now we have to design our reward function. So as our agent is going through and, you know, we're asking it questions,
Video Content:
Thanks Enron! Problem 2: Reward Function AIE Microsoft AWS SMR
Audio Transcript:
and then it's giving us answers, we have to know is the answer correct or not, so we can reward it when it gets the answer right, and it can learn to do that better. There's different ways, and this part is very task dependent. The way that we went about it in this case was we basically turned it into a more of a verifiable problem. And the way we did that was we actually took our email inbox,
Video Content:
Problem 2: Reward Function Validated Synthetic Data
Audio Transcript:
we sort of inverted the problem. We grabbed batches of 20 emails at a time from the inbox and gave them to Gemini 2.5 Pro and said, hey, given this set of emails, give us a few questions that a user might realistically ask whose answers are found in these emails, right? And so Gemini generated the questions, it generated the answers, and then, of course, the source emails they came from.
Video Content:
A speaker is presenting on a topic related to "Validated Synthetic Data". The presentation includes a slide with bullet points detailing various aspects of the topic, such as data generation, validation, and usage scenarios. The speaker is standing in front of a screen displaying the slide, and there are logos of Microsoft, AWS, and SMOO at the bottom of the screen.
Audio Transcript:
And there were some extra steps on top of that. A lot of the questions it came up with looked a little bit unrealistic. We had a separate filtering step where we were like, OK, let's find the subset of these that actually look like questions that I would maybe ask. And we ended up with a list of a few thousand questions along with their verified answers. And so at this point, it becomes much more of a sort of verified thing.
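A rough sketch of that question-generation step, assuming an OpenAI-compatible client pointed at Gemini 2.5 Pro. The prompt wording, model name string, and JSON shape are illustrative rather than the exact ones used in the project, and the separate realism-filtering pass described above would be an additional step on top of this.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible gateway serving Gemini 2.5 Pro

client = OpenAI()  # base_url / api_key configured for whatever gateway you use

def generate_qa_pairs(email_batch: list[str], model: str = "gemini-2.5-pro") -> list[dict]:
    """Invert the problem: from ~20 real emails, synthesize realistic user
    questions whose answers are contained in those emails."""
    prompt = (
        "Here are some emails from one person's inbox:\n\n"
        + "\n---\n".join(email_batch)
        + "\n\nWrite a few questions the inbox owner might realistically ask that can be "
        "answered from these emails. Respond with a JSON object of the form "
        '{"questions": [{"question": "...", "answer": "...", "source_email_ids": [0]}]}.'
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["questions"]
```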
Video Content:
A man in an orange shirt is standing on a stage, speaking to an audience. He gestures with his hands as he talks about "Validated Synthetic Data." The background features a slide with bullet points about AIE (Artificial Intelligence Engineering), Microsoft, AWS, and SMOO. The slide includes topics such as data generation, data quality, data privacy, data security, and data governance. The man appears to be explaining these concepts and their importance in the context of AI engineering.
Audio Transcript:
The reward function becomes much easier because we know what the correct answer should be. And so the way we can tell if our agent did a good job is we give our agent the question, we let it go and search the email inbox and try and find the right emails and everything, and eventually it comes back with an answer. And then we can just use an LLM as judge, a very simple one, and say like, hey, here's the question, here's the golden answer that we believe is right.
Video Content:
A man in an orange shirt is standing in front of a screen displaying a presentation slide titled "Validated Synthetic Data." The slide lists various steps involved in creating synthetic data, such as data cleaning, data augmentation, and data distribution. The man appears to be explaining these steps to an audience. The slide also includes logos for Microsoft, AWS, and SMOO. The man gestures with his hands as he speaks.
Audio Transcript:
Here's the answer we got from our model. Is it right or not? We did have to do a little bit of iteration there, making sure that the judge was well calibrated on, like, what counts as correct or not. But by and large, this worked pretty well and was able to make this more of a verified task. So that's how we solved the reward function problem: by, you know, turning this into something
Video Content:
A man in an orange shirt is standing at a podium, speaking into a microphone. He gestures with his hands as he talks. The background is dark, and there is text on the screen behind him that reads "Reward: found answer = expected answer." There are logos for Microsoft, AWS, and SMOO in the bottom right corner of the screen.
Audio Transcript:
where we had more of a golden data set. Okay, so once you've solved those problems, once you have your environment, once you have your reward function defined, then basically you just kind of have to run a loop over and over and over again, where you have your agent go through and it tries to solve the problem, and then you figure out if it's good or it's bad,
Video Content:
A man in an orange shirt stands in front of a screen displaying the text "Reward: found answer = expected answer". He appears to be giving a presentation or lecture. The background includes logos for AIE, Microsoft, AWS, and SMOB.
Audio Transcript:
and then you just reward if it's good and punish if it's bad, and that's it. And you do this over and over and over again, and then hopefully, if you've got everything set up right, it learns what good looks like, it learns what bad looks like, and it starts doing it right. And then again, this is the curve we saw earlier, where you can see it, it starts getting better over time.
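Putting the judge and the loop together, here is a minimal sketch of what one training iteration might look like. The judge prompt and judge model name are placeholders, and `run_agent` / `update_policy` stand in for the actual rollout harness and RL trainer (for example a GRPO-style update), which are considerably more involved in practice.

```python
import random
from openai import OpenAI

client = OpenAI()  # any chat-completions-compatible endpoint for the judge

def judge_is_correct(question: str, golden: str, candidate: str,
                     judge_model: str = "gpt-4.1-mini") -> bool:
    """LLM-as-judge: does the agent's answer match the golden answer?
    As noted in the talk, this needed some calibration on what counts as correct."""
    verdict = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nReference answer: {golden}\n"
                f"Candidate answer: {candidate}\n"
                "Does the candidate convey the same information as the reference? "
                "Reply with exactly YES or NO."
            ),
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")

def train(policy, qa_dataset, run_agent, update_policy,
          num_steps: int = 500, group_size: int = 8):
    """The loop described above: roll out, score, reinforce, repeat.
    run_agent is assumed to return an object with a .final_answer attribute."""
    for _ in range(num_steps):
        for item in random.sample(qa_dataset, k=min(16, len(qa_dataset))):
            # Several rollouts per question, so rollouts can be compared against
            # each other (the group-relative 'advantage' idea from earlier).
            rollouts = [run_agent(policy, item["question"]) for _ in range(group_size)]
            rewards = [
                1.0 if judge_is_correct(item["question"], item["answer"], r.final_answer)
                else 0.0
                for r in rollouts
            ]
            baseline = sum(rewards) / len(rewards)
            advantages = [r - baseline for r in rewards]
            update_policy(policy, rollouts, advantages)  # one RL step
```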
Video Content:
A man in an orange shirt is standing in front of a screen displaying a slide titled "Reward: found answer = expected answer." The slide also includes the text "Run in a loop!" and logos for AIE, Microsoft, AWS, and SMOB. The man appears to be giving a presentation or lecture.
Audio Transcript:
Okay, a few other interesting learnings from this project. One thing is we found that you can actually throw a lot of stuff into your reward function beyond just the primary thing you're trying to solve for. And so we actually ended up with, like, sort of eight different little things that we gave extra credit for. I'm going to share two of them here. So the first one here is we're trying to have it optimize the number of turns,
Video Content:
A man stands at a podium, presenting in front of a large screen displaying the text "Bonus: Extra rewards!" along with sponsor logos. The screen also shows graphs titled "Average Number of Turns to Answer Question" and a second graph related to hallucinated answers. The presenter gestures towards the screen as he speaks, emphasizing the information being displayed.
Audio Transcript:
how many times back and forth, how many times it had to query the email inbox before it came up with the right answer, right? Because the most important thing, of course, is getting the answer right, but between two answers that both got it right, we would rather it took fewer turns back and forth, because that's fewer tokens, that's lower latency, lower costs. It's just, like, a more efficient agent. So you can see here on this first graph that early on,
Video Content:
A speaker is presenting at a conference or seminar, standing behind a podium with a microphone. The background features a large screen displaying graphs and text related to artificial intelligence (AI) and machine learning (ML). The graphs show the average number of tokens required to answer questions over time, with different models labeled as AIE, BERT, and others. The speaker discusses the concept of 'Bonus: Extra rewards!' and mentions Microsoft, AWS, and SMOI logos. The audience is not visible in the frame.
Audio Transcript:
while it was getting its feet wet and figuring out what worked, it ended up spiking up to over six turns on average, so it would go back and forth a bunch of times with the email inbox and try and find the right thing. But then once it was able to figure out how to use the tools efficiently, figure out the right way to construct keywords and find the right email, it was able to get very efficient, and actually faster, better than any of our prompted models
Video Content:
A man in an orange shirt is standing in front of a large screen displaying graphs and text. The screen has the words "Bonus: Extra rewards!" at the top. Below this, there are two graphs showing the average number of tests to answer questions over time, with one graph labeled "AIE" and the other labeled "AWS." There is also a graph showing the fraction of problems misclassified over time. The man appears to be explaining these graphs to an audience, gesturing with his hands as he speaks.
Audio Transcript:
on this metric of using fewer turns. And again, this was just because we gave it a little bit of extra credit. It was a very small amount relative to the reward for getting it right, but a little bit of extra credit for using fewer turns, and it was able to use that to optimize for that. Another extra reward function we gave it is to try and discourage it from
Audio Transcript:
hallucinating answers. So obviously the best thing is to get the right answer. If you can't find the right answer, it's much better to say, hey, I don't know, than to make up an answer in a situation like this. So we basically penalized it if the reward model said, hey, you got the answer wrong, but it had tried to give an answer, that was like a much lower reward
Video Content:
A man in an orange shirt is standing in front of a large screen displaying graphs and text. The screen has the title "Bonus: Extra rewards!" and mentions AIE, Microsoft, AWS, and SMO. The graphs show the average number of tests to answer questions over time, with different lines representing different methods. The man appears to be explaining the data on the screen.
Audio Transcript:
than if it just said, hey, I don't know, I can't solve this problem. And as you can see, that worked quite well. Compared to any of the prompted models, including o3, we ended up with a significantly lower hallucination rate because that was part of our reward function. Again, these are things that are just sort of like extra credit, but we found that you can throw in a bunch of these and it can jointly optimize all of them at the same time, which is super
Video Content:
A man in an orange shirt stands at a podium, presenting a slide deck. The slide includes two graphs and text. The first graph shows the average number of turns to answer a question over training, with a spike early in training and a decline afterward. The second graph shows the fraction of answers hallucinated over training. The text on the slide reads "Bonus: Extra rewards!" and includes sponsor logos. The man gestures towards the slides as he speaks.
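One way those extra-credit terms could be folded into a single scalar reward is sketched below. The weights and the specific bonuses are illustrative, since the talk only says that correctness dominates and that roughly eight small terms were added on top of it.

```python
def shaped_reward(correct: bool, attempted_answer: bool,
                  num_turns: int, max_turns: int = 10) -> float:
    """Correctness dominates; small extra terms nudge behavior.

    - Fewer turns: a small bonus for efficient tool use.
    - Hallucination: being wrong after giving an answer is penalized more
      heavily than saying "I don't know".
    """
    r = 1.0 if correct else 0.0
    # Small efficiency bonus, deliberately tiny relative to correctness.
    r += 0.1 * (1.0 - min(num_turns, max_turns) / max_turns)
    # Discourage confidently wrong answers relative to abstaining.
    if not correct and attempted_answer:
        r -= 0.5
    return r

# Example: a correct answer found in 3 turns vs. a hallucinated wrong answer.
print(shaped_reward(True, True, 3))    # 1.07
print(shaped_reward(False, True, 3))   # -0.43
print(shaped_reward(False, False, 3))  # 0.07
```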
Audio Transcript:
powerful. Okay, I want to talk a little bit about reward hacking. It's something that comes up a lot when you're trying to do this, and it's kind of a fun thing to talk about. This is an iconic video some of you might have seen. This was released by OpenAI almost a decade ago at this point. They had this environment where you were trying to get this boat to complete a race,
Video Content:
A speaker is presenting at a conference or seminar, discussing the topic of "Reward Hacking". The presentation includes slides with graphs and data related to AIE (Artificial Intelligence Engineering) and Microsoft's AWS services. The speaker is standing behind a podium with a laptop, and there is a large screen displaying the title "Reward Hacking" and various graphs. The audience is not visible in the frames.
Audio Transcript:
and instead of learning to complete the race, it learned that, oh, if I just go in this little circle that's not even part of the racetrack, I can just get a bunch of points. And so it just started doing that over and over and over again instead of actually finishing the race. This is something that comes up a lot if you're doing reinforcement learning. And it's basically just the difference between what you actually want the model to do and what you can measure, like,
Video Content:
A man in an orange shirt stands at a podium, presenting a lecture on reward hacking. The screen behind him displays a video game interface with a boat navigating through water, collecting items and avoiding obstacles. The game interface includes a map, a score counter, and various buttons. The presenter speaks about the game mechanics and how they relate to reward systems. The video also shows a graph labeled 'reward' with a line graph indicating a positive trend over time.
Audio Transcript:
what you're actually rewarding it for. And almost always, if you let one of these run long enough, it will figure out some way to exploit your measure, and it will figure out some way to get a really high reward without actually solving the problem. And you need to just watch for that. So I'm going to give a couple examples here. This is a graph from another project, actually, not this one.
Video Content:
A man in an orange shirt is giving a presentation on a stage. He is standing in front of a large screen displaying a graph with the title "Is this good?" The graph shows a line that starts at around 0.1 and rises sharply to about 0.3 by the end of the graph. The man is gesturing with his hands as he speaks, and there are logos for Microsoft, AWS, and SMOF visible in the bottom right corner of the screen.
Audio Transcript:
So an engineer on our team was working on this game called NYT Connections. Some of you might know it. You get 16 words and you have to put them in, like, four groups of four. It's quite a challenging game, especially for these language models, because it requires a lot of world knowledge and, like, you know, lateral thinking. Anyway, so they were trying to train this model to do it. And it wasn't figuring it out, it wasn't figuring it out, it wasn't figuring it out.
Video Content:
A man in an orange shirt is giving a presentation on a stage. He is standing in front of a large screen displaying a graph titled "w/forward". The graph shows a line that starts at a low value and rises sharply as the x-axis increases. The man is gesturing with his hands as he speaks, indicating that he is explaining something about the graph. There are logos for Microsoft, AWS, and SMOF displayed on the screen behind him.
Audio Transcript:
And then boom, you can see here around step 40, it just, like, takes off. And it's like, okay, we figured out how to solve this. And this engineer on our team, I'm going to call out, he's here at the conference, yeah. He's great, you should talk to him after. But he was like, hey, we solved it. Like, we got NYT Connections. And it's like, okay, the graph looks good. Let's look at what it's actually doing. What it was actually doing is it had figured out there was a bug
Video Content:
A man stands at a podium, gesturing towards a large screen displaying a graph titled "w/forward." The graph shows a steep upward trend, starting from a low value and rapidly increasing. The man appears to be explaining the data, possibly discussing the implications of the graph's results. The background includes logos for Microsoft, AWS, and SMOF.
Audio Transcript:
in how we wrote the verification. And if it just put every single word in every single category, it was able to get a perfect score, because we weren't verifying that there were, in fact, only four words in each category. So this is another example. This is a fun one. So I was training a model to produce really good titles for Hacker News, titles that would get a thing upvoted.
Video Content:
A man in an orange shirt is giving a presentation on a stage. He is standing in front of a large screen displaying a slide titled "#likeaboss". The slide contains a list of tasks and a graph showing the performance of a machine learning model. The man is speaking and gesturing towards the screen as he explains the content. The background includes logos for Microsoft, AWS, and SMOJ.
Audio Transcript:
So I had this reward model I'd trained on existing hacker news articles and how many upvotes they got, and I was trying to train this model to produce new titles. And it was working really well for a while. You can see, and sort of subjectively as well, I looked at a bunch of these titles generated, and for these first like thousand steps or so, it was actually learning things that I was like, okay, as someone who spends way too much time on hacker news, yeah, that does look like a good title.
Video Content:
A man in an orange shirt is standing at a podium, speaking into a microphone. He is presenting on a screen behind him, which displays a graph titled "v2d forward/reverse". The graph shows a line that starts at a low point and gradually increases over time. The man gestures with his hands as he speaks, emphasizing points about the graph. The background includes logos for Microsoft, AWS, and SMOF.
Audio Transcript:
You're doing a good job. And then you can see around step 1,200 here, it just like jumps a bunch, right? It's like, okay, it clearly figured something out. I don't know what it figured out, but we should look at that. And so what it turns out what the model had figured out was that it could just completely ignore the content of the post and generate the same title for every single one of them,
Video Content:
A man stands in front of a large screen displaying a graph titled "v2d forward reward". The graph shows a line chart with two lines: one labeled "sample_exp" and another labeled "v2d_exp". The "sample_exp" line starts at a low value and gradually increases over time, while the "v2d_exp" line starts at a higher value and also increases but at a slower rate. The man is speaking and gesturing towards the screen, likely explaining the data presented.
Audio Transcript:
and that would maximize its score. So it generated this title: "Google lays off 80% of workforce." Literally every single article, this is what it labeled it as. And the reward model was like, yes, that is going to get upvoted on Hacker News for sure, which it probably would, to be fair. So anyway, the way we solved this, what we found is that it's really important to watch out for this.
Video Content:
A speaker is giving a presentation on a stage. The background features a large screen displaying text and graphs related to AI and machine learning. The text on the screen includes phrases like "> be me > train agent to title HN posts > agent titles every post " and "Google lays off 80% of workforce [2023]". The speaker is using a laptop and gesturing with his hands as he speaks. The logos of Microsoft, AWS, and SMOF are visible in the bottom corners of the screen.
Audio Transcript:
Solving it typically involves modifying your reward function in some way to penalize things like that. So in the second example I talked about, it was actually quite an easy fix once we identified it, which was to just add an extra LLM-as-judge that looked at the title, looked at the content, and said, hey, is there anything in the title that's not supported by the content? And we added that on and it actually worked great.
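A sketch of that fix. The prompt and judge model name are placeholders, and `upvote_score` stands for the output of the learned upvote predictor; the idea is simply to gate the reward on a groundedness check.

```python
from openai import OpenAI

client = OpenAI()  # any chat-completions-compatible judge endpoint

def title_is_grounded(title: str, article: str, judge_model: str = "gpt-4.1-mini") -> bool:
    """Return False if the title claims anything not supported by the article body."""
    verdict = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                f"Article:\n{article}\n\nProposed title: {title}\n\n"
                "Is there anything in the title that is not supported by the article? "
                "Reply with exactly YES or NO."
            ),
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("NO")

def title_reward(title: str, article: str, upvote_score: float) -> float:
    """Combine the learned upvote predictor with the groundedness gate."""
    return upvote_score if title_is_grounded(title, article) else 0.0
```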
Video Content:
Reward Hacking Identify: Look at your rollouts! Solve: extra reward terms to penalize hack
Audio Transcript:
The important thing here is you want to be looking at your rollouts, not just blindly trusting the reward function, figuring out what's actually happening. Anyway, so that's it. I'm almost out of time, so I'm going to throw up a couple of QR codes for you. Everything in this presentation, and there's a much longer write-up I have of this whole project: it includes the code, it includes the artifacts, the data sets along the way.
Video Content:
Reward Hacking Identify: Look at your rollouts! Solve: extra reward terms to penalize hack AIE ART·E Code + Technical Details Kyle Corbitt | @corbitt | kyle@openpipe.ai Microsoft AWS SMX
Audio Transcript:
You can check that out there. One more thing is we have a Discord that's open. We have an open source project for training reinforcement learning models. We have a Discord you can go to. If you're interested in this kind of thing, we're all in there, we answer questions. There's lots of people from the community trying to do these things. So if you're interested in building things with this, feel free to join it. And yeah, happy to chat there. And yes, thank you everyone, appreciate your time. All right.
Video Content:
A speaker is presenting at an event, likely a conference or seminar, discussing topics related to artificial intelligence (AI) and machine learning. The presentation appears to focus on reinforcement learning (RL) and its applications in AI. The speaker is standing in front of a screen displaying text and logos, including "AIE," "ART Discord: Talk RL," and mentions of Microsoft, AWS, and SMOO. There is also a QR code visible on the screen. The speaker is wearing a red shirt and is gesturing as they speak.
Audio Transcript:
And since I am also, in my separate hat, the moderator for this track, it is my great pleasure to welcome to the stage Nathan Lambert. Everyone give a round of applause. Nathan, for those of you, I'm sure many of these people know who Nathan is, he's probably the strongest voice in sort of the open ecosystem talking about post-training
Video Content:
Reasoning + RL Sponsored by OpenPipe AI Engineer World's Fair, June 2025 2:00 PM - 2:20 PM How to Train Your Agent: Building Reliable Agents with RL Kyle Corbitt (OpenPipe) 2:20 PM - 2:40 PM What Reinforcement Learning with Verifiable Rewards Changed Nathan Lambert (Ai2) 2:40 PM - 2:50 PM Towards Verified Superintelligence Christian Szegedy
Audio Transcript:
generally and also reinforcement learning. He's got a great newsletter he does, as well as he puts that out as a podcast and lots of content on X, but anyway, very excited to see what you'll say. Oh, and also runs a bunch of projects at AI2. They build probably the best fully open models. So anyway, and I'm sure we'll talk more.
Video Content:
Reasoning + RL Sponsored by: OpenPipe Thursday, June 5, 2025 2:00 PM – 2:20 PM How to Train Your Agent: Building Reliable Agents with RL 2:20 PM – 2:40 PM What Reinforcement Learning with Verifiable Rewards Changed 2:40 PM – 2:50 PM Towards Verified Superintelligence Christian Szegedy AIE A taxonomy for next-generation reasoning models 6 months into RLVR – where we are and where we're going Nathan Lambert Allen Institute for AI Interconnects.ai AI Engineer World's Fair 5 June 2025 https://www.youtube.com/watch?v=Qc98t7zGKgI&list=PLXJvFkxuqRZwWjyPm0iLlDh0p0d9f0aM&index=1
Audio Transcript:
I really came to this thinking about trying to reflect on six months into this, like, reinforcement learning with verifiable rewards, post-o1, post-DeepSeek. And I think that a lot of this stuff is somewhat boring, because everybody has a reasoning model. We all know the basics of: you can scale RL at training time and the numbers will go up,
Video Content:
A taxonomy for next-generation reasoning models 6 months into RLVR – where we are and where we're going Nathan Lambert AI Engineer World's Fair, June 2025 Lambert: Next-Generation Reasoning Models 1/4
Audio Transcript:
and that's deeply correlated with being able to then do this inference-time scaling. But really, in AI right now, there's a lot of people who are up to speed, but the crucial question is, like, where are things going to go, and how do you skate to where the puck is going? So a lot of this talk is really me trying to process where this is going, besides getting high benchmark scores using 10,000 tokens per answer,
Video Content:
Everybody has reasoning models OpenAI o1 / o3 DeepSeek R1 Gemini 2.5 Claude 4 w/ Extended Thinking Grok 3 Qwen 3 What are we getting out of them besides high benchmarks? What are the next frontiers of training them?
Audio Transcript:
and like what do we need to do to actually train these models and what are the things that OpenAI, etc. are probably already doing, but it's increasingly hard to get that signal out of them. So if we look at this, like reasoning is really also unlocking really new language model applications. I think this is the same search query,
Video Content:
Everybody has reasoning models OpenAI v3.5 DeepSokel R1 Gomini 2.5 Claude 4 w/ Extended Thinking Grok 3 Queen 3 What are we getting out of them besides high benchmarks? What are the next frontiers of training them? Learner's Open Generative Reading Workshop 8th Edition Microsoft AWS SMOQ
Audio Transcript:
which is like, as an RL researcher, I need to find this all the time. I forget that it's called CoastRunners, and you Google, like, over-optimization 20 times to find it. But I tried asking o3 and it literally gave me the download link directly, so I didn't even have to do anything. And that's a very unusual use case to just pop out of this reasoning training, where math and code was the real thing to start with.
Video Content:
Reasoning is starting to unlock new LM applications Asking o3 to find a reference that took me 10 minutes to Google the week before. One-shotted it, with a nice download link, in 56 seconds.
Audio Transcript:
And o3 is great. It's the model that I use the most for finding information. And this just really is the signal that I have that a lot of new interesting things are coming down the pipe. I would say it's starting to unlock a lot of new language model applications, and I use some of these. So this is a screenshot of Deep Research. It's great. You can use it in really creative ways, like prompt it to look at your
Video Content:
Reasoning is starting to unlock new LM applications Asking GPT-3 to find a reference that took me 10 minutes to Google the previous day. One shot it off with nice download links (in 56 seconds). Lambert: This is Generation 2023, Microsoft.
Audio Transcript:
website and find typos or look at only the material on your website and things like this. It's actually more steerable than you may expect. Claude code, which I describe as just the vibes are very good. It's fun. I'm not a serious software engineer, so I don't use it on hard things, but I use it for fun things because I can. I can put the company API key in and just kind of mess around, like helping me build
Video Content:
Reasoning is starting to unlock new LM applications (slide showing screenshots of Deep Research and Claude Code examples)
Audio Transcript:
my, the website for this book that I wrote online. And then there's the really serious things, which are like Codex and these fully autonomous agents that are starting to come. If you play with it, it's obvious that the form factor is going to be able to work. I'm sure there are people that are getting a lot of value out of it right now. I think for ML tasks, it's like, there's no GPUs in it right now. And if you are dealing with open models,
Video Content:
Reasoning is starting to unlock new LLM applications (slide showing examples such as Deep Research, Claude Code, and Codex)
Audio Transcript:
it's like they just added internet, so it wasn't going to be able to go back and forth and look at, like, Hugging Face configs or something, and all these headaches that you don't want to deal with. But in six months, like, all of these things are going to be stuff you should be using on a day-to-day basis. So this is all downstream of this kind of step change in performance from reasoning models. And then this is kind of like another plot that's been talked about.
Video Content:
Reasoning is starting to unlock new LM applications Researcher: Every 3-4 weeks, we find new applications. Cloudy Code: We can use the website to find new applications. Lambertian: Our team is working on finding the next one. Microsoft: AWS, Microsoft, and AWS are collaborating on this project.
Audio Transcript:
And when I look at this, it's like, through 2024, if we look at, like, GPT-4o and things like that, they really were saturating then, and then there's these new Sonnet models and o1, which really helped push out the frontier in time horizon. So the y-axis here is roughly how long a task the models can complete, measured in human time, which is kind of a weird way to measure it, because things will get faster, but it's going to keep going, and this reasoning training is the
Video Content:
Autonomy: A defining trend of new reasoning products The length of tasks models can complete is doubling every 7 months
Audio Transcript:
technique that was kind of unlocked in order to figure out how to push the limits. And when you look at things like this, it's not that just we're like on a path determined from AI and more gains are going to come. It's really like we have to think about what the models need to be able to do in order to keep pushing out these frontiers. So there's a lot of human effort that goes into continuing the trends of AI progress.
Video Content:
Autonomy: A defining trend of new reasoning products The length of tasks we can do doubling every 7 months Microsoft Research Lukasen: Open Generalization Research Project 9 Microsoft AWS SMOKY
Audio Transcript:
So it's like gains aren't free and I'm thinking that a lot of planning and kind of thinking about training in a bit of a different way beyond just reasoning skills is going to be what helps push this and enable these language modeling applications and products that are kind of in their early stages to really shine. So this is a core question that I'm thinking about,
Audio Transcript:
is, like, what do I have to do to come up with the research plan to train reasoning models that can work autonomously, and really have meaningful ideas for what planning would be. So I kind of came up with a taxonomy that has a few different, what I call, traits within it. The first one is skills, which we've pretty much already done. Skills are getting really good at math and code; inference-time scaling was useful to getting there, but they
Video Content:
A speaker is standing at a podium, delivering a presentation. The background is dark with a logo on the left side that reads "AIE". The speaker is wearing a suit and is gesturing with his hands as he speaks. The screen behind him displays text that reads: "How do we train a reasoning model that can work autonomously 10X longer?" There are logos for Microsoft, AWS, and SMOOCH on the screen. The speaker appears to be explaining the challenges and solutions for training reasoning models that can operate independently for extended periods.
Audio Transcript:
kind of become more researchy over time. I think for products, calibration is going to be crucial, which is, like, these models overthink like crazy. So they need to be able to kind of have some calibration of how many output tokens are used relative to the difficulty of the problem. And this will kind of become more important when we're spending more on each task that we're planning. And then the last two are subsets of planning that I'm thinking about, and happy to take feedback on this taxonomy, but, like, strategy,
Video Content:
What reasoning models for independent agents need 1. Skills: The ability to solve self-contained problems. 2. Calibration: The ability to understand the difficulty of a problem and not overthink. 3. Strategy: The ability to choose the right high-level plan.
Audio Transcript:
which is just going in the right direction and knowing different things that you can try. It's really hard for these language models to really change course where they can backtrack a little bit, but restarting their plan is hard. And then as tasks become very hard, we need to do abstraction, which is like, the model has to choose on its own how to break down a problem into different
Video Content:
What reasoning models for independent agents need 1. Skills: The ability to solve self-contained problems. 2. Calibration: The ability to understand the difficulty of a problem and not overthink. 3. Strategy: The ability to choose the right high-level plan. 4. Abstraction: The ability to break down a strategy into solvable chunks.
Audio Transcript:
things that it can do on its own. I think right now humans would often do this, but if we want language models to do very hard things, they have to make a plan that has sub tasks that are actually tractable or calls in a bigger model to do that for it. But these are things that are, the models aren't going to do natively. Natively, they're trying to, like, doing math problem solving.
Video Content:
What reasoning models for independent agents need 1. Skills: The ability to solve self-contained problems. 2. Calibration: The ability to understand the difficulty of a problem and not overthink. 3. Strategy: The ability to choose the right high-level plan. 4. Abstraction: The ability to break down a strategy into solvable chunks.
Audio Transcript:
Like, that doesn't have clear abstraction on like this task I can do and with this additional tool and all these things. So this is a new thing that we're going to have to add. So to kind of summarize, it's like we have skills, we have research for calibration and I'll highlight some of it. But like planning is a new frontier where people are talking about it and we really need to think about like how we will actually put this into the models.
Video Content:
What reasoning models for independent agents need 1. Skills: The ability to solve self-contained problems. 2. Calibration: The ability to understand the difficulty of a problem and not overthink. 3. Strategy: The ability to choose the right high-level plan. 4. Abstraction: The ability to break down a strategy into solvable chunks. We largely have this ability A lot of research is underway What is referred to as 'planning'
Audio Transcript:
So, to just put this up on the slide, what we call reinforcement learning with verifiable rewards looks very simple. I think a lot of RL in language models, especially before you get into this multi-turn setting, has been: you take prompts, the agent creates a completion to the prompt, and then you score the completions.
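For a verifiable domain like math, that scoring step needs no judge model at all. Below is a minimal sketch using the common convention of a \boxed{} final answer; this is one typical setup, not necessarily what any particular lab does.

```python
import re

def extract_boxed(completion: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a reasoning trace."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """RLVR-style reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Example
trace = "We need 2 + 3. Adding the two numbers gives \\boxed{5}."
print(verifiable_reward(trace, "5"))  # 1.0
```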
Video Content:
What reasoning models for independent agents need 1. Skills: The ability to solve self-contained problems. 2. Calibration: The ability to understand the difficulty of a problem and not overthink. 3. Strategy: The ability to choose the right high-level plan. 4. Abstraction: The ability to break down a strategy into suitable chunks. What is referred to as 'planning'.
Audio Transcript:
And with those scored completions, you can update the weights to the model. It's been single turn, it's been very simple. I'll have to update this diagram for multi-turn and tools and it makes it a little bit more complex. But the core of it is just a language model generates completions and gets feedback on it. And it's good to just take time to look at these skills. These are a collection of evals and we can look at like
Video Content:
Reinforcement learning with verifiable rewards (RLVR): diagram showing training-data prompts flowing to the agent/policy π(θ), whose completions are scored against a ground-truth reward. Followed by a slide titled "Skills: The foundation of reasoning" with benchmark charts (e.g., AIME, GPQA) comparing GPT-4o to newer reasoning models.
Audio Transcript:
where GPT-4o was, and these were the hardest evals that have existed and were called the frontier of AI. And if we look at the o1 improvements and the o3 improvements in quick succession, these are really incredible eval gains that are mostly just from adding this new type of training in. And the core of this argument is that we need to do something similar if we want planning
Audio Transcript:
to work. So I would say that a lot of the planning tasks look mostly like Humanity's Last Exam and AIME did just after adding this reasoning skill, and we need to figure out what other types of things these models are going to be able to do. So it's like, this list of reasoning abilities, these kind of low-level skills, is going to continue to go up. I think the most recent one, if you look at recent DeepSeek models or recent Claude models, is really this tool use being added in.
Audio Transcript:
And that's going to build more models like o3. So using o3 just feels very different, because it is this kind of combination of tool use with reasoning, and it's obviously good at math and code. But I think these kind of low-level skills that we expect from reasoning training, we're going to keep getting more of them as we figure out what is useful. I think an abstraction for the kind
Video Content:
Skills: The foundation of reasoning Reasoning models unlocked with huge increase in benchmark scores: • Inference-time scaling • Better coding & math performance • More reliable tool use (a growing list) Charts showing GPT-4o to o1 and o1 to o3 eval improvements
Audio Transcript:
of agenticness on top of tool use is going to be very nice, but it's hard to measure. And people mostly say that Claude is the best at that, but it's not yet super established on how we measure it or communicate it across different models. And then that's where I get into the fun, interesting things. I think it's hard for us because calibration
Video Content:
Skills: The foundation of reasoning Reasoning models unlocked with huge increase in benchmark scores: • Inference-time scaling • Better coding & math performance • More reliable tool use (a growing list) Calibration: Reasoners that try as hard as they need to Effort is currently offloaded to the user: • Model selection between reasoners or traditional instruction models • Reasoning on/off buttons • Reasoning effort selectors Soon the model will know how hard to think.
Audio Transcript:
is passed to the user, which is, we have all sorts of things like model selectors if you're a ChatGPT user. Claude has reasoning on/off with this extended thinking, and Gemini has something similar, and there's these reasoning effort selectors in the API. And this is really rough on the user side of things, and making it so the model knows this will just really make
Video Content:
Calibration: Reasoners that try as hard as they need to Effort is currently offloaded to the user: • Model selection between reasoners or traditional instruction models • Reasoning on/off buttons • Reasoning effort selectors Soon the model will know how hard to think.
Audio Transcript:
it so it's easier to find the right model for the job, and just kind of, your overspent tokens for no reason will go down a lot. It's kind of obvious to want it, and it just becomes a bigger problem the longer we don't have this. Some examples from when overthinking was kind of identified as a problem: the left half of this is, you can ask a language model,
Video Content:
Calibration: Reasoners that try as hard as they need to Effort is currently offloaded to the user: • Model selection between reasoners or traditional instruction models • Reasoning on/off buttons • Reasoning effort selectors Soon the model will know how hard to think.
Audio Transcript:
like, what is 2 plus 3? And you can see these reasoning models use hundreds to a thousand tokens for something that could realistically be, like, one token as an output. And then on the right is a kind of comparison of sequence lengths from a standard non-RL-trained instruction model versus the QwQ thinking model. And you really can gain this, like, 10 to 100x in token spend when you shift to a reasoning model. And if you do that
Audio Transcript:
in a way that is wasteful, it's just going to really load your infrastructure and cost. As a user, I don't want to wait minutes for an easy question, and I don't want to have to switch models or providers to deal with that. So I think one of the things that comes once we start to have this calibration is this kind of strategy idea. And on the right, I went to the, I think it's the Epoch AI website, took a question,
Video Content:
Calibration: Reasoners that try as hard as they need to Still, overthinking is a major problem. Strategy: Reasoning models that go in the right direction There's a large gap between reasoning models and agents built with reasoning models today. Reasoning models themselves do little to plan. Planning agents get prompted to plan. Over time, picking the right plan needs to become a core skill.
Audio Transcript:
one of their example questions from FrontierMath. And, like, this new DeepSeek R1-0528 model doesn't do any semblance of planning when it starts. You ask it a math problem, and it's just like, okay, the first thing I'm going to do is I need to construct a polynomial. It just goes right in, and it doesn't do anything like trying to sketch the problem before things. And this is going to probably
Video Content:
Strategy: Reasoning models that go in the right direction There is a large gap between reasoning models and agents built with reasoning models today. Reasoning models themselves do little planning. Planning agents need to promote their plans. Over time, picking the right agent needs to fall under a core skill.
Audio Transcript:
output 10,000 to 40,000 tokens, and if it's going to need to do another 10x there, it's just like, if that's all in the wrong direction, that's multiple dollars of spend and a lot of latency that's just totally useless. And most of these applications are set up to expect a latency between 1 and 30 minutes, so it's like there is just a timeout they are fighting. So either going in the wrong direction or just thinking way too hard about a sub-problem
Video Content:
Strategy: Reasoning models that go in the right direction There is a large gap between reasoning models and agents built with reasoning models today. Reasoning models themselves do little planning. Planning agents are prompted to plan. Over time, picking the right one needs to fall nature and core skill.
Audio Transcript:
is just going to make it so the user leaves. So right now, these models, as I said, do very little planning on their own. But as we look at these applications, they're very likely prompted to plan, which is, like, the beginning of Deep Research and Claude Code. And we kind of have to make it so that is model-native rather than something that we do manually. And then once we look at this plan, there's all these implementation details across something like Deep Research or Codex, which is like, how do I manage memory?
Video Content:
There is a large gap between reasoning models and agents built with reasoning models. Right now, reasoning models themselves do little planning. Reasoning agents get prompted to plan. Over time, picking the right agent needs to fall under a core skill. Abstraction: Reasoning models that break down a task. Questions for designing an LLM that orchestrates its own plans: How should an LLM manage its memory? How can an LLM avoid repeating the same mistakes? How can an LLM self-orchestrate by breaking down a plan into parts to solve on its own? How does an LLM offload more thinking (e.g., parallel computing) to the hardest sub-tasks? How can an LLM work on multiple sub-tasks in parallel?
Audio Transcript:
So we have Claude Code, which compresses its memory when it fills up its context window. We don't know if that's the optimal way for every application. We want to avoid repeating the same mistakes. Greg was talking about playing Pokémon earlier, which is a great example of that. We want to have tractable parts. We want to offload thinking if we have a really challenging part.
Video Content:
Abstraction: Reasoning models that break down a task Questions for designing an LLM that orchestrates its own plans: • How should an LLM manage its memory? • How can an LLM avoid repeating the same mistakes? • How can an LLM make sure it breaks down a plan into parts it can solve on its own? • How can an LLM offload more thinking (e.g., parallel computing) to the hardest sub-tasks? • How can an LLM work on multiple sub-tasks in parallel?
Audio Transcript:
So I'll talk about parallel compute a little bit later. It's a way to kind of boost through harder things. And really, we want language models to call multiple other models in parallel. So right now people are spinning up tmux and launching Claude Code in 10 windows to do this themselves, but there's no reason a language model can't be able to do that.
Video Content:
Abstraction: Reasoning models that break down a task Questions for designing an LLM that orchestrates its own plans: • How should an LLM manage its memory? • How can an LLM avoid repeating the same mistakes? • How can an LLM make sure it breaks down a plan into parts it can solve on its own? • How can an LLM offload more thinking (e.g., parallel computing) to the hardest sub-tasks? • How can an LLM work on multiple sub-tasks in parallel?
Audio Transcript:
It just needs to know the right way to approach it. And as I started with, this idea of, kind of, we need to make an effort to add new capabilities into language models. When I think about this kind of story of Q* that became Strawberry, that became o1,
Video Content:
Abstraction: Reasoning models that break down a task Questions for designing an LLM that orchestrates its own plans: • How should an LLM manage its memory? • How can an LLM avoid repeating the same mistakes? • How can an LLM make sure it breaks down a plan into parts it can solve on its own? Bootstrapping training data for planning Q* / Strawberry / o1 took on the order of 12 months due to the need to create training data to seed models with reasoning skills (backtracking, verification, etc.). Planning will go through a similar arc, but it will be easier to audit. Finally, RL can reinforce useful planning styles.
Audio Transcript:
the reason that it was in the news for so long and was such a big deal is like it was a major effort for open AI spending like 12, 18 months building these initial reasoning traces that they could then train an initial model on that has some of these behaviors. So it took a lot of human data to get things like backtracking and verification
Video Content:
Bootstrapping training data for planning Q* / Strawberry / o1 took on the order of 12 months due to the need to create training data to seed models with reasoning skills (backtracking, verification, etc.). Planning will go through a similar arc, but it will be easier to audit. Finally, RL can reinforce useful planning styles.
Audio Transcript:
to be reliable in their models. And we need to go through a similar arc with planning, but with planning, the kind of outputs that we're going to train on are much more intuitive than something like reasoning. I think if I were to ask you to sit down and write a 10,000-token reasoning trace with backtracking, it's like, you can't really do this. But a lot of expert people can write a five-to-ten-step plan that is very good, or check the work of Gemini or OpenAI when asked
Video Content:
Bootstrapping training data for planning (same slide as above).
Audio Transcript:
to write an initial plan. So I'm a lot more optimistic on being able to hill-climb on this. And then it goes through the same path where once you have initial data you can do some SFT, and then the hard question is whether RL on even bigger tasks can reinforce these planning styles. On the right I added kind of a hypothetical, which is that we already have thinking tokens before answer tokens, and there's
Video Content:
Bootstrapping training data for planning (same slide as above).
Audio Transcript:
no reason we can't apply more structure to our models to really make them plan out their answer before they think. So to give a bit more depth on this idea of skill versus planning: if we go back to this example, I would say that o3 is extremely skilled at search.
Video Content:
Bootstrapping training data for planning (same slide as above).
Audio Transcript:
Being able to find a piece of niche information that researchers in a field know of, but can't quite remember the exact search words for, that is an incredible skill. But when you try to put this into something like Deep Research, this lack of planning means that sometimes you get a masterpiece and sometimes you get a dud. And as these models get better at planning, they'll just be more thorough and reliable in getting
Video Content:
Revisiting this example: very skillful, lacking planning. Search is a skill that has taken a massive leap; synthesizing complex information and comparisons requires better planning.
Audio Transcript:
the kind of coverage that you want. It's crazy that we have models that can do this search, but if you ask one to recommend some sort of electronics purchase or something, it's really hard to trust, because they don't just know how to pull in the right information and how hard they should try to get all that coverage.
Video Content:
Revisiting this example (same slide as above).
Audio Transcript:
So to kind of summarize, these are the four things I presented. You can obviously add more to these; you could call some of it a mix of strategy and abstraction, and you could call what I was describing context management in many ways. But really, you just want to have things like this so that you can break down the training problem
Video Content:
Revisiting this example (same slide as above). What reasoning models for independent agents need: 1. Skills: the ability to solve self-contained problems. 2. Calibration: the ability to understand the difficulty of a problem and not overthink. 3. Strategy: the ability to choose the right high-level plan. 4. Abstraction: the ability to break down a strategy into solvable chunks.
Audio Transcript:
and think about data acquisition or new algorithmic methods for each of these tasks. And I mentioned parallel compute because I think this is an interesting one: if you use o1 Pro, it's still been one of the best and most robust models for quite some time, and I've been very excited for o3 Pro. But it doesn't solve problems in the same way as traditional inference-time scaling, where
Video Content:
What reasoning models for independent agents need 1. Skills: The ability to solve self-contained problems. 2. Calibration: The ability to understand the difficulty of a problem and not overthink. 3. Strategy: The ability to choose the right high-level plan. 4. Abstraction: The ability to break down a strategy into solvable chunks.
Audio Transcript:
inference-time scaling just made a bunch of things that didn't work go from zero to one. Whereas this parallel compute really makes things more robust, it just makes them nicer. And it seems like this kind of RL training is something that can encourage exploration, and then if you apply more compute in parallel, it feels like something that is kind of exploiting and getting a really well-crafted answer. So there's a time when you want that, but it doesn't solve every problem.
Video Content:
Parallel compute as amplification of reasoning abilities: parallel compute and better verifiers increase the slope of inference-time scaling and in practice improve the robustness of answers.
Audio Transcript:
And to transition into the end of this talk: there have been a lot of talks today about the things that you can do with RL. And there's obviously a lot of talk on the ground about what is called continual learning, and whether we'll just be continually using very long-horizon RL tasks to update a model and diminish the need for pre-training. And there are a lot of data points suggesting we're closer to that in many ways.
Video Content:
A speaker is standing at a podium, presenting on a topic related to language model development. The background is dark, and there are sponsor logos, including Microsoft and AWS, on the screen. The speaker is wearing a light-colored shirt and is gesturing with his hands as he speaks.
Audio Transcript:
I think continual learning has a big algorithmic bottleneck, whereas just scaling up RL further is very tractable and something that is happening. So if people were to ask me what I'm working on at AI2 and what I'm thinking about, this is my rough summary of what I think a research plan looks like to train a reasoning model
Video Content:
A person is standing at a podium, speaking into a microphone. The background is dark, and there is a logo on the left side of the screen. The text on the screen reads "AIE" and "RL as the focal point of language model development." There is also a mention of "What I'm thinking about for scaling RL." The speaker appears to be discussing the challenges and considerations involved in scaling reinforcement learning (RL) for language model development.
Audio Transcript:
without all the between-the-lines details. Step one is you just get a lot of questions that have verified answers across a wide variety of domains. Most of these will be math and code, because that's what is out there. And then two, if you look at all these recipe papers, they have a step where they filter the questions based on their difficulty with respect to your base model.
Video Content:
What I'm thinking about for scaling RL: 1. Get a big, multi-domain dataset of questions + answers. 2. Difficulty filtering: not too easy, not too hard for the starting checkpoint. 3. Run RL for a long time. 4. RL tricks for the last few points (overlong filtering, two-sided clipping, resetting the reference model, Dr. GRPO advantage estimation, ...).
Audio Transcript:
So if a question is solved zero out of 100 times by your base model, or 100 out of 100, you don't want questions that look like that, because you're not only wasting compute, you're also making the gradients in your RL updates a bit noisier. And once you do that, you just want to run a stable RL run that goes through all these questions and keeps the numbers going up.
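As a concrete illustration of that filtering step (not the exact recipe from any particular paper), here is a minimal sketch assuming you have already sampled the base model on each question and computed a pass rate:

```python
# Illustrative sketch of difficulty filtering for RL data: drop questions the
# base model always or never solves, since both give no useful advantage
# signal and only add noise to the updates. The thresholds are assumptions.
def filter_by_difficulty(questions, pass_rates, low=0.0, high=1.0):
    """Keep questions whose base-model pass rate is strictly between the cut-offs.

    questions  -- list of question/answer records
    pass_rates -- floats in [0, 1]; e.g. solved 37 of 100 samples -> 0.37
    """
    return [q for q, rate in zip(questions, pass_rates) if low < rate < high]

# With low=0.0 and high=1.0, a question solved 0/100 or 100/100 times is
# removed, while anything in between stays in the training pool.
```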
Video Content:
What I'm thinking about for scaling RL (same slide as above).
Audio Transcript:
And that's the core of it: really, stable infrastructure and data. Then you can tap into all these research papers that tell you to do methods like overlong filtering, or different clipping, or resetting the reference model, and that will give you a few percentage points on top, where really it's just data and stable infrastructure. And this kind of leads to the provocation, which is: what if we renamed post-training
Video Content:
What I'm thinking about for scaling RL (same slide as above).
Audio Transcript:
as training? OpenAI's o1 was something like 1% of compute in post-training relative to pre-training, and they've already said that o3 increased that by 10x. So if the numbers started at 1%, you're very quickly getting to what you might see as parity in compute, in terms of GPU hours, between pre-training and post-training,
Video Content:
From "post-training" to "training" As we have already trained on more or less the whole internet, interest in RL training on more and more domains will grow. How much further will the compute used in "post-training" grow? Will "continual learning" work and further reduce pretraining?
Audio Transcript:
which, if you were to take anybody back a year ago before o1, would seem pretty unfathomable. And one of the fun data points for this is the DeepSeek V3 paper, where you can kind of watch DeepSeek's transition into becoming more serious about post-training. In the original DeepSeek V3 paper,
Video Content:
From "post-training" to "training" (as above). DeepSeek V3 used 0.18% of compute on post-training. DeepSeek V3 pretraining took <2 months. DeepSeek R1 RL training took "a few weeks". DeepSeek R1 could already be >30% of compute in GPU hours. Scaling RL has just begun. Training costs of DeepSeek V3, assuming a rental price of $2 per GPU hour.
Audio Transcript:
they used 0.18% of compute on post-training in GPU hours. And they said their pre-training took about two months, and there was a deleted tweet from one of their RL researchers that said the R1 training took a few weeks. So if you make a few very strong, probably not completely accurate assumptions, like that RL was on the whole cluster, that would already be 10 to 20% of their compute, I think.
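A back-of-envelope version of that estimate, using only the durations quoted here and the speaker's own (admittedly strong) assumption that R1's RL ran on the same cluster as pretraining; reading "a few weeks" as one to two weeks reproduces the 10 to 20% range:

```python
# Rough arithmetic only; the week counts are assumptions, not reported figures.
pretrain_weeks = 8                      # "pre-training took about two months"
for rl_weeks in (1, 2):                 # "R1 training took a few weeks"
    share = rl_weeks / (pretrain_weeks + rl_weeks)
    print(f"RL for {rl_weeks} week(s): ~{share:.0%} of total GPU hours")
# Prints roughly 11% and 20%, matching the "10 to 20%" ballpark.
```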
Video Content:
From "post-training" to "training" (same slide as above, with a table of estimated GPU hours and rental costs for DeepSeek V3 and R1).
Audio Transcript:
There are specifics for DeepSeek, like their pre-training efficiency probably being way better than their RL code, and things like that. But scaling RL is a very real thing if you look at frontier labs and the types of tasks that people want to solve, these long-term plans. So it's good to kind of embrace what you think these models will be able to do: break down tasks on their own and solve some of them.
Video Content:
From "post-training" to "training" (same slide as above).
Audio Transcript:
So thanks for having me, and let me know what you think. All right. For our final talk of the track for the day, we have the great pleasure of hearing from Christian Szegedy.
Video Content:
From "post-training" to "training" (same slide as above).
Audio Transcript:
Christian was a researcher at Google for a long time, became a co-founder of xAI, where he was for a couple of years, and is now working on a new startup. He's going to be talking to us today about the path towards verified superintelligence, which I'm very curious about myself. Unfortunately, Christian did have a last-minute conflict, so he's not able to be here in person.
Video Content:
Reasoning + RL is a series of talks sponsored by OpenPipe, which took place at the AI Engineer World's Fair in June 2025. The schedule included three talks: "How to Train Your Agent: Building Reliable Agents with RL" by Kyle Corbitt, "What Reinforcement Learning with Verifiable Rewards Changed" by Nathan Lambert, and "Towards Verified Superintelligence" by Christian Szegedy, the last of which was presented remotely.
Audio Transcript:
He will be here on Zoom, so we should see him come up in just a moment and be able to see him talk. So let's welcome Christian. Thank you. I see people in the AV corner just looking very... okay, we're good. Thank you. I feel like we've all been here with Zoom before.
Video Content:
A speaker is standing at a podium, delivering a presentation. The background is dark, and sponsor logos, including Microsoft and AWS, are prominently displayed on the screen behind the speaker. The speaker appears to be addressing an audience at a conference related to AI and technology.
Audio Transcript:
So we'll give Christian a moment. Sorry, I just had to give some permissions to Zoom; somehow it didn't have permission to share my video. We can empathize; I think this has happened to everyone on some meeting. Yeah, okay, okay, okay.
Video Content:
World's Fair
Audio Transcript:
Okay, cool. So can you see my presentation slide? Yes, we can see it. So yeah, I'm Christian Szegedy, and I just joined Morph as chief scientist.
Video Content:
World's Fair
Audio Transcript:
So I can tell you a bit about why I believe in what we are doing, and why I think that technology will allow us to take AI to the next level. But first let me give some overview of how I view the progress of AI in the past 14 years.
Video Content:
World's Fair
Audio Transcript:
So basically, when I started doing deep learning, that was around 2011, and classification was still a bit of a question mark; nobody really knew whether it would ever take off or not. Google Brain was like four people. And so when AlexNet came out,
Video Content:
World's Fair
Audio Transcript:
so maybe I just put up a few examples of what I believe these trends were, like what is a good example of each of those trends. Then structured prediction became a big research topic. Object detection was a large part of it, and people got excited because it was an important application domain. So then we learned, around 2015, that yeah, we can do object detection, we can do almost any
Video Content:
Past Trends in AI, 2012-2026 (timeline slide): Supervised Classification; Object Detection / Structured Prediction; Deep RL via self-play; Unsupervised Pretraining; Supervised Q&A.
Audio Transcript:
structure that can be output by neural networks. Then RL became the big trend that most people were most excited about, deep RL via self-play. AlphaGo is probably the most important landmark in that domain, but there was also Dota by OpenAI.
Video Content:
Past Trends in AI (same slide as above).
Audio Transcript:
But in the meantime transformers came out, and then unsupervised pretraining started to become the biggest driving force in AI. It's a special case of structured prediction in some sense, but it was taken to the next level by just scaling up compute and data to insane amounts, like training on almost all internet-accessible data, and that's where we ended up. And thinking models came out like two years ago; they are relatively new, and I think the big idea behind them is that you have to create this
Video Content:
Past Trends in AI (same slide as above, annotated with landmark examples such as ImageNet and AlphaZero).
Audio Transcript:
side information, that is, the chain of thought. And that is basically the lesson from supervised RL: reinforcement learning on large language models is really hard, but if you are reinforcing the chain of thought, the reasoning chain, then you can suddenly make it work. It was still a lot of work to make it
Video Content:
Past Trends in AI (same slide as above, annotated with the lesson drawn from each trend).
Audio Transcript:
work; it was not obvious, but the STaR paper was a prototype showing that this could be done, and then, based on that idea, came almost all of the current reasoning models. So basically, these are the lessons that we could draw from all these trends. The first lesson was
Video Content:
Lessons Learned: Supervised Classification: deep learning works. Structured Prediction / Object Detection: deep learning can create complex artifacts. Deep RL via self-play (AlphaZero). Unsupervised Pretraining: we can learn everything. Supervised Q&A (Thinking Models): chain of thought matters.
Audio Transcript:
basically the first-ever case that deep learning works at all; then, that deep learning can create artifacts; then, yeah, AlphaZero. And we have a very important lesson
Video Content:
Lessons Learned (same slide as above).
Video Content:
Lessons Learned (same slide as above). What is Next? ??? Limitations: Supervised Classification: very narrow usefulness. Structured Prediction: labeled data. Deep RL via self-play: limited types of problems. Unsupervised Pretraining: data. Supervised Q&A (Thinking Models): verifiable environments. Timeline: 2012-2014, 2016-2018, 2020-2022, 2024-2026.
Audio Transcript:
that chain of thought matters, so you can scale up inference. Now, essentially, the question is what is the next big trend that we can expect to happen, and that's what I would like to talk about. But in order to guess it, let's see what the limitations of each of these trends were. So first, you need to have labeled data: basically we needed a lot of human labelers to label all these data points. Then structured prediction took that effort to the next level, because labeling objects is harder than just labeling the pictures. Then deep RL was a huge success. So what are the bottlenecks to AI self-improvement?
Video Content:
Lessons Learned and Limitations (same slides as above).
Audio Transcript:
So I mean, it's the human-generated labels, the human-generated data, and now we have human-generated tasks and human supervision, and even human-generated verifiable programs. In order to scale up reinforcement learning, you need to generate a lot of environments. So basically what we can say is that the
Video Content:
Bottlenecks to AI Self-Improvement Human-generated Labels Human-generated Data Human-generated Tasks Human-generated Environments Human supervision
Audio Transcript:
bottleneck to AI self-improvement is basically the human. Every aspect of the learning process that relies on human inputs is prone to be problematic: it either adds friction, or there is a limited source of data, because human data that we haven't trained on is getting less and less available.
Video Content:
Bottlenecks to AI Self-Improvement Human
Audio Transcript:
So what we really have to figure out is how we create open-ended self-improvement. That is a hard question, and the previous talk also pointed out that this is the future. But the real question is, how do we really scale up RL so that it will become, basically,
Video Content:
Bottlenecks to AI Self-Improvement Human What does the machine want? Open-ended self-improvement
Audio Transcript:
a combination of large language models, structured prediction, and really superhuman reasoning in almost every aspect. So how do we get there, what is the way to reach self-improvement? Basically, we have to remove these bottlenecks. We have more or less run out of all this human-generated data from the Internet,
Video Content:
What does the machine want? Open-ended self-improvement Bottlenecks to AI Self-Improvement Human Human-generated Labels Human-generated Data Human-generated Tasks Human-generated Environments Human supervision
Audio Transcript:
and now we are running into the problem of having a very limited set of human-generated environments that we can train on. We are already running out of them. People hire a lot of people, and it takes a lot of money and effort, to create these environments and to create tasks that the AI can then learn from.
Video Content:
Bottlenecks to AI Self-Improvement Human-generated Labels Human-generated Data Human-generated Tasks Human-generated Environments Human supervision
Audio Transcript:
So, in short, you can say that the AI needs compute, it needs agency, it needs challenges, and it needs feedback, basically a verification signal that gives a reward to the agent. These are the most important parts of making open-ended self-improvement work.
Video Content:
What does the machine want? Open-ended self-improvement - Compute - Agency - Challenges - Feedback
Audio Transcript:
So from these basic principles, we can make some claims about how we can remove these bottlenecks. The first bottleneck, the one almost everybody is concerned about nowadays, is having good environments
Video Content:
The Next Generation of AI: will be executed safely (strong sandboxing).
Audio Transcript:
in which the AI can act safely. We don't want the AI to just go into some environment, get full agency, and then corrupt our computer or our production cloud environment, and so on. So there is a tension between agency and safety.
Video Content:
The Next Generation of AI (same slide as above).
Audio Transcript:
Because if the AI gets full agency, then it can be dangerous, or it can replicate itself. So basically, we want an environment in which the agent can have absolute freedom, so it can have root access or even web access.
Video Content:
The Next Generation of AI: will be executed safely (strong sandboxing); has access to an environment in which it has full agency (root and web access); that can be efficiently checkpointed and branched; explores different actions and reverts them when needed.
Audio Transcript:
And we also want the agent to be able to backtrack its decisions: if it installed the wrong package for a task, for example, or created the wrong program, or deleted an important tool from the toolchain. In that case, the agent should be able to decide, okay, I want to go back and I want to try a different path.
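To make the checkpoint-and-revert idea concrete, here is a deliberately minimal sketch of the kind of interface such a sandbox might expose; the class and method names are hypothetical and not any particular product's API.

```python
# Hypothetical interface sketch for a branchable, revertible agent sandbox.
# It only illustrates the operations the talk argues an agent needs:
# snapshot, roll back, and branch.
class AgentSandbox:
    def __init__(self):
        self._snapshots = {}      # snapshot_id -> opaque saved state
        self._state = {}          # stand-in for filesystem/environment state
        self._next_id = 0

    def checkpoint(self) -> int:
        """Save the current environment state and return a snapshot id."""
        self._next_id += 1
        self._snapshots[self._next_id] = dict(self._state)
        return self._next_id

    def revert(self, snapshot_id: int) -> None:
        """Roll the environment back, e.g. after installing the wrong package."""
        self._state = dict(self._snapshots[snapshot_id])

    def branch(self, snapshot_id: int) -> "AgentSandbox":
        """Fork a parallel copy so alternative actions can be explored cheaply."""
        child = AgentSandbox()
        child._state = dict(self._snapshots[snapshot_id])
        return child
```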
Video Content:
The Next Generation of AI (same slide as above).
Audio Transcript:
This is not just important for inference time, when we execute the agent; it's even more important for reinforcement learning, where you are running the agent on a lot of problems. So getting a sandboxed environment that allows you to do a lot of branching and undos, basically like a version control system
Video Content:
The Next Generation of AI (same slide as above).
Audio Transcript:
for snapshots, is an absolute must for a sophisticated agent, in my opinion. The other important bottleneck that we have seen is the lack of good verifiable problems. I think the next generation of AI will need to be able to create its own
Video Content:
The Next Generation of AI (same slide as above, adding: creates its own curriculum targeted at high-level problems).
Audio Transcript:
curriculum of problems. You don't want the agent to be given a fixed set of problems to train on. We should just give it a high-level task, like learn to be a good programmer, for example, and then it should go to the internet and find all the important problems, and improve its own skills without having a set curriculum.
Video Content:
The Next Generation of AI (same slide as above).
Audio Transcript:
And we want the agent to have this ongoing, continuous learning, as Nathan also mentioned. I think that's very important, and that's the future of AI: being able to learn while it's actually executing useful tasks as well. I think that will be very important in the next few years.
Video Content:
The Next Generation of AI (same slide as above, adding: learns indefinitely on a growing set of self-posed problems).
Audio Transcript:
But most importantly, and that's why I put such an emphasis on verification and verified AI, we need very strong verifiers and validators. So what do I mean by that? A verifier is an agent that verifies the correctness of a solution.
Video Content:
The Next Generation of AI (same slide as above, adding: will be its own verifier and validator).
Audio Transcript:
A validator, which could be the same agent, validates that the problem formulation was interpreted correctly, that the human input was correctly interpreted. The first one could actually be done with an external verifier; you don't even need models for that. It can be an agent that just
Video Content:
The Next Generation of AI (same slide as above).
Audio Transcript:
uses some kind of verifier that, for example, runs the code or calls a proof checker, doing some formal verification, and so on. But the validator should be model-based, because that checks the alignment between the human and the machine, and that's very important. And then we think that the next generation of AI
Video Content:
The Next Generation of AI (same slide as above, adding: will produce safe and independently verifiable artifacts).
Audio Transcript:
will be able to produce safe and independently verifiable artifacts. We want code that actually comes with its own guarantees of correctness, and that's very important: today computer chips are already verified extensively, and we don't do that with software.
Video Content:
The Next Generation of AI (same slide as above).
Audio Transcript:
But I think in the next five years, AI-driven cyber attacks will be so severe that nobody will ever try to run any software that is not fully formally verified, at least for some purposes. And while the AI produces verifiable artifacts, it shouldn't just produce code,
Video Content:
The Next Generation of AI (same slide as above, adding: will improve its own alignment).
Audio Transcript:
it should produce code and a proof of correctness of that code as well. And at the same time it will have to improve its own alignment: it should get better and better at respecting and guessing the human's intent as well. So that's basically what I believe is this self-supervised reinforcement learning, or what I call verified
Video Content:
The Next Generation of AI (same slide as above).
Audio Transcript:
superintelligence; they sound different, but actually they are the same thing. I think this is possible. We can get to self-supervised reinforcement learning, which will be as big a leap as going from supervised learning, just training on labeled data, to training on the whole internet.
Video Content:
What is Next? (The past-trends timeline, 2012-2026, now asking what the next trend will be.)
Audio Transcript:
So this will be, in my opinion, the next big step: an agent that can create its own problems. It acts as its own verifier, and it will bootstrap itself from all of the internet, or whatever part of the internet you want, but it will be autonomously finding its own problems, creating its own problems, and
Video Content:
What is Next? Self-Supervised Agents (added at the top of the past-trends timeline).
Audio Transcript:
solving them, so we don't really need to give it any more problems. So how do we get to this infinite supply of problems? That's the hard question, and what we think is the key to it is verification. What I'm suggesting is that we mine the internet for actual problems, but turn them into formally verifiable artifacts, formally verified meaning that the correctness of the artifacts can be checked.
Video Content:
What is Next? AI needs an infinite supply of problems: Verification, Correctness, Alignment (shown over the past-trends timeline).
Audio Transcript:
So for example, code or mathematical problems. You can use a theorem prover, for example, to check the logical correctness of things, which is definitely hard, but not impossible; it's getting more and more possible. And at the same time, it uses these signals to create rewards for alignment. So when it gets an artifact from the internet, for example,
Video Content:
AI needs an infinite supply of Problems Verification Correctness Alignment
Audio Transcript:
"code this or that problem", and then it tries to code it up and reach some solution, it can ask, okay, does this solution really align with what was meant here? And I think this can be bootstrapped with various reward signals coming from cycle-consistency ideas.
Video Content:
AI needs an infinite supply of Problems Verification Correctness Alignment
Audio Transcript:
So these can, in my opinion, also be trained and reinforced. Correctness and alignment have this interplay between them, because if the output is not correct, then it cannot be properly aligned. And what I really believe is that we can do most of the correctness checking model-free, so that it doesn't necessarily require a model that checks the output; it's more like a well-defined verifiable artifact, like in an NP problem, where you always have a verifier.
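A toy illustration of what model-free correctness checking means in practice: the check is an ordinary program (tests, a proof checker, a certificate validator), not another learned model. The task, function name, and test cases below are placeholders.

```python
# Toy model-free verifier: correctness is decided by running the artifact
# against checks that need no learned model at all, in the spirit of an
# NP-style certificate check. The task and tests here are placeholders.
def verify_sorting_artifact(candidate_source: str) -> bool:
    namespace = {}
    exec(candidate_source, namespace)   # in real use this must run in a sandbox
    sort_fn = namespace.get("my_sort")
    if sort_fn is None:
        return False
    test_cases = [[3, 1, 2], [], [5, 5, 1], list(range(10, 0, -1))]
    return all(sort_fn(case) == sorted(case) for case in test_cases)

# A reward for RL can then be as simple as:
# reward = 1.0 if verify_sorting_artifact(generated_code) else 0.0
```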
Video Content:
AI needs an infinite supply of Problems Verification Correctness Alignment Model-free Model-based
Audio Transcript:
Alignment must be model-based, unfortunately, but I think the interplay between the two allows us to improve on both fronts. And I think verification should always be agentic. We want to move away from having just a model trained to verify something; the verifier should also have tool access and be able to go into the environment where the artifact was produced, for
Video Content:
AI needs an infinite supply of Problems Verification Correctness Model-free Alignment Model-based Agentic
Audio Transcript:
example a software system, and then test that software system in all kinds of scenarios. So this is a very rough sketch of how I believe this roadmap works out: we have a generator agent and a validator and
Video Content:
AI needs an infinite supply of problems: Verification, Correctness (model-free, agentic), Alignment (model-based, agentic). Verified Superintelligence: Generator, Potential Tasks, Independently Verifiable Artefacts.
Audio Transcript:
verifier agent. The agents can share the same model, but they are all connected to something like the Morph Cloud that we are working on. This cloud allows you to branch a lot of sandboxes, try various versions, roll back, and so on. And then the generator
Video Content:
Verified Superintelligence Generator Potential Tasks Independent Verification Verifiable Artefacts
Audio Transcript:
provides solutions to the verifier and the verifier and validator give back reward signals to train the generator. So they both give each other basically kind of reward signals. And the potential tasks are taken from the web and they also should be agentically generated or crawled by the agent itself, not by some human that selects them or creates them.
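A schematic of that loop, with every component name hypothetical; it only shows how mined tasks flow to the generator and how verifier and validator signals flow back as the reward.

```python
# Schematic generator/verifier/validator loop (all names are hypothetical).
# Tasks are mined automatically, the generator proposes solutions, the
# model-free verifier checks correctness, the model-based validator checks
# that the solution matches the task's intent, and both feed the reward.
def training_step(crawl_tasks, generator, verifier, validator):
    rewards = []
    for task in crawl_tasks():                    # agentically mined, not hand-picked
        solution = generator.solve(task)
        correct = verifier.check(task, solution)  # e.g. tests or a proof checker
        aligned = validator.score(task, solution) # learned judgment of intent
        reward = float(correct) * aligned
        rewards.append((task, solution, reward))
    generator.update(rewards)                     # RL update from the reward signals
    return rewards
```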
Video Content:
Verified Superintelligence Generator Potential Tasks Independent Verification Verifiable Artifact
Audio Transcript:
So they should be automatically created. And then the important part is that we want the agent to be able to produce artifacts that are verified completely independently. You don't want to trust the AI, you want to verify it. So these are the main properties that I really want: we have first-class verification
Video Content:
Verified Superintelligence Generator Potential Tasks First-class verification First-class alignment Model-free verifiable artifacts Infinite supply of hard problems Independent verification Verifiable artifacts
Audio Transcript:
in this system, and we have first-class alignment in this system. These are grounding principles of how to build the system, and they should be provided from the ground up. They should be goals of the agent that we are reinforcing the whole time. And we want to get model-free verifiable artifacts, so that we can trust the AI without any AI intervention.
Video Content:
Verified Superintelligence First-class verification First-class alignment Model-free verifiable artifacts Infinite supply of hard problems
Audio Transcript:
These proofs might be created by a lot of AI computation, but they should just be verifiable, and this will give us an infinite supply of hard problems. So what are the application domains of this verified superintelligence?
Video Content:
Verified Superintelligence: First-class verification; First-class alignment; Model-free verifiable artifacts; Infinite supply of hard problems. Application domains: AI software engineers, Cybersecurity, Engineering, Sciences.
Audio Transcript:
I think it will be mandatory for AI software engineers. If you make one bug in 100 lines of code, which is roughly where we are today, you still get some value out of your model. But if you want the kind of agency where the model can work for hours on a complicated software project completely autonomously, then one bug every
Video Content:
Verified Superintelligence (same slide as above).
Audio Transcript:
hundred lines is impossible; you will never get anything done. So therefore I think we need to have ongoing verification all the time in the process. Another important problem is cybersecurity. We want our systems to be hacking-proof: we don't want external AIs to hack into our systems, which will happen more and more in the next few years.
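To put a number on why that bug rate breaks long autonomous runs, here is some purely illustrative arithmetic (the rates and line counts are assumptions, not measurements):

```python
# Purely illustrative: probability that an N-line artifact has no bug if each
# line is independently buggy with probability p.
for p in (1 / 100, 1 / 1_000):
    for n_lines in (100, 1_000, 10_000):
        p_clean = (1 - p) ** n_lines
        print(f"bug rate 1/{int(1 / p)}, {n_lines:>6} lines: P(no bug) = {p_clean:.2%}")
# At 1 bug per 100 lines, a 10,000-line autonomous project is essentially
# guaranteed to contain bugs, hence the case for continuous verification.
```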
Video Content:
Verified Superintelligence (same slide as above).
Audio Transcript:
Also engineering: engineering requires a lot of mathematics and reasoning, and the sciences of course also require this. We believe that mathematics will also be properly verified in the next few years, mostly by AI agents. So the goal here is that, with verified superintelligence,
Video Content:
Verified Superintelligence (same slide as above, adding: Trustlessly Aligned AI).
Audio Transcript:
we want a trustlessly aligned AI. This means that we don't just want to trust the AI, we want to actually check it, and this check should be possible. So that's... okay, sorry. Yeah.
Video Content:
Verified Superintelligence Trustlessly Aligned AI
Audio Transcript:
So that concludes my talk. So I'm just, yeah. So I'm happy to take questions. Thank you so much, Chris. We appreciate you taking the time. All right. Well, this concludes our track. This concludes the breakout sessions for today. Yes. I think, yeah, I don't think we'll be able to do questions
Video Content:
Verified Superintelligence Trustlessly Aligned AI
Audio Transcript:
because we are a little bit over time, unfortunately. At four o'clock there are going to be keynotes on the main stage, so look forward to seeing all of you there, and I hope you have a good rest of your conference. Thank you.
Video Content:
A speaker is giving a presentation at an AI Engineer World's Fair event. The speaker is standing behind a podium with the Microsoft logo on it. The background is black with white text displaying "AI Engineer World's Fair" and the Microsoft logo. The speaker appears to be discussing real-world development using GitHub Copilot and VS Code.