DeepMind AI Solves Nine Unsolved Math Problems from Erdős Collection

DeepMind's AlphaProof Nexus solved nine open math problems from Paul Erdős's collection, using a tournament system with Lean formal verification to ensure correctness.

DeepMind’s New AI Found A Strange Way To Think. | Transcript:

DeepMind's new AI just did something amazing, or did it? You see, there was a legendary mathematician called Paul Erdős, fellow Hungarian, who left more than a thousand open problems to the world to solve. Look, we Hungarians have a lot of problems. We got to contribute somehow, and this is our way of doing it. Now, DeepMind's new AI called AlphaProof Nexus tried to solve about 350 of them and came up with a 95.7% failure rate. Basically, it solved nine, and it only cost a couple hundred dollars per problem. Is that good? Well, I got to say, that is incredibly super good. Why? Well, these are decades-old problems that were not solved by anyone yet.

The other line of criticism I hear is that this did not do fundamentally new things. Is that a problem? I think not. Why? Well, let's look back to 4 years ago. GPT-3. People said, "Well, it can't even add numbers together reliably." Then, 2 years ago, people said, "Well, it can't even solve high school competition problems reliably." Then, 1 year ago, people said, "It can't even win the Mathematical Olympiad gold medal reliably." And today, they are saying, "Well, it can't even solve 50-year-old unsolved problems reliably." Do you see where this is going? It is clear as day. Please apply the first law of papers here. It says, "Do not look at where we are. Look at where we will be two more papers down the line." And this result is absolutely amazing, stunning, even.

So, how did they do it? How is that even possible? Dear fellow scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Normally, you would reach out to some AI assistant to take a crack at it, but it won't solve it because these assistants hallucinate and make things up. To avoid that, they make it use Lean, a formalized mathematical language where it's easy to check whether your proofs are correct. Is this new? Not at all. Everyone is doing that today.

Okay, so what's new here? Look, first, a mathematician writes down the problem in Lean, and the solution, the proof, is left blank. Then, the AI agent tries to solve it. Of course, it fails. Too hard. Then, another AI checks it and says, "Mhm, this is not great." But, it also says why it's not great. But, here's the key, this guy right here. This is a cheaper judge AI that reads two previous solutions and picks a winner. Both solutions can be wrong, but it picks the one that is a bit better.

Now, this is genius. Why? Well, hold on to your papers, fellow scholars, because it's kind of like a chess system where the solutions are the players and each of these players gets an Elo score, named after Arpad Elo, also a fellow Hungarian. Look, sometimes we provide solutions, too. So, each proof now has a score. And now, we start again. But, not from scratch. No, no, no. We start out from the highest scoring bad solution. So, this is now a tournament. Do this over and over again. So cool. And now, we keep running and running this tournament until the validator says, "Yep, this one checks out." And then, we have a formal proof. Nailed it.

This is incredible because it takes an unreliable AI, runs it over and over again, and it can lie its rear end off as much as it wants, and we still get a reliable system out of this. A reliable system built out of unreliable parts. I love that. And the fact that they put all this research out there in the open for free for all of us.
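The tournament loop described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not DeepMind's implementation: the names `Attempt`, `toy_prover`, `toy_judge`, `toy_verifier` and `run_tournament` are hypothetical, a hidden integer `quality` stands in for "how close a proof attempt is", the starting rating of 1000 and `K = 32` are assumed defaults, and the rating update is the standard chess Elo formula.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

K = 32  # Elo step size (an assumed, commonly used default)


@dataclass
class Attempt:
    quality: int            # hidden toy stand-in for "how close the proof is"
    rating: float = 1000.0  # Elo rating, assumed starting value


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(winner: Attempt, loser: Attempt) -> None:
    """Move rating points from the loser to the winner after one judged match."""
    gain = K * (1.0 - expected_score(winner.rating, loser.rating))
    winner.rating += gain
    loser.rating -= gain


def toy_prover(parent: Attempt, rng: random.Random) -> Attempt:
    """Hypothetical unreliable prover: a noisy revision of its parent."""
    return Attempt(quality=max(0, parent.quality + rng.randint(-2, 3)))


def toy_judge(a: Attempt, b: Attempt, rng: random.Random) -> Attempt:
    """Hypothetical cheap judge: usually, but not always, picks the better attempt."""
    better, worse = (a, b) if a.quality >= b.quality else (b, a)
    return better if rng.random() < 0.8 else worse


def toy_verifier(a: Attempt, target: int) -> bool:
    """Stand-in for the formal checker: a strict pass/fail test."""
    return a.quality >= target


def run_tournament(
    target: int = 20, max_rounds: int = 2000, seed: int = 0
) -> Tuple[Optional[Attempt], int]:
    rng = random.Random(seed)
    pool = [Attempt(quality=0)]
    for round_no in range(1, max_rounds + 1):
        # Restart from the highest-rated failed attempt, not from scratch.
        parent = max(pool, key=lambda a: a.rating)
        child = toy_prover(parent, rng)
        if toy_verifier(child, target):
            return child, round_no  # the checker accepted it: stop
        # Otherwise the judge compares two failed attempts and Elo is updated.
        opponent = rng.choice(pool)
        winner = toy_judge(child, opponent, rng)
        loser = opponent if winner is child else child
        update_elo(winner, loser)
        pool.append(child)
    return None, max_rounds  # budget exhausted without a verified attempt
```

Calling `run_tournament()` returns the verified attempt and the round in which it passed, or `None` if the round budget runs out. The point of the sketch is the division of labor: the prover and the judge are allowed to be noisy, and only the verifier decides when to stop.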

Chef's kiss. Thank you so much to everyone who worked on this. What a time to be alive.

But wait, interestingly, the story of AI so far has been that we make it smarter. Now, the story has changed. We don't need to make it smarter, we need to make the harness around it tighter. Give it a good judge. Let it try a thousand times and it will slowly work out the right solution to incredibly hard problems. So here, the intelligence is not just in the model, but it is in the loop around it. Everyone is experimenting with different kinds of loops and it is super fun. I do it too on Lambda.

Okay, not even this technique is perfect. Limitations. In other words, the stuff that you don't hear about in mainstream media. So one, why not test on the full 1200 Erdős problems? Well, there is a little selection bias here. I think they took a subset of 350 that was easier to formalize. Is that a problem? In my eyes, not at all. You got to start somewhere. Let's not be one of those people who say, "Well, it can't even solve the 50-year-old unsolved problems reliably." What it has achieved is incredible.

Now, two, smaller models solved zero problems. Zero. Nothing. You still need a beefy AI system at the core. That is an interesting case because people keep showing these benchmarks where the super fast cheap model is just a couple percentage points away from the frontier. And whenever I try them, they always seem a great deal weaker. This seems to reinforce that. Also, people will probably start thinking, do I use a larger model with fewer tournament rounds or do I use a smaller one with more? Assume that they cost the same. Interesting question.

Now, where does this put us? Well, an AI just solved nine math problems that no human could crack in 56 years for a couple of hundred dollars each, and they did it by letting an unreliable AI fail thousands of times against a judge that cannot lie. And we went from "can't even add numbers" to solving decades-old open problems in the span of four years. And I think that is insane. But, limitations apply. Also, models used to be the only thing that matters. Now, harnesses, loops around them, also matter.

Now, I recently talked to Pushmeet, one of the leaders of the project, and he's amazing. I am just a student who loves to travel the world and tries to learn from incredible scientists like him and bring that knowledge to you fellow scholars. And it is a huge honor for me to be able to talk about it to such a super smart audience as you fellow scholars. Subscribe and hit the bell if you feel that this is the way of doing it. Thank you so much for being with me all these years and over more than a thousand videos.

We need new tools for the era of LLMs, and Weights & Biases now has Weave, a lightweight toolkit to confidently iterate on LLM applications. Use traces to debug how data flows through each step of your app, and use evaluations to measure your progress. It is the best. Try it out now at wnb.me/papers, or click the link in the description below.
