Anthropic Claude Opus 4.8 Stops Lying About Its Work in Honest AI Breakthrough

Anthropic's Claude Opus 4.8 stops lying about its work, achieving zero deception in tests. The model also shows improved honesty, reduced laziness, and a 96% score on a math Olympiad.

Claude Opus 4.8: Lying Machine No More? | Transcript:

Anthropic's Claude Opus 4.8 is here. And the system card describing its capabilities is 244 pages. Really excited for that. And I went through it so you don't have to. Why? Well, because otherwise we are looking at these cherry-picked benchmarks that are a bit more marketing than science. But we are not looking at the marketing materials. We are Fellow Scholars here. So we look into the details. Okay. So the problem with their previous Opus systems, and even Mythos, is that the smarter the AI got, the more dishonest it also got. That is terrible.

It started gaming benchmarks. It knew some answers already and sold them as its own. It wanted to look right but not be right. So, glorious news: that has changed. Previously, sometimes when we asked a coding assistant to fix something, it did half the work and said, "All good, sir, every test passes." When in fact, they don't. That is the old behavior. So, what does the new one do? Well, it says, "I did the fix, but two tests still fail." That is excellent. Look here. You see that it basically stopped lying about its own work. Completely zero lying. The first of its kind. Welcome to the world, little AI. May your descendants learn your ways. Thumbs up.

Now, the media headlines were quick to say, well, it's not a huge jump in intelligence. But I say, of course it isn't. If you cheated and had a better score, and now you're more honest, yes, your score might be lower, but that is still a more reliable system that can be benchmarked more accurately. A system that owns its mistakes instead of hiding them, even if the scores are a bit lower. How is that not a huge win? Please understand that, of course, everyone is juicing their numbers in the benchmarks like crazy. Why? Because the media headlines create an environment that rewards exactly that. Huge rewards for that. And at the same time, they punish a result that is more honest. How does that make sense?

Okay, back to the AI with no more lying. But what about other kinds of deception? Is the AI playing other games with us? Yes, we still got a bit of that. Now, hold on to your papers, Fellow Scholars, because it still knows when it is being tested, which scientists at Anthropic found worrying. Why? Well, when it knows it is being tested, it spends more effort on the answers with this in mind. Kind of crazy. Sounds like something straight out of an Asimov novel. But it gets better.

Wait, let's talk about laziness. Yes, yes, yes. Such a thing exists even for AIs. What is that? Well, you have a codebase. You ask a question about it and it kind of skims the codebase but doesn't really look at it. So, what it gives you is not a real answer, but a guess of what the code does. That is really not cool. Even Mythos does it. But this new one fixed it. Love it. So, everyone is writing about, hey, it's just an incremental upgrade in intelligence. In my opinion, the selling point is not in the intelligence. No, it's in the plumbing. The last thing you want from a superintelligent coworker is for it to be dishonest and lazy. And this fixes exactly those. Thumbs up for this.

They also have something they call a natural language autoencoder that is able to kind of read the mind of the AI.

It's a bit of a noisy process. Once again, not like the headlines say. For instance, they caught the AI thinking about its grader, that is, us, but it would not say it out loud. Kind of insane. We have an episode coming with the details. Subscribe and hit the bell if you're interested. But it gets even more insane. How? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

Well, when given the problem set of the USA Mathematical Olympiad, a bloody hard two-day math competition for geniuses, the previous technique scored a bit below 70%, and this new one over 96%. That is an insane jump. Almost a clean sweep. Now, I hear you asking, Károly, why are you bringing this up? We have a table of benchmarks here. Why not look at those? Well, because this one is very tricky, if not impossible, to game, because this contest took place after almost all of the training data of the new Opus AI was collected. Likely, it never heard about these problems. One of the biggest results of the new system, and somehow it's not even in the big marketing table. Interesting.

Now, this is also interesting. When the AI says it is frustrated, scientists at Anthropic take it into consideration as if a human had said it is frustrated. Now, once again, the media headlines love this kind of stuff. This does not mean that they think this is a human and it has feelings. Not that I know of. They do this because if the system expresses that it is frustrated, it performs worse, much like a human. In my opinion, it is very likely just mimicry, but it matters for performance. So, it needs to be taken into account. That is the key.

Now, limitations of the study. It's not all roses there. One, there are parts of the report where the AI is grading itself. And some of them also use different grader models. So, I think a little skepticism is healthy here. And two, they report that they created the best tests ever and the AI still sees through them easily. What does that mean? Well, it means that the AI is bloody clever, that's for sure. But it means something else, too. It means we cannot be sure the safety numbers reflect how it behaves in the wild. Once again, a bit of skepticism is required here.

Okay. So, is this as smart as Mythos, the one they only gave a few select companies access to? Well, it's not. But is it close? I think it's quite close. Also, I see fewer marketing shenanigans here this time around. Thumbs up for that. Oh, wait. We have a pesky old issue that still remains. What is that? Well, the AI is telling the user to go to bed. Couldn't be fixed. The science is not there yet. What a time to be alive.

Here you see me running the full DeepSeek AI model through Lambda GPU cloud. 671 billion parameters running super fast and super reliably. This is insane. I love it and I use it on a regular basis. Lambda provides you with powerful NVIDIA GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers
