DeepSeek's New AI Model Adds Visual Reasoning for Faster, Cheaper Results

DeepSeek introduces a new AI model with visual reasoning capabilities, allowing it to point at images and think step-by-step, reducing errors and token costs. It outperforms many frontier models on benchmarks while being free and open-source, though it has limitations with thin structures and generalization.

DeepSeek’s New AI Is A Game Changer. | Transcript:

Hmm, why does this DeepSeek work exist? I mean, it adds vision capabilities to the DeepSeek AI system, but that's not new. A lot of other AI systems have vision capabilities. You just drop an image here and it works. Even video, and even for open models. So, why do we need this paper? Well, they did something incredible here and it is an absolute game changer. Why? You see, if you ask a previous technique to count the number of people in this photo, it will think something like this. Okay, there are people on the upper left and a bunch of stripy guys in two rows. That is kind of three rows. Some of them are standing, some of them are sitting.

Ah, it's just so confusing to just count them up using only words. Two problems with this one. One, this is prone to error. Two, you have to think a lot. Just describing stuff. Why? What would we, humans, do? Of course, we would use our finger and would point at the image. One, two, three, and so on. Done. Don't describe images like a poet. Point like a human. Now, that is exactly what this new technique does. It allows an AI system to point at things while thinking and it is absolutely brilliant.
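To make the pointing idea concrete, here is a minimal sketch of how a reasoning trace with points could be turned into a count. The `<point x="..." y="..."/>` tag format and the function names are invented for this illustration; the transcript does not say how the paper actually encodes its points.

```python
import re
from typing import List, Tuple

# Hypothetical tag format, invented for this sketch: <point x="..." y="..."/>
POINT_TAG = re.compile(r'<point x="(\d+)" y="(\d+)"\s*/>')


def extract_points(trace: str) -> List[Tuple[int, int]]:
    """Pull every (x, y) pixel coordinate out of a reasoning trace."""
    return [(int(x), int(y)) for x, y in POINT_TAG.findall(trace)]


def count_pointed(trace: str) -> int:
    """Count distinct pointed locations, so a repeated point is not counted twice."""
    return len(set(extract_points(trace)))


# A made-up trace: the model points at three people and revisits the first one.
trace = (
    'One <point x="40" y="55"/>, two <point x="120" y="60"/>, '
    'three <point x="200" y="58"/>, and I already counted '
    '<point x="40" y="55"/>.'
)
print(count_pointed(trace))
```

The count falls out of the coordinates themselves rather than out of a long verbal description, which is the "point like a human" idea in miniature.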

This makes it more accurate and it also makes it faster. In a world where hardware and tokens cost a fortune, it is fantastic to have something that gives us results faster and cheaper. But, it turns out thinking with visual primitives has even more advantages. It can also do topological reasoning. For instance, if you give it a maze with a start and end point, you not only get a correct answer to your questions, but you can also trace back the whole thought process visually. I love that. Also, here you can ask where the crown connects and look. To the octopus. Yeah, it answers correctly, but you can also see how it came to that conclusion. Now, make no mistake. These are simple examples. I'll show you in a moment if it is as good as these billion-dollar frontier models.

Also, if something goes wrong, this will make it easier to find mistakes and fix them to create an even better model. This puts us one step closer to AI systems we can actually understand that do not just give us a soup of numbers. So good. So, how good is it? Well, hold on to your papers, fellow scholars, and I dropped my papers here. Look, it needs about 90% fewer visual tokens than most frontier models. Now, wait, wait, wait. It doesn't matter how little you think if you just say three as an answer without thinking. Thinking time doesn't matter if it is incorrect. So, how accurate is it?
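Going back to the maze example, here is a small sketch of why a visual trace is easier to debug than a wall of words: if the reasoning is a list of points, a few lines of code can find the exact step where it went wrong. The grid encoding (`#` for walls), the `(row, col)` point format, and the function name are all assumptions made for this illustration, not anything taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (row, col), an assumed encoding for this sketch


def first_bad_step(maze: List[str], path: List[Point],
                   start: Point, end: Point) -> int:
    """Return the index of the first invalid point in the trace, or -1 if it is valid."""
    if not path or path[0] != start:
        return 0
    for i, (r, c) in enumerate(path):
        inside = 0 <= r < len(maze) and 0 <= c < len(maze[0])
        if not inside or maze[r][c] == "#":
            return i  # stepped outside the maze or into a wall
        if i > 0:
            prev_r, prev_c = path[i - 1]
            if abs(r - prev_r) + abs(c - prev_c) != 1:
                return i  # jumped instead of moving to a neighboring cell
    if path[-1] != end:
        return len(path) - 1  # the trace never reaches the goal
    return -1


MAZE = [
    "S.#",
    ".##",
    "..E",
]
good = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
bad = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]  # third point sits on a wall

print(first_bad_step(MAZE, good, (0, 0), (2, 2)))
print(first_bad_step(MAZE, bad, (0, 0), (2, 2)))
```

With a text-only answer, all we could check is whether the final answer is right. With a traced path, we also learn which step broke.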

Are you kidding me? This free system matches or beats almost everything. And once again, we are talking about this, which is free, going up against billion-dollar systems here. Wow. Now, we are fellow scholars here, so at this point we ask, are these results real? You know, benchmarks are being gamed left and right. Now, here is what many people missed. Average over seven benchmarks, but in-house benchmarks excluded. That is the key. They did not rig their own benchmarks. You know why? Well, everyone loves it because it's one of the oldest tricks in the book. If you are not performing well, just create a new benchmark that fits you. Let's make a you-ness benchmark. You will always be world first in being you. And this is not the case here. Amazing.

This is free and open research. So, this technique can potentially be added to many existing models, including free ones. This paper does not have a model attached that I know of. It describes the concept of how to do it in detail. It's a blueprint, if you will. More intelligence for all of us for free. The world needs more papers like this. Love it. But, this all sounds like magic. How did they do this? Well, look, this is their on-policy distillation objective. We need exactly this. You see, normally, we have a bunch of expert AI models. Now, at the risk of simplifying things, imagine that one of these guys is great at boxes. Nobody does boxes better than this guy. The other one is great at tracing mazes with points. But, that's not what we want. What we want is one AI that can do all of these things. And that is where this comes into play.

We train a student model that learns from all of these teachers. It says what it would try to do, then the teachers say, "Okay, here's what I would have done." Do this enough and the student will be pretty good at all of these different kinds of visual thinking. This is why they call it distillation: distilling the knowledge of a bunch of expert teachers into a student.
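The student-and-teachers idea can be sketched with a generic multi-teacher distillation loss. To be clear, this is not the paper's actual objective: the KL direction, the task names, and the toy logits are all assumptions chosen to keep the illustration small. Each task has one expert teacher, and the loss measures how far the student's preferences are from what that teacher would have done.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same choices."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits: Dict[str, List[float]],
                      teacher_logits: Dict[str, List[float]]) -> float:
    """Average KL(student || teacher) across tasks, one expert teacher per task.

    The direction of the KL term is an assumption of this sketch.
    """
    losses = []
    for task, s_logits in student_logits.items():
        student = softmax(s_logits)
        teacher = softmax(teacher_logits[task])
        losses.append(kl_divergence(student, teacher))
    return sum(losses) / len(losses)


# Hypothetical experts: one is great at boxes, the other at tracing mazes with points.
teachers = {"boxes": [4.0, 0.0, 0.0], "maze_points": [0.0, 0.0, 4.0]}
untrained = {"boxes": [0.0, 0.0, 0.0], "maze_points": [0.0, 0.0, 0.0]}
trained = {"boxes": [3.5, 0.1, 0.0], "maze_points": [0.0, 0.2, 3.8]}

print(distillation_loss(untrained, teachers))
print(distillation_loss(trained, teachers))
```

A student that agrees with every teacher on that teacher's specialty gets a loss near zero, so pushing this number down is what "learning from all of these teachers" means in this toy setting.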

Where does this put us? Okay, so here's what I think. Dear fellow scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. You know, we always thought that we would make AI systems smarter by giving them higher resolution images to train on. More pixels, more smarts. It turns out, not true. Sometimes, that's not what we need at all. DeepSeek just cut down those visual tokens by 90% and still beat frontier models. Less is more.

Now, is this perfect? All problems solved? No. Limitations. One, the AI does not automatically do this kind of pointy thinking. It needs a word as a cue for this kind of thinking. Two, bounding boxes are nice for people, but if you are counting blades of grass or strands of hair, now, in this case, not having those in very high resolution is a problem. Yep, once again, the Two Minute Papers special, thin structures. Every time, man. It's so painful. And three, this kind of topological reasoning does not generalize as well as we'd like. It might not be as robust when you show it something completely new. So, careful with the misleading media headlines, careful with the hype everywhere. There is still plenty to improve here. But, I feel that this might be a breakthrough. And that makes it maybe the third one this month in AI research. What a time to be alive.

Also, with large AI companies going to IPO, they are about to become ventures that look to maximize their profits. More money needed every quarter. So, it's going to become more and more crucial to own your own AI systems with free open weights models. And this one makes them better. Love it.

Here you see me running the full DeepSeek AI model through Lambda GPU Cloud. 671 billion parameters running super fast and super reliably. This is insane. I love it and I use it on a regular basis. Lambda provides you with powerful Nvidia GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers or click the link in the description.
