NVIDIA's New AI Model Processes Video at Nearly 10x Real Time

NVIDIA has released a free, open AI model with 30 billion parameters that takes images, video, and audio as input. It processes almost 10 hours of video per hour, which is almost three times faster than Qwen3 Omni, and it is up to seven times faster on documents. The model cuts cost with five techniques: layers that scale linearly with context length, direct audio tokenization, 3D convolutions over blocks of video frames, a single small encoder distilled from three vision models, and efficient video sampling. It needs about 25 GB of video memory, so it fits on a high-end desktop GPU. It ships under NVIDIA's own license rather than Apache 2.0; commercial use and derivative works are allowed, with some attribution requirements and stricter patent terms.
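To make the scaling claim concrete, here is a minimal sketch of how relative cost grows with context length for a layer that scales linearly versus one that scales quadratically, as standard attention does. The function names and unit costs are illustrative assumptions, not figures from NVIDIA; only the growth rates reflect the claim.

```python
def quadratic_cost(tokens: int) -> int:
    """Relative cost of a layer that compares every token with every other token."""
    return tokens * tokens


def linear_cost(tokens: int) -> int:
    """Relative cost of a layer that does a fixed amount of work per token."""
    return tokens


def advantage(tokens: int) -> float:
    """How many times cheaper the linear layer is at a given context length."""
    return quadratic_cost(tokens) / linear_cost(tokens)


if __name__ == "__main__":
    # Hypothetical context lengths, chosen only to show the trend.
    for tokens in (1_000, 10_000, 100_000):
        print(f"{tokens:>7} tokens: linear layer is {advantage(tokens):,.0f}x cheaper")
```

Under these toy assumptions the advantage grows in proportion to the context length, which is why longer videos and larger document sets benefit most.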

NVIDIA New AI Is An Efficiency Monster. | Transcript:

Hmm, 30 billion parameters in a new, free, open AI model where images, video, and audio all work. Hmm, why? There are a bunch of other free systems around in this area, like the amazing Gemma 4. So, what does this do better than those? Two words: throughput and cost efficiency. Okay, what does that mean in practice? Now, hold on to your papers, fellow scholars, because it processes almost 10 hours of video per hour. Whoo, that is nearly 10 times real time. That is insanely quick. Wow, almost three times faster than Qwen3 Omni. And when processing documents, it gets up to seven times faster.

To run it locally, you'll want something like this or a beefy desktop GPU. We're talking about 25 gigs of video memory, not something you run on your phone. And to run it in the cloud, I use Lambda. Okay, so how did they do that? Where's the magic sauce? Well, it does five things really well and one thing not so well. Dear fellow scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

Well, one, Mamba layers scale linearly with context length instead of quadratically. What does that mean? Well, it means you throw everything you got at it. The more documents you have, the longer video or audio you have, the bigger the advantage this one has. So, if you're running something online that processes those on a mass scale, this is going to be incredible.

Two, when audio comes in, this side converts raw audio waves into tokens, but differently than elsewhere. Normally, you have a speech recognition model here. Those are often huge and expensive, and they strip away all emotion and tone from the input. But this one keeps all of that data and still does the job well. So much cheaper than running a whole separate model like Whisper on top.

Three, when you give it an image or video, many previous-generation techniques smash it into a different aspect ratio. This one keeps it. Then, oh, look at this. Convolutions in 3D.

Now we're talking. Many other techniques look at the video frame by frame. It takes tons and tons of computation to finish these videos. Here, the 3D convolution looks at blocks of frames. It looks at a package of frames at the same time, and thus it can compress it a great deal. Faster, cheaper.

Four, now that's really interesting, somewhat unexpected. You would expect a huge standalone CLIP model here. These essentially predict what text would match the image well. You need that here, too. But, here's the trick. Not one standalone CLIP model. Nope, this one distills down three models. One for matching images to text, one for fine details, and one for object segmentation. Now, all three of these are smashed down into one small encoder neural network. Once again, super efficient.

Five, efficient video sampling. This is a good one. At this point, we have thrown, let's say, a video with 300 images into the neural network. That's still a lot of data, but it turns out not all frames are completely unique. Many of them share the same background, for instance. And this one finally throws away this duplicate information. And it makes it, you guessed it right, even cheaper and more efficient.

Okay, scholarly question. So, what is the license attached to it? What I would love to see is Apache 2.0, which is highly permissive, and I don't see it here. It has its own license. That's usually not great news, but in this case, it's better than I thought. Derivative works and commercial use are fine. On the other hand, it needs a bit of attribution and is a little stricter on patent grants. If Apache 2.0 were a 10 out of 10, this is a seven out of 10, in my opinion.

And we don't shy away from talking about limitations here. So, anything else? Oh, yes. If you're doing pure text reasoning or pure coding, I would probably look elsewhere. It is not the number one smartest open model. No. But, if you need multimodal input, like audio or video, processed super fast and super cheap, this is the one.

So, we now have free and open AI models that we can own and run ourselves, which is only going to get more and more important in the future. And since we have so many models, they are starting to specialize. They are becoming good in different directions. So, better models and more value for us fellow scholars, for free. Sign me up for that. Hugely appreciated. What a time to be alive.

Here you see me running the full DeepSeek AI model through Lambda GPU Cloud. 671 billion parameters, running super fast and super reliably. This is insane. I love it and I use it on a regular basis. Lambda provides you with powerful NVIDIA GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers or click the link in the description.
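The efficient video sampling idea from item five can be sketched as follows. This is a toy illustration, not the model's actual method: frames are flat lists of pixel intensities, and `drop_near_duplicates` and its `threshold` parameter are hypothetical names chosen for this example. A real system would likely compare smaller regions of each frame rather than whole frames.

```python
from typing import List, Sequence

Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def drop_near_duplicates(frames: List[Frame], threshold: float = 0.05) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: List[int] = []
    for index, frame in enumerate(frames):
        # Always keep the first frame; keep later ones only if they changed enough.
        if not kept or mean_abs_diff(frames[kept[-1]], frame) > threshold:
            kept.append(index)
    return kept


if __name__ == "__main__":
    static = [0.2, 0.2, 0.2, 0.2]   # made-up frame with an unchanging background
    changed = [0.9, 0.9, 0.2, 0.2]  # made-up frame where half the pixels moved
    video = [static, static, static, changed, changed]
    print(drop_near_duplicates(video))
```

Only the kept frames would then be passed on to the encoder, so a mostly static video costs far less than its raw frame count suggests.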

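The saving from looking at blocks of frames, as described in item three, comes down to token counting. Here is a rough sketch of that arithmetic, not of the convolution itself. The 300-frame video is the example from the transcript; the patches per frame and frames per block are hypothetical values, since the video gives neither.

```python
import math


def tokens_frame_by_frame(num_frames: int, patches_per_frame: int) -> int:
    """Token count when every frame is encoded on its own."""
    return num_frames * patches_per_frame


def tokens_with_temporal_blocks(
    num_frames: int, patches_per_frame: int, frames_per_block: int
) -> int:
    """Token count when each block of consecutive frames shares one set of patch tokens."""
    if frames_per_block < 1:
        raise ValueError("frames_per_block must be at least 1")
    num_blocks = math.ceil(num_frames / frames_per_block)
    return num_blocks * patches_per_frame


if __name__ == "__main__":
    frames = 300   # from the transcript's example
    patches = 256  # assumed, for illustration only
    block = 2      # assumed, for illustration only
    print(tokens_frame_by_frame(frames, patches))
    print(tokens_with_temporal_blocks(frames, patches, block))
```

With these assumed numbers the block-based count is half the frame-by-frame count, and larger blocks would shrink it further at the cost of temporal detail.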