NVIDIA's New AI Model Teaches Robots to Move Like Humans

NVIDIA's latest AI model, Sonic, enables robots to mimic human movements with remarkable precision. Trained on 100 million frames of human motion, the 42-million-parameter neural network can translate video, voice, music, and text into natural robot actions. The system uses a root trajectory spring model to damp sudden commands, so the robot does not injure itself and settles smoothly at its target. Although training took 128 GPUs and 3 days, the final model is light enough to run on a phone. This opens up applications in exploring dangerous areas, disaster rescue, and planetary exploration, and the models are being released openly for public benefit.

NVIDIA's New AI Broke My Brain. | Transcript:

Let's see what is going on here. This is me around 9am. A bit wobbly, steps are unsure, yup, that checks out. Now then, give me my fake badge. Thank you sir. Hehehe, no one noticed. Now let's proceed to the next step of my mastermind plans. Let's eat all their food. Wait, they noticed. Proceed to the next step. What was that? Oh yes, run! Now, jokes aside, look at that. Sign up for this one baby. Oh yes, please mow my lawn. That is excellent. Rake the leaves! Perfect. Hey, don't slack off, that's my job!

Okay, so what is going on here? Let's start with the good news: this is a new teleoperated robot controller, and more. They call it Sonic. Now, the work here is not the robot, but the software controlling it. At least in this footage, watch until the end and you might get surprised. This means there is a human performing these movements, and the robot is able to understand these motions and then translate them into a bunch of joint positions in 3D space. It's kind of insane that this is possible. But it will just get better and better as we continue the video.

So, before you ask, yes, it can do kung fu. Provided that you can do kung fu. It understands whole-body movement, so you can get it to crawl into some space you don't want to go to. And that is super useful, people are already using robots for that. Why? Well, chiefly, for exploring underexplored and dangerous areas. This means tons of useful applications: for instance, a variant of this could help save humans stuck under rubble, or perhaps later, even explore other planets without putting humans at risk.

But that's still nothing. Because this is a multimodal system. Meaning that the input can be almost anything. So, you say that I don't have to pretend to mow the lawn to actually mow the lawn, because where is the fun in that? Well, just tell it to do that. Can you? Well, currently, for simpler tasks, like moving around or behaving like a monkey, yes you can! Absolutely incredible. And I love how expressive it is. You can ask it to walk happily, stealthily, or like an injured person.

And you know, just the fact that it is stable and does not fall is remarkable. Previously, even for simple characters in simulated worlds, you needed thousands and thousands of tries to teach them to just be able to walk without falling. And now, this is a huge leap forward. Wow. But it gets better, we said multimodal. Yup, that means that the input can also be music. I'll show you the dancing, but not the music because of YouTube reasons, but I put a link in the description where you can check it out.

And we haven't even talked about the most insane part of the whole thing. Now hold on to your papers, Fellow Scholars, because this runs with about 42 million parameters. That is a neural network so small that your phone can run it and barely notice. It may even run on your toaster these days. That size is absolutely nothing. This is an incredible achievement. Okay, but how? How is that even possible? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

Well, first, it looked at 100 million frames of human motion to understand what we do and how we do it. The incredible thing is that this system does not require human-made action labels, so we don't have to explain our movements. It just watches the raw motions and figures out how to transition between tasks without any unnatural pauses!

So then, your multimodal input goes in: a video of you, your voice, music, or just text. A motion generator turns these into human motion, the human encoder processes it into a latent space, and then a quantizer converts it to universal tokens. Once again, universal tokens, that is key, you'll see a bit later. Then, the decoder translates these tokens into motor commands.

But there is a big problem. Learning to convert one to the other is super hard. First of all, robots do not work like humans, that is one of the fundamental challenges. So if the user commands the robot to turn around, it should be turning around. Okay, sure. But how fast exactly? You don't want to try to turn 180 degrees too quickly, because you would fall apart. To solve this, in their research paper, they propose what they call a root trajectory spring model. This dampens sudden, quick user commands so the robot does not get injured.

Yes, robots can get injured too, which is kind of hilarious. Now there is an exponential term as a function of time. What is that? That is a physical brake. As time increases, this term rapidly shrinks to 0, which forces the whole mathematical expression to decay smoothly. This serves two goals: one, the robot does not injure itself and two, it will settle at a target position without oscillating back and forth forever. Nice. Now, do the dampening too much, and of course, you'll get a little slug that can't get anything done, so it's really tough to do well. Well done folks.
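To make the "exponential brake" idea concrete, here is a minimal sketch. The transcript does not give the paper's actual formula, so this uses a textbook critically damped spring as a stand-in, and the function name, the stiffness value `OMEGA`, and the 180-degree turn scenario are all illustrative assumptions, not the authors' implementation. It shows the two properties described above: the exponential term shrinks toward 0 over time, and the heading settles at the target without overshooting or oscillating.

```python
import math

def critically_damped_spring(x0, v0, target, omega, t):
    """Position and velocity at time t of a critically damped spring.

    Assumption: a standard textbook spring, not the formula from the paper.
    x0, v0 : starting position and velocity
    target : commanded position the spring settles at
    omega  : stiffness (larger = faster response), a hypothetical tuning knob
    """
    decay = math.exp(-omega * t)       # the "brake": shrinks toward 0 as t grows
    offset = x0 - target               # how far we start from the command
    coeff = v0 + omega * offset
    x = target + (offset + coeff * t) * decay
    v = (v0 - omega * coeff * t) * decay
    return x, v

# Hypothetical scenario: the user abruptly commands a 180-degree turn.
TARGET = math.pi   # commanded heading in radians
OMEGA = 4.0        # assumed stiffness, chosen only for illustration

# Sample the smoothed heading every 0.05 s for 5 s.
trajectory = [
    critically_damped_spring(0.0, 0.0, TARGET, OMEGA, i * 0.05)[0]
    for i in range(101)
]

for i in (0, 10, 20, 40, 100):
    print(f"t={i * 0.05:.2f}s  heading={trajectory[i]:.4f} rad")
```

Raising `OMEGA` makes the robot follow the command more aggressively, and lowering it gives the "little slug" behavior mentioned above, which is the trade-off the narration points to.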

Now, all this took 128 GPUs and 3 days to train. That is expensive. But here's the key: after the training is done, the final product is so lightweight, we don't need this kind of hardware to run it at all. In fact, all of the models showcased in these videos will be given to all of us for free, forever. They run on your phone, easy-peasy. That is incredible. Open research for the benefit of humanity. Love it, thank you so much. This project is led by Professor Zhu and Jim Fan, who I love dearly. Jim started the humanoid robots lab at NVIDIA just 2 years ago, and they are raining research papers on us, breakthrough after breakthrough. Insanity.

And to compress all this human movement knowledge down into a tiny little AI controller that can be used by any of us is simply a stunning achievement. It turns out, training a good AI requires coding good thinking into a machine. But, surprisingly, we ourselves can also learn a lot of good life advice from this kind of thinking too. For instance, the model compresses a messy, diverse soup of inputs into a kind of pure, abstract token. You know, in life, when asking other people for advice, you will inevitably hear everything, and its opposite too. That is also a big soup of inputs. But try to look at all of them, side by side, and you'll find that they often share an underlying truth. This works, as is showcased by this incredible project too.

And note that this work is not the end of anything, this is just a start. An early work in a nascent area. Two more papers down the line, and I really hope this is going to start folding my laundry and cooking my lunch. That would be amazing. What a time to be alive! And this is not some proprietary nonsense, this is open knowledge and open

just dropped. If you are interested in hearing more hopefully soon, subscribe and hit the bell.
