Inside ASUS's Massive Test Bench for Enterprise Data Center Hardware

ASUS showcases its massive test bench for enterprise data center hardware, including liquid-cooled GB300 nodes and validation for next-gen Vera Rubin systems.

The Biggest Test Bench I’ve Ever Seen. | Transcript:

When developing or evaluating new hardware, a test bench is an essential piece of the puzzle. It saves a ton of time just by putting power, cooling, and important tools at your fingertips, which makes it quick and painless to swap the device under test. What it also does is provide a controlled environment so that accurate comparisons can be made to earlier models or competing solutions. But what if the new hardware that you need to evaluate is a giant rack that weighs nearly two tons and sucks back over 100,000 watts of power? Well, then you're going to need a bigger test bench. And that's exactly what ASUS sponsored us here to see. They're going to be showing off the R&D lab where they

are hard at work performing development, maintenance, and long-term torture testing on their enterprise and data center products. We'll be looking at their current setup for Grace Blackwell, including my first look under the hood of a liquid-cooled GB300. And we'll be talking about some of the upgrades that they'll be making to accommodate the next-generation, just-announced Vera Rubin. This R&D lab is purpose-built to test rack-scale products. In a lot of ways, it's kind of like a miniature data center, but at a much smaller scale, and with a bit more flexibility. Here in the center is a traditional air-cooled setup. So, chilled air comes in from the sides, runs through the servers, and the

hot air gets sucked up into the ceiling above us. But because many of the systems here are newer and so power hungry, many of their customers have moved toward liquid cooling, and they've got to be able to validate those, too. So, there's a dedicated CDU, or coolant distribution unit, next door that brings chilled water under the floor and then up anywhere that it's needed. We're going to go look at that in a minute, but first let's look at the kinds of systems they're testing in here. This is a Grace Blackwell GB300 compute node that includes two Nvidia Bianca boards side by side. Each of those boards gets a single 72-core Arm Grace CPU that uses LPCAMM to offer flexible memory configurations alongside

a pair of Blackwell Ultra GPUs that have 288 gigs of HBM3e each and consume up to 1,400 watts each. To improve efficiency and keep wire gauges down a little, Nvidia is using a 54-volt architecture rather than the 12-volt that we use in desktop PCs. And the DC/DC conversion to split out to the rest of the system is done with this power supply right here. For networking, each of our Biancas gets a pair of ConnectX 800 Gbit per second NICs, which contribute alongside the CPUs and GPUs to this thing having a power budget of about 8,000 watts. That is why liquid cooling is a must. Let's take a closer look at this module. Man, this thing is heavy and gorgeous if you're, well, if you're into that sort of

thing. Uh, it's all color-coded, too, so you can see where the cold supply comes in here, then splits off in parallel to go to each of those Blackwell 300 Ultra GPUs. Then it comes together to cool the Grace CPU, and you can actually see the contact pads for all of the LPDDR5X memory that goes around it. That's not necessary on the GPUs because they're using HBM3e stacks, and those are right on the same package. So, those are cooled by this single giant plate for each of them. Another thing you might be wondering about these is, hey, what's up with these little flexible PCBs that kind of look like little antennas? Those are leak sensors. If any water gets on these, it bridges the two sides and

feeds that into ASUS's management software, which we're going to take a look at a little bit later. The last thing this cools are the network cards. Water-cooled networking. That's what we've come to. Of course, nobody buys just one of those nodes. So, the R&D lab has to accommodate rack-scale deployments of them. The total power coming into this room, it's about 1.1 megawatts right now. But, uh, with validation coming up for Vera Rubin, they're going to have to upgrade that. Fun fact, by the way, Vera Rubin is more power hungry and also heavier than GB300, with ASUS estimating that each rack will come in just shy of two tons. Uh, you

guys might need a heavier duty floor. And if they do, that's going to come at their own cost. While Nvidia may at their discretion provide GPUs and CPU chips for development purposes, the responsibility falls to the manufacturer of the rack to procure, design, and build everything else around it. I was really interested in what kind of testing they would do on systems like these. And from talking to ASUS, they say that the exact software differs, but the function is actually surprisingly similar to what we might use to validate a desktop at home. They use a combination of their own software and packages that are provided by Nvidia to artificially load the system, often

running it 24/7 for long periods of time while simultaneously monitoring for everything from temperatures to data errors or, especially, any anomalies in transfer speeds and latency, whether it's from GPU to GPU, GPU to network, or GPU to storage. And this is key because while compute matters a lot in any kind of AI inference, the latest reasoning models are especially sensitive to how quickly you can move data through the system. On the subject of speed, ASUS pointed out several times, actually, that one of the best things about this lab is that it is literally just a few blocks away from their mass production. That helps make it easier to collaborate whenever they need to try to reproduce

an error or roll out a fix. Now, let's roll out and check out the chiller. This is one of the least chill chillers that I've ever seen. Not cuz it's a bad one. ASUS says it's good for about 1.3 megawatts of cooling capacity. It's just not that chill because ASUS takes a dad approach to managing the thermostat. While most data centers use very cold water (Data Center Dynamics says around 6 to 7°C is typical), ASUS targets more like 20°C for the water that is going over to the racks next door. That's not something they actually recommend to customers, but it's good enough for a test bench and apparently saves them about $20,000 a year in energy costs. So, I think I

finally get it, why dad was always so stingy. And besides, it's not like ASUS doesn't still have access to colder water if they need it. The piping in here is all color-coded. So, yellow is the coldish supply to next door. Green is the warm return. Blue is the R-134a refrigerant that chills the water. And then white here is actually a building-wide cold supply that does run at 7°C and handles chilling the air in the cold aisle next door for any air-cooled deployments. Again, remember this setup is all about flexibility. Now, let's talk about endurance. This environmental chamber makes mine look honestly like a toy.

Even my big one, which by the way is still for sale. Just slide into my DMs. Anyway, this mama jama can do up to 100,000 watts of cooling and has a temperature range as low as -40°C and as high as 85°C. Now, nobody would want to put a live server into temperatures that cold or that hot. Not if they want it to stay alive, but then again, sometimes they don't want things to stay alive. In here right now, ASUS has some GB200s that are undergoing long-term aging analysis. That means, well, we'll open that later, putting them under dynamic loads, so sometimes very heavy, sometimes lighter, and wildly varying environmental conditions, as low as minus 20 and as high as 45. Let's see what it's at right now.

Oh god, that's unpleasant. Okay, we're going to have to get out of here pretty quick. It's 110 dB and almost 40°C. Oh, it's so humid. I, uh, I rescued your temperature sensor. You're welcome. What about the second thermal lab? This one behind me is less for long-term aging and more for cooling validation. In there right now is an Nvidia HGX. And what they want to know is, okay, we can see how the fans ramp up at 25°C, but what if it's deployed in a data center in, say, India and they experience some kind of challenge with their cooling? For that, they can turn this thing up as high as 45°C.

How much will the fans ramp up? Will the system be stable? There's only one way to know. Of course, nobody wants to hang out in there during all that testing. So, that's where ASUS's software comes in. AIDC is for design and deployment. So, they've got everything from a planning utility that lets you just kind of plonk servers down in 3D space and calculate your structural, power, and cooling requirements to handy scalable tools that can do operating system deployments, driver management, and firmware updates on everything from boards to NICs to switch trays and even SSDs. It's even got an app store, but instead of, like, iBeer, it's, like, WEKA, you know. Then next to this, they're showing off

ASUS CC, which is more for the long-term management of your deployment. They did a quick demo for me showing how you can track all the vital stats for your site and use this to drill down to, say, the specific machine that logged a given error. You can also create custom dashboards that will report on whatever is important to your organization, like carbon emissions or quality of service thresholds. And then there's a bunch of tools in here for setting rules around software access and managing notifications. Now it's time to manage your attention to another video you might like. How about the one that we did touring Simon Fraser University's latest supercomputer? That was a really great look at what a real-world deployment of this kind of

tech looks like.
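
A quick check on the 54-volt point made earlier in the video: for the same power, a higher bus voltage means less current, and resistive loss in the wiring scales with the square of that current. The sketch below works through the arithmetic for the roughly 8,000-watt node budget quoted above. The 1 milliohm path resistance is an illustrative assumption, not a measured value for any ASUS or Nvidia hardware, and the function names are made up for this example.

```python
# Back-of-envelope comparison of 12 V and 54 V DC distribution.
# NODE_POWER_W comes from the transcript; PATH_RESISTANCE_OHM is an
# assumed, illustrative value and not a measurement of real hardware.

NODE_POWER_W = 8_000.0
PATH_RESISTANCE_OHM = 0.001  # assumption: 1 milliohm of cable/busbar


def current_amps(power_w: float, volts: float) -> float:
    """Current drawn by a DC load: I = P / V."""
    return power_w / volts


def conduction_loss_w(power_w: float, volts: float, resistance_ohm: float) -> float:
    """Resistive loss in the distribution path: I^2 * R.

    Simplification: ignores the voltage drop along the path itself.
    """
    amps = current_amps(power_w, volts)
    return amps * amps * resistance_ohm


if __name__ == "__main__":
    for volts in (12.0, 54.0):
        amps = current_amps(NODE_POWER_W, volts)
        loss = conduction_loss_w(NODE_POWER_W, volts, PATH_RESISTANCE_OHM)
        print(f"{volts:>4.0f} V bus: {amps:7.1f} A, {loss:6.1f} W lost in the path")
```

Whatever resistance is assumed, the ratio between the two losses is fixed at (54 / 12) squared, or about 20 times, which is why the higher voltage allows thinner conductors for the same load.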
