AI Cost Crisis Drives Companies to Smarter Model Routing

Companies are shifting from using only top AI models to routing tasks based on complexity, cutting costs significantly as AI spending grows.

CNBC

The Fix For AI's Spending Problem Is Not Good For OpenAI And Anthropic | Transcript:

Corporate America is discovering a new way to buy AI, and it could change the whole trade. If it's this really, really tough bug that's been stumping your engineers for weeks. Sure. Like put the best intelligence on that, right? But, but for a lot of the boilerplate work that you're doing, what we're seeing is you can get five times better, ten times better cost efficiency by just using models that are still good enough for that task. Employees in the company use about $200 worth of tokens every week, 50 weeks a year. That's 10,000 bucks in tokens. You have 40,000 employees. That's $400 million.
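The tokenomics arithmetic in this clip, which Jeetu Patel walks through in full later in the segment, is easy to check. Below is a short Python sketch that uses only the quoted figures: $200 of tokens per employee per week and 50 working weeks a year. The function and constant names are illustrative.

```python
# Yearly token bill from the figures quoted in the segment: about $200 of
# tokens per employee per week, 50 working weeks a year.

WEEKLY_TOKEN_SPEND_USD = 200  # per employee, as quoted
WORKING_WEEKS = 50            # as quoted; 48 is also floated later


def annual_token_spend(employees: int,
                       weekly_spend: float = WEEKLY_TOKEN_SPEND_USD,
                       weeks: int = WORKING_WEEKS) -> float:
    """Company-wide yearly token spend in USD."""
    per_employee = weekly_spend * weeks  # $10,000 with the quoted defaults
    return per_employee * employees


for headcount in (40_000, 90_000):
    print(f"{headcount:,} employees -> ${annual_token_spend(headcount):,.0f} per year")
```

The conversation later floats 48 working weeks instead of 50. That gives $9,600 per employee, or $864 million for 90,000 employees, which is why "10,000 bucks a year" is described as rough.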

You have 90,000 employees. That's $900 million. There's going to be a concerted effort that the industry takes in saying, what do we need to do to make sure that these models are more efficient in token generation. The AI trade, it's priced on the idea that companies will pay anything for the best model. That assumption, though, it's starting to crack. Here's what the people building and buying this stuff are seeing. The old way of buying AI is pick the best, most powerful model, run everything through it, no matter the task. It's like putting your smartest, highest paid engineer on every job in the building, from solving the hardest bug to resetting passwords or

checking the weather. The new way is to match the job to the model. Hard problem? Send it to the top model. Easy problem or easy task? Send it to a cheaper, faster model. The industry calls this model routing, and it is taking off. One company that sits in the middle of all this, OpenRouter, says it just saw its volume go up fivefold in just six months. As its CEO put it, the era of picking one model is over. So why now? Well, simply, the bills came in. This week, Sam Altman said it's the first time OpenAI customers have complained about costs.

He referred to customers that have been blowing through annual budgets in months. You've heard those stories, and it turns out that for a lot of everyday work, the cheap models are good enough. Run the top model, and it costs around 25 bucks for a batch of output. Run a cheap one, and it costs under a dollar. That is the trade-off companies are making now as costs keep climbing. So routing means better efficiency. Yes, that is great for the companies that are writing the checks, but follow the money. What happens if routing sends all the easy work elsewhere?

Then OpenAI and Anthropic only get paid, or mostly get paid, I should say, for the hard or sensitive stuff, and the entire IPO story for both of them is built on endless demand and premium prices. Now, last week we talked to Matan Greenberg at Factory AI, whose whole business is building that routing layer. Today we've got another company going after the same problem: Cognition, maker of the coding agent Devin, which routes tasks across models automatically. Tasks and agents, I should say. Today they're out with a new announcement about what that AI is actually worth to their customers and to them.
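The segment does not describe how Cognition's router works internally. The following minimal Python sketch illustrates only the general routing idea described above: hard or sensitive work goes to a top model, and easy work goes to a cheaper, faster one. The tier names, the per-task prices (loosely echoing the $25-versus-under-$1 comparison in the segment), the 0-to-1 difficulty score, and the 0.7 threshold are all hypothetical assumptions.

```python
# Hypothetical difficulty-based model router. Tier names, prices, the
# difficulty score, and the threshold are illustrative assumptions; a real
# router also needs some way to estimate difficulty, which is taken as given.

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_task_usd: float


FRONTIER = ModelTier("frontier-model", 25.0)  # hypothetical top-tier model
BUDGET = ModelTier("budget-model", 1.0)       # hypothetical "good enough" model


@dataclass
class Task:
    description: str
    difficulty: float        # 0.0 (trivial) to 1.0 (hardest), assumed given
    sensitive: bool = False  # e.g. security-critical work


def route(task: Task, threshold: float = 0.7) -> ModelTier:
    """Send hard or sensitive tasks to the frontier tier, everything else to the budget tier."""
    if task.sensitive or task.difficulty >= threshold:
        return FRONTIER
    return BUDGET


tasks = [
    Task("bug stumping the team for weeks", 0.95),
    Task("generate CRUD boilerplate", 0.2),
    Task("rotate credentials in auth service", 0.3, sensitive=True),
]
for t in tasks:
    print(f"{t.description!r} -> {route(t).name}")
```

The design choice here is a single threshold plus a sensitivity override. A production router could instead route by task type, as Wu later describes for choosing between two frontier models.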

CEO Scott Wu is here to start us off. And then later, Cisco's Jeetu Patel on what this looks like from inside the enterprise, where these bills actually land, and the trade-offs they're now starting to make. Scott, great to have you in person. Thank you for coming in. Yeah, thanks for having me. I told you before, I've been wanting to talk to you for a long time, ever since you had that demo video with Devin and your inbox blew up. So. Yes, yes. No, super excited to be here. It's been a long time coming. Um, talk to me about what I'm calling this ROI guarantee. You guys are calling it an AI productivity guarantee.

I think it's so interesting because it's kind of like putting your money where your mouth is as a vendor saying if you're not actually getting value, like we're going to pay you for it. Yeah. Explain it first. Yeah. For sure. Well, so, so you're totally right about, you know, seeing that the bills are coming in, spend on AI is getting higher and higher. And I think you see these kind of two polar ends of how folks are reacting to it, right? One is obviously people going crazy with it and using all sorts of use cases and putting tons and tons of spend and the bill going, you know, out of the water. And the other side is folks who are too scared and too hesitant

about using it at all, to even try anything, or who really, really cap themselves on their own budgets, right? What's the right answer in the middle? I think the obvious middle ground that you want to solve for is knowing exactly how much you're getting out of all the tasks that you run, right? If you spend $5 on a task, that's fine, as long as you're getting $20 of output out of it. It's not okay if you're spending $500 to get that $20 of output. And so it's about making sure that you're doing that well. Okay, that's kind of the trade-off now, but I feel like for the last few years, a lot of companies have said, we want to use AI.
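Wu's rule of thumb compares, for each task, what was spent with the value of what came out. Below is a minimal Python sketch of that comparison. It assumes the output value has already been estimated somehow, which the segment does not spell out here, and it uses break-even (a multiple of 1.0) as a hypothetical threshold. A team could demand a higher multiple.

```python
# Minimal per-task ROI check following Wu's example: $5 spent for $20 of
# output is fine, $500 for the same $20 is not. How output value is measured
# is left abstract; the break-even threshold is an assumption.

from dataclasses import dataclass


@dataclass
class TaskRun:
    name: str
    spend_usd: float         # what the model or agent run cost
    output_value_usd: float  # estimated value of the work produced


def roi_multiple(run: TaskRun) -> float:
    """Value returned per dollar spent."""
    if run.spend_usd <= 0:
        raise ValueError("spend must be positive")
    return run.output_value_usd / run.spend_usd


def worth_running(run: TaskRun, min_multiple: float = 1.0) -> bool:
    """True if the task returns at least `min_multiple` dollars per dollar spent."""
    return roi_multiple(run) >= min_multiple


runs = [TaskRun("cheap task", 5, 20), TaskRun("expensive task", 500, 20)]
for r in runs:
    print(r.name, roi_multiple(r), worth_running(r))
```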

We've heard about its promise. So we want to use the most expensive, the most secure. When I talk to a lot of Fortune 500 companies and I say, why don't you use an open source Chinese model? It's cheaper. It's almost as good. They say, oh, we're not going to use a Chinese model, even if it's hosted locally, right? There's just that stigma around it. So why do you think even the enterprises that have yet to jump in are going to want to use model routing, or even consider the cheaper ones? So I think there are a few trends that we're seeing, and one of the important ones to call out is that there are so many good models coming out, right? And so at this point,

you know, there was a time when, if you were trying to run an agent, there were one, maybe two models that you could really use. Now there are dozens and dozens. I mean, there's a new release every other day, you know, and there are all these different models, and you can kind of see their different strengths and weaknesses, right? The other thing I would say is folks are starting to think a lot more about mass automation of tasks. So we crossed this milestone with Devin where, just recently, we got to the point where more than 50% of Devins are actually kicked off

not by a human. They're kicked off by a Devin, or by an event: agents using agents. And obviously, if you're going to do bulk tasks like, hey, let's take a look at every error message that came in, every report that came through, or every ticket, you naturally want to be able to run something that can support that much higher volume, right? Let's talk a little bit more about what you've done this week. One, you're doing agent routing, and that's just the ability to put the best agents on the hard tasks and cheaper, efficient agents on lesser ones.

Also this guarantee. Did you feel like the guarantee, to prove the value, was something you had to do? Was that proactive, or was that because you were hearing from customers that their bills are out of control? It's something that we wanted to get on top of with folks. And I really do think of it as, you know, hopefully the path that brings things forward for both sides. We've definitely seen customers, you know, a lot of folks who are very hesitant, and folks have heard the story of, um, you know, here's how much budget I set and I'm blowing through all of it right now.

I have to be super careful. And obviously the answer is, well, you for sure want to be careful. You want to know what output you're getting. But as long as you're doing tasks that are giving you real output, then you know that you're getting bang for your buck, right? On the other hand, to your point, we also have folks that are just getting started with it, that are figuring things out, and for them, knowing what kind of value and productivity they're getting out of it is super important as well. Now, ROI is notoriously one of the hardest things to measure in the AI world, right? So how do you make sure your customers...

I know you're being conservative and you're giving them the benefit of the doubt, but we saw what happened with tokens and token maxing, right? If you allow someone to game the system, they will absolutely game the system. How do you make sure that customers, and you're offering this to new customers, aren't going to do that with your guarantee, and just say it didn't match the output? No, absolutely. So it's the same as it is with a human software engineer, which is that you've got to think about things in terms of output and productivity rather than

activity, right? And so, measuring tokens, to your point: I mean, you can spend billions of tokens and be doing nothing with it. On the other hand, you can look at how much code is actually getting shipped to production and what issues are being solved. So all of our work is measured as a function of hours of engineering effort, right? It's a measure of how much more capacity you're getting, how much faster your engineers are going, how much more you're able to ship, rather than, oh, here's how many lines of code the agent produced, or here's how many buttons the agent clicked, because who knows what that's going toward.

How do you verify it, then? Say a company says, listen, we didn't get enough output for this. Yeah, yeah. So we have our own model and our own systems. We've trained it on a ton of data from various customer tasks that we've seen across the board, from banks that we work with, health insurers, you know, big enterprises across the board. Um, and all of those map to ours. And so we have a conservative system where, first of all, if the agent doesn't do useful work in that session, it's a zero, right? So if the code that the agent wrote isn't getting used or

something, that's a zero. If it is getting used, we do a measure that understands roughly how much effort that task took, right? And to call it out, you know, there could be a really, really hard bug whose fix is two lines but could save your team days or weeks of work, right? A lot of it is really mapping things to that actual engineering effort. And so, for example, Mercedes-Benz, who's been using quite a bit of Devin, saw a lot of tasks, such as migrations, that were forecast to take eight months getting done in eight days with the help of Devin. And that's an incredible, incredible ROI.
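Cognition's scoring system is proprietary, so the following is only a hypothetical Python sketch of the two rules Wu states: a session earns zero credit if its work isn't used, and otherwise earns credit for the estimated engineering effort the task represents, not for lines of code or tokens. The field names and the effort estimates are assumptions. In Wu's description, the effort estimate would come from a trained model, which is not reproduced here.

```python
# Hypothetical sketch of effort-based, conservative session scoring.
# Field names and effort figures are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class AgentSession:
    session_id: str
    code_used: bool                # did the agent's work actually get used?
    estimated_effort_hours: float  # effort a human would need (assumed input)
    lines_changed: int             # recorded, but deliberately ignored


def session_credit_hours(s: AgentSession) -> float:
    """Engineering hours credited to one session under the conservative rule."""
    if not s.code_used:
        return 0.0                     # unused work counts as zero
    return s.estimated_effort_hours    # effort-based, independent of diff size


def total_credit_hours(sessions: list[AgentSession]) -> float:
    """Sum of credited hours across sessions."""
    return sum(session_credit_hours(s) for s in sessions)


sessions = [
    AgentSession("hard-bug", True, 40.0, 2),      # two-line fix, days of effort
    AgentSession("boilerplate", True, 3.0, 400),
    AgentSession("dead-end", False, 10.0, 250),   # never used, so zero credit
]
print(total_credit_hours(sessions))
```

In this toy data, the two-line bug fix earns far more credit than the 400-line boilerplate change. That matches Wu's point about measuring effort rather than activity.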

Right? As long as you're getting that, of course you should be using AI for it. You should be doing these things. On the other hand, with all these different sessions that are going nowhere, you want to be doing less of that. And so it's just helping folks see where they are and aren't getting that. There is this idea that a lot of folks in the industry tell me, that you have to go through that period of waste to get your employee base comfortable with using AI. Do you think that period is over? So I think it really depends on different folks. And I think a lot of folks at this point, you know, we're at that inflection where people are at

different points in their AI journey. I think, to your point, it's one thing as a human to say, you know, I'm just running tasks myself: all right, here are like three or four things. Let me see if it can do this. Let me see if it can do that. And you're totally right that, you know, at this point, there are a lot of things that folks don't even know AI can actually do now, right? On the other hand, it's a different thing to be setting up, you know, a spinner that runs 50 tasks. And so that's more of what we're trying to avoid, rather than the one-off. To climb a leaderboard.

Exactly. Which also brings us to this idea of model routing, right? I mean, if you have people experimenting and they're not doing critical workloads, they can use a cheaper model. What would you say to a non-tech company that is seeing these costs blow up, or maybe even seeing it in their margins already? Arvind from Glean had this amazing stat when I spoke to him last week: 95% of enterprises he sees are still on the frontier models. Yeah. What would you tell these companies about model routing? How would you explain it to them, and why should they be using it?

Yeah, for sure. So, one of the things that we talk about in the industry is what folks sometimes refer to as use case saturation, right? And here's one way to put it. If you ask an AI model, hey, who was the third president? Every single model will tell you that it was Thomas Jefferson, right? It's not a hard question. There is an answer, and there's a right answer. And I think for a long time, some of these more agentic tasks, like hey, build me this whole website, fix this entire bug, do this migration or version upgrade, were things people thought of as, okay, this is something that only the top one or two models in the industry can do.

And now what we're seeing is obviously there are dozens of models that are able to go and do that. And naturally, that means there's a lot of room to optimize on that price performance curve, right? And so I think as we see these models get better and better, of course, you know, if it's this crazy architectural issue or this like really, really tough bug that's been stumping your engineers for weeks. Sure. Like put the best intelligence on that. Right. But, but for a lot of the boilerplate work that you're doing, what we're seeing is you can get five times better, ten times better cost efficiency by just using models that are still good enough for that task.
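How much cost efficiency routing buys depends on how much work can go to the cheaper model. A back-of-the-envelope Python sketch follows, using the rough per-batch prices quoted at the top of the segment (about $25 on a top model, under $1 on a cheap one, with $1 taken as an upper bound). The routing shares are hypothetical, not reported data.

```python
# Back-of-the-envelope blended cost when part of the work is routed to a
# cheaper model. Prices are the rough per-batch figures quoted in the
# segment; the routing shares are illustrative assumptions, not reported data.

FRONTIER_COST = 25.0  # USD per batch on a top model ("around 25 bucks")
CHEAP_COST = 1.0      # USD per batch on a cheap model ("under a dollar", upper bound)


def blended_cost(share_to_cheap: float) -> float:
    """Average cost per batch if `share_to_cheap` of batches use the cheap model."""
    if not 0.0 <= share_to_cheap <= 1.0:
        raise ValueError("share_to_cheap must be between 0 and 1")
    return (1 - share_to_cheap) * FRONTIER_COST + share_to_cheap * CHEAP_COST


def efficiency_gain(share_to_cheap: float) -> float:
    """How many times cheaper routing is than sending everything to the top model."""
    return FRONTIER_COST / blended_cost(share_to_cheap)


for share in (0.0, 0.5, 0.8, 0.9):
    print(f"{share:.0%} routed -> ${blended_cost(share):.2f} per batch, "
          f"{efficiency_gain(share):.1f}x cheaper")
```

Under these assumptions, a 5x gain requires routing about 83% of batches to the cheap model (25 - 24s = 5 gives s = 20/24), and a 10x gain requires about 94% (s = 0.9375). The cheap model is quoted at under a dollar, so these shares are conservative.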

How hard is it to switch? Let's say there's a company that's been, you know, signed up with Anthropic and using Claude for the last few years. What does that actually look like in practice? How do they do it? Do they go to a Cognition or Factory AI or some other third party? Yeah. Well, to your point, that's a lot of why we think it's so important for us to continue to be independent and to continue to provide this. You know, we work very closely with Anthropic and

OpenAI and Google and xAI, and all these independent providers as well, right? And to your point, what you see with folks is that obviously it takes time to learn a whole new interface. It takes time to learn a whole new product experience, and so on. But the ideal is that the model routing and switching can be almost invisible. Kind of commoditized, though, right, if you have it in the background? But for a finance department, how difficult is it to switch from Anthropic to something that gives you that optionality?

Yeah, no, it's a great question. And I think it's why more and more folks, in their commits, are starting to think about, you know, okay, what do I want to bet on, and what do I want to commit to? And obviously, there are so many of these different models that, to the extent you can make a neutral commit that gives you access to all the different providers, that just gives you a lot more optionality. Now, whenever I talk about DeepSeek, there are a few people in the audience who freak out, because there's this idea that DeepSeek isn't secure, even though you can host it. But still, I've actually heard legitimate concerns.

Right. You could say some word that activates something in the model, even if you're hosting it locally. So what do you say to those same enterprises that have been resistant to Chinese open source models? Yeah, for sure. So, first of all, I think there are a lot of really great American open source models. There are a lot of good teams working on that and pushing that frontier. And so even if you only want to use American models, you can, and you can still have that mix of price performance. Um,

you know, to your point, what I'd say is that, with all of these things, it's really about figuring out how to secure them. Models can make mistakes. That's true for American or Chinese models, frankly, right? And so the way that we think about it is, how do you have them work with all the same systems that a human works with? A human engineer could also make a mistake. And what do you do? Well, obviously you have a code review process. You have a QA process. You have, you know, a beta deployment, right? All of these steps are intentionally meant to make sure that if there are bugs or errors, you catch them

early, rather than having a single engineer dictate what gets shipped to production, right? It's the same thing here. I think it's very important, obviously, to be able to run the models on American soil, which you can do, and that already, I think, relieves a lot of the anxiety. But then it's just making sure you have the same guardrails and the same security guidelines as you would for any other employee. It really is an AI employee. And something that gets kind of lost in the conversation is that all the hyperscalers are hosting a lot of the Chinese models too.

Just, again, that idea of optionality. I'd also say that margins are a powerful motivator, right? If you're going to blow through your budget in a few months, maybe you're going to give DeepSeek or Kimi another look. Are you seeing that happen already? So we're definitely seeing the mix of things start to shift. And, you know, there are pressures on both sides. The big labs continue to release more powerful but also more expensive models, right? And on the other end, the cheap models continue to get better and better. And I think, for sure, what you might call

that price frontier curve, you know, it's not just a single point that you use for everything. It's several points, and people start to mix and match between them for different tasks. Yeah. So there's always going to be, I believe this, a place for the frontier models. There's going to be critically sensitive work, nationally strategic work, where you're going to want to use the frontier models. But with model routing, they're going to grow, I get that, but do you think that because there are so many more options, and it's becoming commoditized, that will hurt them?

Yeah, well, I'm happy to go on the record and say I think the frontier labs will do very well, actually. And I think... Well, their customers too. Yeah, yeah, yeah. So I think a few things. As always, it really comes down to what you're getting and what bang for your buck you're getting. And it is for sure the case that you can think of it as a spectrum of difficulty, right? On one end, you have the super easy tasks that anything can do: a basic one-liner, a spell check or autocomplete or whatever, right? On the other end, you have these really,

really hard tasks. And of course, I think you see that commoditization creeping in, where more and more of these tasks can get done for cheaper and cheaper. But it's worth saying that those hardest tasks are also worth a lot, right? If you can do the actual hardest tasks, and we'll see that frontier of what counts as the hardest task continue to move, that continues to be worth a lot. Yes. But they've been priced like they're taking every task. And, you know, to Arvind's point, 95% of enterprises are using them for hard and easy tasks: to check the weather, to spin up a dashboard. So what does that mean for them?

No, I think... Are they, like, did they complain about you offering all the different models too? Because that hurts them, ultimately. So for us, we work with all of them as a neutral party. And if anything, because we have all these real-world use cases of, okay, here are the actual tasks that big engineering orgs around the world are working on, we can give them the results and the evals on how their models are doing on every single one of these tasks. We actually work very closely with them and have a very nice relationship in that sense.

I think, no, I mean, to your point, for sure that's the case. But I think we still have such a long way to go with AI. And the question I ask myself is this. We talk about AI all the time because we're here in San Francisco, right? But how many of the folks out there in the world who could be using agents to do their work are actually doing it? I mean, the answer might be less than 1%, right? And so that alone means there's just so much room to grow. So, like Jevons paradox, the pie is just going to keep growing and growing for everyone, I guess, which is something you hear often here. You said

something a few minutes ago that I want to come back to. Yeah. You said that's why it's so important for us to remain independent. Yeah. It is interesting. I mean, Cognition is sort of one of these labs that has, you know, a great reputation, and we're seeing some of the other ones get taken out. Cursor, most notably, has this agreement with xAI, and a lot of folks think it's going to be fully acquired. Why is it important for you guys to stay independent?

Yeah. No, I mean, for us, frankly, we think that being independent is actually the best way to be value-aligned with our own customers. We want to be able to point people to where they're spending more than they need to and help them cut that down, or to where they could be using a different model and help them use it, right? And as a real transformation partner, when you start to talk about how you're going to 10x the capacity of a whole org, how you're going to make sure every engineer can build 10x more and do 10x more, and then make sure you get the actual return from that,

I think to some extent you need to be neutral, right? That's how we've always thought about things, and that's where we always want to be. Have you fielded offers from labs? We've heard some, obviously, but look, honestly, it's not really been interesting to us, because from our perspective, there's so much more for us to do as an independent. And that probably says a lot about where you think this is going, right? The idea that there's got to be a bunch of models to choose from. Yeah. And it's interesting, because it feels like there were a

few companies at the beginning of this modern AI era, like Perplexity, that basically did model routing from the very beginning. That was a feature. Yeah. Will we see more companies do this? Is that a trend that's already growing? I think we're going to see a lot more folks do model routing. And I'm hoping that we're going to see a lot more folks thinking about these kinds of productivity guarantees as well. I think it's good for both sides, honestly, because folks want to be able to commit to AI, they want to be able to use AI, and it shouldn't be on them to bear the risk of what happens if AI is not as good as they thought it

was, or something like that, right? And for us at the application layer to be able to provide that is just a win for everybody. Okay, last few questions, a bit of a lightning round, but I like to do this, especially with people who are model-agnostic. Yeah. What model are you using personally for your hardest coding tasks? Oh gosh. Um, you know, I generally defer to Devin.

I have to say. So Devin chooses. Devin will choose for me. What's Devin choosing? Often, for the absolute hardest tasks, right now it's pretty much a 50/50 split, I would say, between GPT-5.5 and Opus 4.8. Okay. And what you'll see is that for some of these reasoning tasks, if you're figuring out a bug or something like that, you want to use GPT, because that's what it's typically better at. If it's basically navigating a bunch of flows, doing its own QA and testing, we typically see that Opus is better. What was that ratio a few months ago?

It's a good question. I think a few months ago it was much more in favor of the Anthropic models, I would say. And we've seen that shift a lot. Interesting. Okay. Your everyday tasks? I don't know if using it to check the weather is a ridiculous proposition, but your everyday tasks. Well, I'm like you, I think: I have all the different apps, and I'm subscribed to everything. So I'll pull up all the apps on my phone and just ask all of them. And I kind of like to play them off of each other. And

you know, you'll see their thinking change. If I tell GPT, hey, by the way, here's Opus's output, can you take a look and see what you think about that, you'll see the reasoning say, hey, this is actually one of my competitors, I've got to really look into this and see if there's something wrong here. And so it's kind of a fun thing to do. You're distrustful like me. I'm always like, what? I have to check. What do they all say? Are those the main ones here?

Do you use Meta AI? So I'm typically switching between Gemini, Claude and GPT. Those are the three. Yeah, yeah. Okay. The last one I can't remember now. Usually I ask a last question on model usage. I guess not; I think that's it. But you know what, Scott, it was so great talking to you. We covered a lot of ground. I know that you've got to run. Please come back again soon. Yeah. And a reminder to our audience about that $10 million guarantee. Yeah. That's interesting. Who have you heard from after announcing that?

Oh. So, I mean, we just put it out this morning, and we've gotten a ton of inbound already from that. New customers too? Is that a bit of a sales... Both sides, new and existing customers, have been interested to talk about it and see what they can do. So super excited about this. Okay. Well, thank you so much, Scott. Talk to you again soon. Thanks for coming in. Awesome. So that is the view from a company doing the routing.

Now I want to get the view from inside the enterprise, where these bills are actually landing. What are the enterprises that are actually paying them, and seeing costs climb, saying? My next guest runs product for one of the biggest infrastructure companies in the world, which means he sees how thousands of companies are actually buying and deploying AI. Right now, Cisco President and Chief Product Officer Jeetu Patel, fresh from Cisco Live, where the whole pitch was building the infrastructure for this agentic AI wave. Thank you for making the trip here, by the way.

Jeetu, welcome back. I think you're our first repeat guest, because you're so good. Wow. I'm so honored. And I actually just snuck in. I know. Could you tell I was reading a little bit slower to give you some extra time to come in? But you know what? This is YouTube, so we can play it. And PS, we're getting some great questions, so do keep them coming in, audience out there. Tell me about Cisco Live. I mean, it looked great from the outside. How was it on the ground? Cisco Live. You know, we have this every year in Las Vegas. 20,000 of our closest customers come there, and we share with them the roadmap for

what's going on in the future. And this was the first time that we were able to actually pull the entire platform together, from the silicon and optics all the way to the agents. And it was magical. There was so much buzz. So one of the big things people are concerned about is the fact that there's a constraint on infrastructure: I don't have enough infrastructure to meet the needs of AI. That's one. They're also really concerned about trust: can I trust delegation to these agents in an effective way, so they can get my job done?

Well, do your microphone, if you don't mind. Sorry. No, all good. That's why we do it. So, infrastructure, trust. And then the third thing that they're really concerned about right now is tokenomics. You know, is the pricing going to be something that... There was a great example one of our customers gave. They said, hey, just imagine that employees in the company use about $200 worth of tokens every week, 50 weeks a year. That's 10,000 bucks in tokens. You have 40,000 employees. That's $400 million. You have 90,000 employees. That's $900 million. That wasn't really the starting price. Can you say that again?

Slowly. $200 per week. Okay. That doesn't seem crazy, right? Trying to think of my open core bet. You probably use more than that, Deirdre. You'd be very expensive. It gets shut off and I'm like, ah, I've got to re-up it. Okay, so $200 a week, and then you multiply that by how many, 50 weeks? Let's say you take two weeks off a year; maybe you take four weeks. It's 48 weeks, roughly. It's 10,000 bucks a year. And right.

Per employee. And that costs... We have 90,000 employees. Are you seeing it within Cisco? We're seeing it everywhere. The token usage is high, right? We actually started with engineering, and we had budgeted for that. We said we're going to make sure that we plan this out, because we knew it would start skyrocketing. Did you budget enough? No, never. Nobody did. Okay. And by the way, that's a good thing, because it means people are using it. And when I came here the last time, we talked about how there are three phases. You

have to get familiar. Yes. Then you have to get good. Then you have to get efficient. I just used your wording with Scott. I knew I got it from someone. That was from you? Yes. There has to be a certain amount of wastage to get your staff comfortable with it, which is really important if you want to succeed. Okay. But was it a conversation at all last year at Cisco Live? Was anyone talking about token economics or pricing? Token economics wasn't even a thing. No, it wasn't a thing then, because at the time, the question was, hey, can we get adoption? But what really changed the equation was agents,

right? Because agents are going to... And what's going to exacerbate it even more is when you have this recursive self-learning loop where these models start self-improving. Think about it: all day long, you're going to run evals, and then you're going to need to distill the model at night. That's a lot of token movement happening all the time, around the clock. So for us as a business, it's great, because we actually have an offering for tokenomics all the way from the utilization of the GPU to the agent behavior.

We can look at everything and monitor it, and if you see an agent starting to go awry, you can kill it. We've actually now got a tool for that. Oh, interesting. We needed the podcast mics. Yes. That's okay. And then the second thing we also need to focus on, beyond just tokenomics, is how people are going to keep adjusting to the infrastructure shortages, because we are going through a networking supercycle right now. And by the way, a really interesting dynamic that has only developed in the past few days is that you're not just

going to have the largest models in the largest data centers. One of the big things happening is that AI is starting to come closer to you, because there's this new category of desk-side computing. Wait, I want to get to that. Hold on, we're going to get to that, because it's fascinating. You texted me, and I said we have to talk about this, but first I want to get to tokenomics. You said that even at Cisco, you guys blew through your budget. I would say we were way over budget, and we've had to adjust for it. But the thing is, we have 30,000 engineers, and we are now building products that are entirely written with AI.

Right, you were saying. Exactly. So how did you adjust? We've had to deprioritize costs elsewhere and put that money into tokens. You don't see us doing billboards on 101; we'd rather build product. Last week we had Arvind from Glean, and he brought up this interesting trade-off: tokens versus humans. He's saying a lot of companies are having to make way for token usage, or decide between tokens and humans. I know Cisco has done layoffs. How do you think about that?

The way we've done layoffs, and this is an important thing, is that our restructuring has not been about making way for tokens. Our restructuring has been about making sure we put dollars toward the most important places. And sometimes we might rehire people who were part of the restructuring in the areas we're talking about. Silicon is a big area of investment for us, so we're moving resources into silicon. Optics is another massive area of investment. So ours is much more a movement of dollars from one area to another. We are also going through a very interesting time right now, because we are in the midst of a networking supercycle,

where the demand for bandwidth keeps growing. These agents are actually far more consumptive of network bandwidth than humans. We had a stat that was really interesting: for an agent to conduct the same task as a human, it takes 450% more network bandwidth. Wow. Think about that. So it's more expensive. From a network bandwidth perspective, you're going to need to accommodate an agent that is far more consumptive than a human was. And there are going to be trillions of agents; you're going to have agents everywhere. So you will need network capacity that's proportionally adjusted to the demand signal.
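To make the 450% figure concrete: read literally, "450% more" means an agent needs 5.5 times the bandwidth a human uses for the same task, not 4.5 times. The sketch below is a back-of-the-envelope capacity estimate; only the 450% figure comes from the interview, while the 2 Mbit/s per-user baseline and the headcounts are made-up inputs.

```python
AGENT_EXTRA_PCT = 450  # "450% more network bandwidth" than a human, per the interview


def agent_bandwidth(human_mbps: float, extra_pct: float = AGENT_EXTRA_PCT) -> float:
    """Bandwidth an agent needs for the same task a human does.

    "X% more" is the baseline plus X% of it, so 450% more is 5.5x.
    """
    return human_mbps * (1 + extra_pct / 100)


def required_capacity(human_mbps: float, humans: int, agents: int) -> float:
    """Aggregate demand for a site with a mix of human users and agents."""
    return humans * human_mbps + agents * agent_bandwidth(human_mbps)


if __name__ == "__main__":
    baseline = 2.0  # hypothetical Mbit/s per human user
    print(agent_bandwidth(baseline))                            # 11.0
    print(required_capacity(baseline, humans=100, agents=0))    # 200.0
    print(required_capacity(baseline, humans=100, agents=100))  # 1300.0
```

In this made-up example, giving each of 100 users one agent raises the site's demand from 200 to 1,300 Mbit/s, a 6.5x increase.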

So you guys at Cisco are finding ways to monitor waste and become more efficient in your own token usage. What about model routing? I feel like this week everyone's talking about it, and it makes a lot of sense. You can just see the numbers: it might cost you $25 for an output on Claude Opus and under a dollar for the same output on DeepSeek. Are you guys utilizing that? Absolutely. We actually have three models ourselves. We have a Deep Network Model, we have a foundation security model that we've open-sourced, and we've got a time series model.

We also just got another model for observability, and we demoed on stage that if a certain task costs $0.12 in tokens, you could take out 95% of that cost by going with a local, small model. But you have to have that intelligent routing layer that says: for these prompts and these tasks, I'm going to go over here, and for these other tasks, I can probably go to something cheaper. Talk to me about that intelligent model routing layer. Did I say that right? Yeah. I know companies like Factory AI do it, and Scott from Cognition was just talking about it. Do you need a third party to do that?
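The routing layer described here can be sketched as a function that sends each task to the cheapest model judged capable of it. In this minimal Python sketch, the model names, per-task prices, and keyword-based difficulty heuristic are all hypothetical placeholders; production routers typically use a trained classifier or evaluation data rather than keywords. The prices are chosen so the small tier costs 5% of the frontier tier, matching the roughly 95% saving quoted for the observability demo ($0.12 per task down to about $0.006).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str             # hypothetical model identifier
    cost_per_task: float  # hypothetical average USD per task
    max_difficulty: int   # hardest difficulty tier this model is trusted with


# Cheapest first. Prices are illustrative placeholders, not vendor pricing.
MODELS = [
    Model("local-small", cost_per_task=0.006, max_difficulty=1),
    Model("frontier", cost_per_task=0.12, max_difficulty=3),
]

# Keyword heuristic standing in for a real complexity classifier (assumption).
HARD_HINTS = ("root cause", "race condition", "architecture", "stumping")


def difficulty(prompt: str) -> int:
    """Score a task: 1 = routine boilerplate, 3 = hard."""
    text = prompt.lower()
    return 3 if any(hint in text for hint in HARD_HINTS) else 1


def route(prompt: str) -> Model:
    """Return the cheapest model whose tier covers the task's difficulty."""
    need = difficulty(prompt)
    return next(m for m in MODELS if m.max_difficulty >= need)


def savings(baseline: Model, chosen: Model) -> float:
    """Fractional cost reduction versus always sending work to `baseline`."""
    return 1 - chosen.cost_per_task / baseline.cost_per_task


if __name__ == "__main__":
    tasks = (
        "Reset the password for user jdoe",
        "Find the root cause of a bug that's been stumping us for weeks",
    )
    for task in tasks:
        print(f"{route(task).name:12} <- {task}")
    print(f"Saving vs. frontier-only: {savings(MODELS[1], MODELS[0]):.0%}")
```

Keeping the list ordered cheapest-first turns "cheapest capable model" into a first-match search, so adding a mid-priced tier is a one-line change.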

Or can you actually install it through your networking equipment? It depends on the use case. I think a lot of companies will look to third parties and say, give me an intelligent model routing layer, and a lot of those companies are getting pretty well funded recently. For us, what we're doing with cybersecurity is so specialized that we have to do it ourselves. We're also a tech company, and we know this stuff well. In my mind, it's a core piece of IP that differentiates you, because if the ability to efficiently and economically generate tokens is a force multiplier for your business, it's a strategic moat.

So for us, that's a strategic moat. We will build the intelligence. You guys are unique in that sense, but are you seeing from your customers, from others in the enterprise space, that they're becoming more comfortable with model routing? That they're going from using OpenAI or Anthropic models for much or all of their work to choosing smaller, more efficient ones? Yeah. It's not only that it's cheaper in some cases; it also might be better in some cases. And the combination always makes a lot of sense. So I feel like it's a foregone conclusion that

the intelligent routing layer will be very prominent in architectures going forward. Right. What does that mean for the premium models? I mean, there's always going to be a place for them, right? But are they going to grow as quickly as we've seen over the last few years? I think Jevons paradox is going to be supercharged right now. The biggest risk we have as an industry is if this gets too expensive, where the cost of tokens is disproportionately higher than the value it generates; then people pull back.

So actually, the cost per token going down is better for the model providers, because when the cost goes down, people use it more. But your value is actually commensurate with what you're paying per token. Let me push back on that point: we actually don't know if the value is commensurate. That's what I'm saying. It's not, right now. Right. So if it's not, and the price goes down, that's actually better, because then people use it more.
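Jevons paradox, as invoked here, is the observation that when a resource gets cheaper, consumption can rise enough that total spending goes up rather than down. One textbook way to illustrate it is a constant-elasticity demand curve, where quantity demanded scales as price raised to the power of minus the elasticity. The elasticity values below are assumptions chosen for illustration, not measurements of token demand: when elasticity is above 1, a price cut raises total spend; below 1, it lowers it.

```python
def demand(price: float, elasticity: float, scale: float = 1.0) -> float:
    """Constant-elasticity demand: quantity = scale * price ** -elasticity."""
    return scale * price ** -elasticity


def total_spend(price: float, elasticity: float, scale: float = 1.0) -> float:
    """Total spending is price times quantity demanded."""
    return price * demand(price, elasticity, scale)


def spend_ratio_after_price_cut(cut: float, elasticity: float) -> float:
    """New total spend divided by old total spend after prices fall by fraction `cut`."""
    old_price, new_price = 1.0, 1.0 - cut
    return total_spend(new_price, elasticity) / total_spend(old_price, elasticity)


if __name__ == "__main__":
    # Halve the price per token under three assumed elasticities.
    for e in (0.5, 1.0, 1.5):
        print(f"elasticity {e}: spend x{spend_ratio_after_price_cut(0.5, e):.2f}")
```

Halving the price leaves total spend about 0.71x, 1.00x, and 1.41x of its old level for elasticities of 0.5, 1.0, and 1.5 respectively, which is the "people use it more" effect in its simplest form.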

Is that a problem, though? Do you think some of these questions over return on investment and value will hit demand? Scott was just here basically saying that if you don't get enough value from our product, we're going to pay $10 million for your usage. It feels like everyone is scrambling to show that they're providing value. And this question is not new, right? It hit the public AI trade before, though the trade has been off to the races recently. But some of that skepticism is

creeping back in. I think there's a natural governor in this, because what do AI companies, and all of us, want to do? We want to keep showing growth. How is growth going to show up? In a persistent demand signal, where people keep coming back to you. Why do they keep coming back? Because you show them value. If you don't show them value, they're not going to keep coming back. So in my mind, you don't need to be giving someone $10 billion or $10 million to make up for not showing value. Just show value, and it will happen.

Do you think anyone cracks down on their budgets and says, I'm not going to use this as much? I mean, look at what Uber did, right? Gave everyone... I think you will see creative solutions where the price increase on tokens isn't commensurate with the value being generated. That's going to happen on an ongoing basis. And by the way, that is the reason there's this whole new class of computing starting to get created, which I feel is going to...

Let's go there now. And you know what? Someone in our audience, Jeremy, asked about this in a really smart way, and this is what you were talking about earlier with desk-side computing, is that right? He says companies are eventually going to shift to running local models, because prices are getting too high for Fortune 500 companies to swallow. We are already starting to see the shift to this idea.

Yeah. It makes a lot of sense, doesn't it? Break this down for me, though. I have agents, and I want to use these agents. I'm going to have a Mac mini right next to me. That's my desk-side computer. My laptop is where I'm working; my desk-side computer, the Mac mini, is where my agents are working. That means I'm now doing a lot of processing locally.

I might have models sitting on that Mac mini that the agents use, because they're getting smaller and smaller. So if I can get a substantial percentage of the workload onto those, boy, that's a great thing. What that does, though, is require a lot more network bandwidth, because those agents are going to generate traffic that has to go back and forth to the data center. These agents aren't just talking amongst themselves on the Mac mini; they'll also talk to someone in the cloud. So at this point, an agent workflow is a routing challenge, a trust decision, and a telemetry event.

The telemetry event being: I need to show the traceability of where that agent is going and what it's doing. All those things together make this a coordination of intelligence. It's not just that you're going to have one large model in one large data center answering all your questions. I think you're going to have this massive coordination, because you're going to have models on your phone, models on your desk side, models at the edge, and you're going to have

models in the data center, and they're all going to work in concert with one another. So what I was going to ask you is, what does that mean for the infrastructure buildout, if everyone's running their own models on their own machines? But I think I know what you're going to say: Jevons paradox. Yeah. And by the way, everything's just going to grow. You're going to see growth in very different areas. For example, let's take someone like Nvidia.

What does it mean for Nvidia? They're selling a lot of GPUs in data centers. Well, now they're going to sell a lot of desk-side computers. What does it mean for someone like Cisco? Well, Cisco's campus and branch business historically grew at about the rate of inflation, roughly 3% a year. We grew 25% last quarter. Why is that? Because offices are starting to need a lot more bandwidth, since agents are actually working on the desk side.

So I think what you'll see is the cost per token will go down, usage will go up, and the traffic of agents will be distributed all over. You won't have token generation just in the data center. You guys did something really interesting at Cisco Live. We missed you, by the way. I know. Wish I could have been there; next year, maybe. This idea, and this is kind of happening under the surface, but there are similar announcements, like Robinhood letting agents trade.

You guys are letting agents take more of the actions. Are we at a point where we can trust agents to do all of that? I mean, those are different tasks; those were two very different ones. I know you're making sure it's secure, and security is part of your business, but what do you make of this whole trend? Are we moving too quickly? No, we're not moving too quickly. In fact, what's important is that we start moving, because if you don't start moving, you don't get the data. So here's what needs to happen. There is a technological

issue, which is: are these agents behaving the way you want them to behave, so that you can delegate a task to the agent completely, 100% autonomously? That's number one. Number two: do you, as a human, trust the agent enough to delegate to it? That's an issue of psychology, and people right now don't, not universally. So what we've done is say you can delegate a task, and there will be a human in the loop at checkpoints. But at any point, when you get enough evidence that the agent is doing well, you can make it fully autonomous. And that decision should be made individually by every network operations person or security operations person. You don't need to drive that for them. It's a decision they can make based on the class of the task, the class of the workflow, and then they decide.

So, back to model routing for a moment. Was that a big point of discussion at Cisco Live? Yes, because we were presenting on it. But were your customers already talking about it, asking about it? They were, because they're concerned about tokens and the cost of tokens. They already know this is going to be an issue, that the infrastructure is going to be expensive. And so they're

like, hey, what do we need to do? And we said, we need to make it meaningfully cheaper for you. These are enterprise customers coming to us for the infrastructure, saying, we want to use AI, but is it going to be too expensive? And we're like, well, no, because if you want to observe every action that's going on, we can make it much, much more efficient for you with this intelligent routing layer. We've also got a big AI research team now focused on building these small models. That's an area that, Deirdre, we've invested in over the past year, year and a half in a big way, because we knew this would eventually be something people would want.

I know, and I've talked to you in the past about this, that Cisco is hesitant. You don't use the Chinese models, because you have a security business, a networking business. What about customers, though? Are you offering those lightweight, cheaper, but very capable models to them? We're not offering Chinese models to our customers, but I think...

Even locally hosted? No, we're not doing that right now, because a lot of our customers are in regulated industries and are very security conscious. They're like, hey, we want you, Cisco, to build it. So when we have an observability product, they ask, what model do you have? And with our deep networking product: what models can we provide that make it easier for them to get complicated questions about their infrastructure answered by a model? We've built those over the past year and a half, two years, and they're working really well. In fact, some of our 8-billion-parameter models that we've pre-trained are performing better, in some cases on some benchmarks, than a 120-billion-parameter model.

So that's exciting to see. Are they competitive with the open-source ones that are really high on the benchmarks? The models we build are special-purpose models for specific tasks, like a security model. So do you leave something on the table, then? If model routing is going to touch presumably almost every company in the world that's using AI, do you leave something on the table by not offering those kinds of solutions, by not offering models for every generic task?

Yeah, the really cheap ones that are good enough. You know, the way I think about this is that we have to be super focused on where we can be the world's best. We can be the world's best at networking, the world's best at security, the world's best at observability, and the world's best on the data platform. That's what we're going to do, and everything we do is in service of that, rather than getting defocused. Okay, good answer. I remember I had three questions in the lightning round I just did with Scott, but I forgot the third one. Okay. For your hardest coding tasks right now, what model are you using personally?

I think right now in the company we're seeing that GPT-5 is great, and Opus 4.8 is doing pretty well too. Those are the two that tend to be the front-runners. I'd like to see more competition, but these are the ones that come out on top. Has that changed over the last few months? I mean, GPT-5 just came out, but I guess even in Codex, versus before GPT-5? Opus was better, but I think GPT-5 has really caught up quite a bit. So I think you're seeing neck-and-neck competition between those two.

And it's actually great, because we have a great partnership with OpenAI, and we have a great partnership with Anthropic. They both do a great job of having forward-deployed engineers in companies like ours. Similar to what Scott said. Actually, on that note, I'm curious. Did he find the same? He did say the same. He said it's about 50/50 now; he uses both, and a few months ago it was in Claude's favor. We're seeing the same. Do you think they're going to have to start charging more for usage?

It feels like this era of subsidy is changing a little bit. You saw Anthropic sort of clamp down a little and then get all of that Colossus capacity. OpenAI still seems to be upping its limits and so on. Does that change? I think they will have to get more efficient, rather than charge more for the models. And by the way, you will always have larger models. Mythos, for example, is a 10-trillion-parameter model; you're going to charge more for that. But I do feel like there's going to be a concerted effort across the industry to ask, what do we need to do to make these models more efficient in token generation?

Right. So what model are you using for your everyday tasks? I use three, and I actually put them against each other, having one critique another. I have ChatGPT with GPT-5, I use Opus 4.8 from Anthropic, and I use Gemini, for different tasks on a daily basis. For current events, Gemini is great. For reasoning tasks, in some cases Opus does better; in some cases GPT-5 does better. For some reason, I always use Gemini for cooking questions. I don't know why.

I just always go to it for my basic questions, because I don't want to mess up the memory on the other two. I don't cook, so I don't use Gemini in that use case, but I do use Gemini a lot for current events and news, right? So you're using search less? You're just using Gemini? Almost, yeah. I can't remember the last time I used search. Right. When do you use search? I still use it. I use Google News too, but I just get AI Mode. So it's kind of the same thing you're getting, whether you Google it or not. You have to give Google a lot of credit.

I think what Demis and Sundar and those folks have done deserves a lot of accolades, because they are full stack all the way through. But they don't have the coding product, and it feels like it's all about coding right now. They are going to need to catch up on that. Yeah. You know, it's funny, I've been talking to you and Scott about all the major models and labs, and we haven't talked once about xAI or Grok, and the SpaceX IPO is upon us. Not because I don't think they're formidable. I actually think...

My God, Elon has such a massive role to play in the AI space. Everything from orbital data centers, which are going to be huge. The fact that he has the Colossus data centers, and has given Colossus 1 to Anthropic, has really helped the industry out. I think he's going to keep building out capacity. He'll be a very formidable hyperscaler at some point, in my opinion. It's so interesting that you said he's going to be a formidable hyperscaler.

Is he going to be a model company? They already are a great model company, but it's just different. They don't have the coding piece either. But, firstly, what they do with Cursor could meaningfully change that. Yeah. So I do feel like there's a huge upside for him with Cursor. But in addition to that, if you look at the way X uses Grok, I mean, it's pretty dope. Certainly a use case. Pretty dope, I like that. We'll end on that. Jeetu, wonderful to sit down and talk to you. Great to see you again. We'll talk again soon. Thank you guys for watching. Thank you to Sami in the control room and Jasmine on the production side.

Robert and Evan over there behind the cameras. Join us next week. Thanks again.
